An Architectural Framework for Run-Time Optimization

Matthew C. Merten, Andrew R. Trick, Ronald D. Barnes, Erik M. Nystrom, Christopher N. George, John C. Gyllenhaal, Wen-mei W. Hwu

Center for Reliable and High-Performance Computing
1308 West Main Street, MC-228, Urbana, IL 61801
{merten,atrick,rdbarnes,nystrom,c-george,gyllen,hwu}@crhc.uiuc.edu

Abstract

Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Run-time optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying run-time optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for run-time optimization because the hot execution regions are extracted, optimized, and written to main memory for execution, and because these regions persist across context switches. The current design of the framework supports a suite of optimizations including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.

1 Introduction

The development of out-of-order execution and automatic dynamic speculation has led to dramatic improvements in the performance of modern microprocessors. These techniques were the first steps toward allowing the microprocessor itself to determine how to execute code efficiently. Thus far, such hardware optimizations have been limited in scope and have typically been made on the fly without any persistent record. More aggressive and persistent optimizations have instead been accomplished in software through the use of optimizing compilers.



Figure 1: Hot Spot Detector and Trace Generation Unit depicted as an on-line filter.

However, many of these optimizations rely on accurate profile information to profitably transform code. While compilers often support the use of profile information, software vendors have been reluctant to add the profiling step to their development cycles. Not only is determining a representative profile difficult, but when being profiled, the behavior of certain programs may change. For these reasons, an automatic, transparent mechanism for profiling and reoptimizing code based on current usage would be advantageous. Furthermore, a dynamic system could improve performance in ways that a static optimizer cannot. As program behavior changes over time, a dynamic system could reoptimize code to take advantage of temporal relationships. While a typical static compiler optimizes for the average behavior across the entire execution, a dynamic system could lead to more directed optimizations. In addition, since optimization is performed on the same machine that runs the application, the optimizer can specifically target that system.

Hardware support for dynamic profiling and reoptimization reduces the reliance on software systems to perform these tasks. The use of dedicated, background hardware mechanisms keeps run-time overhead tightly limited. Our hardware system is designed to automatically and transparently profile applications while generating sets of traces that cover the frequently executed paths of an application. These traces can then be optimized and executed in place of the original blocks to improve application performance. Located off the back-end of the processor pipeline, the proposed system incurs negligible overhead and minimally impacts the design of the processor pipeline.


The mechanism itself can be viewed as a filtering system that accepts an instruction execution stream as input and produces instruction profiles and sets of traces as output. As shown in Figure 1, the filter consists of three primary components responsible for: collecting profile information, determining the execution coverage of the profiled instructions, and generating traces for the frequently executed paths.

During application execution, the Branch Behavior Buffer component determines the most frequently executed branch instructions while collecting a profile of their behavior. The Infrequent Branch Detection process eliminates infrequently executed branches from the table to filter out those that only spuriously execute. Meanwhile, the Hot Spot Detection Counter component is responsible for determining when execution is primarily confined to the collected branches. The Stream to Table Correlation Detection process compares the execution stream to entries in the table and forces a transition to Trace Generation Mode from Profile Mode when a comparison threshold is met. The set of branches in the table when the transition occurs serves as a skeleton of the important region of code, and is called a program hot spot. Because the detection process is performed very quickly at run time, a unique opportunity exists to optimize the currently executing code while leaving sufficient time to gain benefit from execution in the newly optimized code. Program hot spots provide an excellent opportunity because they can be quickly identified and contain only truly important code. Section 3 provides evidence that programs often exhibit phased hot spot execution behavior. The Branch Behavior Buffer and Hot Spot Detection Counter components are collectively referred to as the Hot Spot Detector [1], and are more thoroughly discussed in Section 5.

Using the profiles gathered by the Hot Spot Detector, a hardware component called the Trace Generation Unit [2] produces sets of traces that cover the frequently executed paths in the code. The Guided Branch Table Walk process utilizes the execution stream to walk the hot spot skeleton stored in the branch table to produce the traces. The use of the stream to step through the table entries eliminates the need to store individual instructions during the monitor phase and prevents entries in the table that were short-lived and no longer active from being included. Meanwhile, the use of the skeleton ensures that the traces are representative of the frequently executed instructions. The instructions in the output traces are generated in execution order, thus providing an inherent code-straightening optimization. The traces are linked together by their exit branches to provide continuous execution within the optimized hot spot. The output of the detector can be used to feed and direct a hardware-based reoptimization mechanism, or can serve as a profiling platform for a software-based reoptimizer. One direct application of this mechanism is to perform code optimizations that improve instruction fetch performance such as loop unrolling, partial function inlining, and branch promotion. The output traces are stored in a memory-based code cache, thus enabling a traditional instruction cache to fetch multiple blocks per cycle, and preserving optimized hot spots for continued execution. The Trace Generation Unit is discussed more thoroughly in Section 6.

2 Related Work

Control-flow profiles have been shown to provide invaluable information to an optimizer because they give insight into the frequently executed paths in the program. While profile-driven optimizations such as superblock scheduling [3] have demonstrated significant improvements in performance, software vendors have been reluctant to add a profiling step to their compilation path. In some cases, compilers have employed static branch prediction techniques to form estimated profiles with moderate success [4, 5]. More recently, a number of post-link-time optimization systems have been proposed that transparently and automatically profile and reoptimize programs. However, even when the cost of profiling is minimized, instrumentation-based profiling can still incur significant overhead [6]. Low-overhead methods of transparent profiling have been developed based on statistical sampling [7, 8, 9]. These approaches suffer from three primary drawbacks. First, the entire profile of each application must be continuously maintained at run time on the production system. Second, the profile represents only average behavior over an extended period of time. And third, the latency of detecting variations in the program's behavior can be great. The proposed Hot Spot Detector addresses these drawbacks by profiling only the most frequently executed instructions, those most likely to benefit from post-link optimization. The detector also thoroughly tracks instruction behavior over a short time window to provide an accurate and timely relative profile for each phase of execution.

Several more comprehensive profiling mechanisms have also been proposed, such as the Profile Buffer [10]. This hardware table resides in the retirement stage of the microprocessor and consists of a small number of entries used to thoroughly track branch behavior. This buffer only requires servicing by the operating system when it becomes full, thus minimizing its run-time overhead. The ProfileMe [11] system introduced two primary improvements over sampling-based methods. First, the system provides a means for attributing a variety of events to specific instructions, whereas previous event counter mechanisms could only pinpoint offending instructions within a handful of cycles. The ability to correlate events to instructions is a key factor for performing microarchitecture-specific optimization. Second, the system allows for comprehensive random profiling of pairs of instructions in order to better understand their pipeline interaction. Like these systems, the Hot Spot Detector is able to correlate branch behavior with specific instructions, but it also has the ability to automatically filter out the unimportant branches. Furthermore, the Hot Spot Detector only requires servicing by either a hardware or software optimizer when a run-time optimization opportunity has been detected.


Run-time profiling and optimization of applications promises to deliver higher performance than is currently available with static techniques. A number of software- or firmware-based dynamic optimization systems have emerged that optimize running applications, storing the optimized sequences into a portion of memory for extended execution. DAISY [12], FX!32 [13], and more recently Transmeta [14] and BOA [15], are systems designed to perform dynamic code translation from one architecture to another. Early versions of DAISY and FX!32 were primarily concerned with providing architectural compatibility, with subsequent enhancements targeting performance, while the Transmeta processors specifically utilize dynamic translation as a means for improving performance. Dynamo [16], Wiggens/Redstone [17], and Mojo [18] are systems designed to improve application performance on the same architecture. Similarly, just-in-time compilers [19] have been introduced to generate optimized code at run time from an intermediate representation such as Java bytecode. However, these systems often suffer from significant software overhead. Many of them utilize interpretation to comprehensively profile applications before generating optimized code sequences. Others generate poorly optimized versions with embedded profiling counters to accomplish the same goal. However, time spent in an interpreter or spent executing probed code is overhead that must be reclaimed when executing the optimized code in order to see a performance benefit. These systems also require a software executive to monitor and control the reoptimization process. While our mechanism uses a similar memory-based code storage technique, it uses a hardware structure that operates in parallel with the execution of the application for rapid profiling, analysis, optimization, and management.

Similar to Dynamo, the proposed Trace Generation Unit utilizes the real execution stream to select traces for optimization [20]. In Dynamo, instructions are initially interpreted while potential trace starting points are thoroughly profiled. Once an execution threshold for a starting point is reached, a trace is formed beginning at the starting point following the current execution stream. Our proposed mechanism improves upon Dynamo's method by eliminating the overhead associated with interpretation and profiling through performing the profiling in hardware. Because the profiling overhead is low, all branches can be efficiently profiled and used to guide the trace formation process.

By taking advantage of optimization opportunities chosen at compile time, dynamic compilation [21, 22] has been used to perform optimized code generation at execution time. Run-time information, particularly the consistency of values, is used by a software dynamic compiler to generate optimized code for regions which are annotated during compilation. These regions can be selected automatically through use of code analysis and profile information [23]. By utilizing hardware support, a similar technique, called Compiler-directed Computation Reuse [24], takes advantage of run-time consistent values to eliminate entire regions of redundant computation through result memoization. However, these techniques rely upon regions chosen at compile time and are limited by the effectiveness of the programmer or automated annotation mechanisms.
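The result-memoization idea can be illustrated with a small software analogue. This is only a sketch of the general technique, not the hardware scheme of [24]; the names `memoize_region` and `expensive_region` are hypothetical.

```python
def memoize_region(region_fn):
    """Reuse a region's live-out result when its live-in values repeat."""
    cache = {}  # live-in values -> live-out result

    def wrapper(*live_ins):
        if live_ins not in cache:
            cache[live_ins] = region_fn(*live_ins)  # first time: execute the region
        return cache[live_ins]                      # repeat: the region is elided
    wrapper.cache = cache
    return wrapper

@memoize_region
def expensive_region(x, y):
    # Stand-in for a redundant region whose result depends only on x and y.
    return sum(i * x + y for i in range(1000))
```

When run-time values are consistent, a repeated call with the same live-ins returns the cached result without re-executing the region, which is the effect the hardware achieves for entire code regions.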

Dynamic scheduling of instructions in hardware has been used to improve instruction-level parallelism at run time [25]. By reordering instructions, execution times ideally can be reduced to the computation height of the code sequence. Furthermore, the cost of long latency operations can be hidden by the concurrent execution of other instructions. However, the window from which instructions can be selected is typically small, and no record is kept of failed reordering attempts to prevent them in the future. In addition, since this form of dynamic scheduling is located in the processor's critical path, there is limited opportunity for more advanced optimizations.

One potential hardware platform for dynamic optimization is the trace cache [26]. This caching structure stores dynamically local basic blocks together into traces, allowing for the fetch of multiple blocks per cycle. Optimizations beyond block reordering in a traditional trace cache have been proposed [27], but these transformations have been limited to classical optimizations. Since all the instructions within a trace are likely to execute together, architectures where traces serve as the fundamental unit of work have been proposed. In the Trace Processor architecture [28], sequences of traces are predicted and dispatched as a unit. Since they are cached, fetched, and executed as a unit, they provide an opportunity for more aggressive optimization such as instruction rescheduling [29].

Because traces in the trace cache are short and often have brief lifetimes, the trace cache is a limited framework for dynamic optimization. However, a new mechanism called the Frame Cache [30] has been proposed that forms much longer, atomic traces by replacing highly biased branches with assertions [31]. Whenever an assertion fails, the result of the entire trace is thrown away and the architectural state is reverted to the state prior to execution of the trace. Execution resumes at the first instruction in the trace, but in unoptimized code. Essentially, the optimized traces are speculatively executed in similar fashion to those in the Trace Processor. Since the architectural state prior to execution of a trace can be restored upon a failed assertion, aggressive optimization ignoring infrequent obstacles, such as side exits, can be performed. While these frames are much larger than traces in a traditional trace cache, frames are still limited by the length of the cache line and by the number of unbiased branches allowed in the trace because the frames are stored in a hardware structure.
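The assertion-and-rollback semantics described above can be modeled in software. The following is an illustrative sketch with assumed interfaces (not the Frame Cache hardware): a biased branch becomes an assertion, and a failed assertion discards the frame's results and falls back to the original code.

```python
class AssertionFailed(Exception):
    """Raised when a branch promoted to an assertion goes the unexpected way."""

def run_frame(state, optimized_frame, fallback_code):
    """Execute an atomic frame; on a failed assertion, revert state and fall back."""
    snapshot = dict(state)           # architectural state before the frame
    try:
        optimized_frame(state)       # all-or-nothing: may raise AssertionFailed
        return "optimized"
    except AssertionFailed:
        state.clear()
        state.update(snapshot)       # throw away the entire frame's results
        fallback_code(state)         # resume at the frame's first instruction
        return "fallback"

def optimized_frame(state):
    if state["n"] <= 0:              # highly biased branch replaced by an assertion
        raise AssertionFailed
    state["r"] = state["n"] * 2      # straight-line, aggressively optimized body

def fallback_code(state):
    # Unoptimized path with the control flow kept intact.
    state["r"] = state["n"] * 2 if state["n"] > 0 else 0
```

The snapshot-and-restore step is what permits optimizations that ignore infrequent side exits: correctness is recovered by rollback rather than by preserving every rare path inside the frame.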

3 Program Execution Characteristics

Many applications exhibit behavior conducive to run-time profiling and optimization. For example, program execution often occurs in distinct phases, where each phase consists of a set of code blocks that are executed with a high degree of temporal locality. When a collection of intensively executed blocks also has a small static footprint, a highly favorable opportunity for run-time optimization exists. A set of branches that defines a collection of intensely executed blocks is referred to as a program hot spot. A run-time optimizer can take advantage of execution phases by isolating and optimizing the corresponding instruction blocks for a group of hot spots. Ideally, aggressively optimized code would be deployed early in its phase for use until execution shifts to another phase. Optimized hot spots that are no longer active may be discarded, if necessary, to reclaim memory space for newly optimized code. They may also be saved for later use for re-occurring hot spots.

Figure 2: Important branches executed in each execution sample for 134.perl. Each data point represents a branch that executed at least 40 times within the sample duration of 10,000 branches. These samples of 10,000 contiguous branches are taken once every 2,000,000 branches.
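One way to realize the discard-or-save policy just described is a least-recently-used code cache. The sketch below is an assumption-laden model (the class name, capacity parameter, and LRU policy are illustrative choices, not the paper's design): inactive hot spots are evicted first, while a re-occurring hot spot stays resident.

```python
from collections import OrderedDict

class CodeCache:
    """Memory-based cache of optimized hot-spot traces (LRU eviction sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.traces = OrderedDict()          # hot-spot id -> optimized trace

    def install(self, hot_spot_id, trace):
        self.traces[hot_spot_id] = trace
        self.traces.move_to_end(hot_spot_id)
        while len(self.traces) > self.capacity:
            self.traces.popitem(last=False)  # discard the least recently active

    def lookup(self, hot_spot_id):
        trace = self.traces.get(hot_spot_id)
        if trace is not None:
            self.traces.move_to_end(hot_spot_id)  # re-occurring hot spot stays warm
        return trace
```

Under this policy, a hot spot whose phase recurs keeps its optimized traces, while space is reclaimed from hot spots that have gone cold.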

3.1 Program Phasing

An example of phasing behavior can be seen in a perl interpreter (134.perl from the SPEC95 benchmark suite) running a word jumble script. As shown in Figure 2, this benchmark contains three primary, distinct phases of execution with one hot spot per phase. Hot spot 1 runs for 72 million instructions, hot spot 2 for 1.35 billion, and hot spot 3 for 200 million. The first hot spot involves reading in a dictionary and storing it in an internal data structure. The second hot spot processes each word in the dictionary, and the third scrambles a selected set of words in the dictionary. Figure 2 also indicates that, within each hot spot, the addresses of the intensely executed instructions are not typically clustered in one small address range, but instead have components widely scattered throughout the program.

The second hot spot serves as an excellent example of why run-time optimization is needed. The input script exercises perl's split routine, which breaks up an input word into individual letters, and sort, which sorts those letters. The first function, split, calls a complicated regular expression matching algorithm with an empty regular expression. Because execution consistently traverses a small number of paths within the functions that comprise the algorithm, this region of code would benefit from partial inlining and code layout, followed by classical optimization. A static compiler could perform these optimizations, but the larger code size and compile time would be wasted for most input scripts. The second function, sort, calls the library function qsort, which then calls a perl-specific comparison function. Less than half of the code in the comparison function is ever executed because only single characters are actually sorted. This is another example where inlining is important because of the very frequent calls to the comparison function. However, a link-time or run-time optimizer is needed to support inlining across library and application boundaries. In addition, if dynamically-linked libraries are used, inlining would have to be delayed until application load or run time. Figure 3 shows a branch profile for a typical 10,000-branch sample of perl's second hot spot. All of the branches that execute in the phase comprise its working set. However, the data reveals that an even smaller number of static branches accounts for the vast majority of the dynamic instances in the sample, for example branches 0 through 122. These intensely executed branches comprise the hot spot, and effective optimization can be limited to that portion of the working set.

Figure 3: Profile distribution for the second primary hot spot in 134.perl. Branches are sorted from most to least frequent.
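The skew visible in Figure 3 can be quantified by asking how many of the hottest static branches are needed to cover a given fraction of the dynamic branch count. A small sketch (the counts in the usage note are made up for illustration, not the paper's data):

```python
def branches_for_coverage(dynamic_counts, fraction):
    """Number of static branches (hottest first) covering `fraction` of executions."""
    counts = sorted(dynamic_counts, reverse=True)
    target = fraction * sum(counts)
    covered = 0
    for needed, count in enumerate(counts, start=1):
        covered += count
        if covered >= target:
            return needed
    return len(counts)
```

For a hypothetical sample with counts [400, 300, 200, 50, 30, 10, 5, 5], 90% dynamic coverage requires only the 3 hottest of the 8 static branches; it is this kind of concentration that lets optimization be limited to the hot-spot portion of the working set.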

3.2 Hot Spot Characteristics

Programs often contain some functions that appear in multiple hot spots. This commonly occurs with internal library functions that are called from different locations within the program. Naturally, the behavior of these functions may vary depending on the calling context. One such example is the function str_new from the perl interpreter. This function is present in three main hot spots and has three different profiles, each of which was gathered by the mechanism described in the next section. Figures 4a-c depict these three versions and are annotated with profile weights collected from the Hot Spot Detector. The dark arrows and blocks indicate the important edges and basic blocks as determined by the profile weights. Note that the x867 branch from block E is missing from the profile for Hot Spot 3. This situation results from contention for resources within the hardware detector. For blocks that end in a branch, the block weight is the branch execution weight, while for other blocks, the weight is derived from the known input arcs. The profile of Hot Spot 1 reveals that branch x829 in block A always branches to block C. Branch x829 decides whether a previously freed string is available from the string free list, or whether a new string must be created. For Hot Spot 1, the free string list is empty each time the function is called. The input script begins by reading a dictionary file and creating new strings for the words. This operation requires no string deletions and hence no strings are added to the free list. The opposite is true for Hot Spot 2, where a free string is always available from the free list.

Figure 4: Different detected profiles for the str_new function from three different hot spots within 134.perl.

In an effort to better understand the composition of hot spots, another important hot spot was dissected, exposing its control flow structure. A portion of the primary hot spot from a lisp interpreter (130.li from SPEC95) running the training input script is depicted in the control-flow graph in Figure 5. This hot spot represents 45% of the program's dynamic execution. Control enters the shown portion of the hot spot at point A in function evform, then performs nested calls to xlgetvalue, xlobgetvalue, and xlygetvalue before exiting the hot spot at point C. Black boxes and arrows indicate the intensely executed paths (those that are part of the hot spot), while lighter boxes and arrows indicate less frequently executed, or cold, blocks and paths. This entire hot spot consists of 81 branches spanning approximately 368 static instructions. The branch profile indicates a primary path through the hot spot, as 47 of the 56 static branches (9.2M of 10.9M dynamic branches) have highly consistent dynamic branch direction (greater than 90% in one direction). The hot spot is not simply a tight loop; rather, control proceeds through a number of functions with minimal inner loops. Notice, for instance, that the loops marked by back-edges B, D, and E iterate only a few times, if at all, during each invocation of the hot spot.

Figure 5: Basic block flow graph for selected functions in a dominant hot spot in 130.li.

The hot spot behavior witnessed in the lisp interpreter generalizes to other programs as well. In many of the hot spots detected across a variety of benchmarks, the dynamic number of taken branches does not heavily outweigh the number of fall-through branches, as would be the case for inner loops with little internal control flow. While some hot spots do contain such loops, many also have much more complex control flow. Call and return instructions together average about 15% of the dynamic control flow in the examined programs. While these control transfers are often highly predictable, their frequency presents a barrier to wide fetch and optimization. Many of the benchmarks have a fair number of unconditional jumps, representing, on average, about 5% of dynamic control flow instructions. These instructions perform no control decisions and are obvious candidates for optimization. Indirect jumps and indirect calls do not make up a large portion of the branches, but are frequent enough to become hazards to long trace formation for some benchmarks. Inlining the potential target may allow for wide fetch across the indirect jump or call. Finally, only about 35% of the dynamic control flow instructions fall through to sequential instruction addresses. Consequently, traditional fetch architectures that break fetches at taken branches will often be limited to one basic block per fetch.

Figure 6: Instruction supply and retirement portions of the processor with the hot spot detection and trace generation hardware.

4 Architecture Overview

Our proposed architecture, presented in Figure 6, consists of two primary hardware components called the Hot Spot Detector and the Trace Generation Unit. Both of these hardware components are located off the critical path of the processor core in the retirement stage of the pipeline. The Hot Spot Detector monitors retiring instructions, searching for a hot spot. Once a hot spot is found, the Trace Generation Unit tracks the retirement of subsequent instructions, utilizing the retirement stream to construct traces that fall within the bounds of the detected hot spot. These generated traces collectively contain a vast majority of the frequently executed instructions for a particular phase of program execution. Once generated, traces can be optimized and are written into a code cache for execution in lieu of the original program. The code cache resides in instruction memory so that program execution can occur through the traditional fetch and execute mechanisms.

The optimization process begins as the Hot Spot Detector searches for program hot spots as they emerge during execution. Hot spot detection consists of two parallel tasks which operate on a hardware profiling table. First, retired branches are examined and entries are maintained in the profiling table for the most frequently executed branches. Often, an infrequently executed branch will retire. Such branches, however, will be entered into the table for a short time while their importance is monitored, but will later be evicted from the table. The profiler retains the frequently executed branches, which are called candidate branches.

The second task is to determine how closely the collected behavior stored in the table matches the actual instruction stream. This is accomplished by tracking the ratio of candidate branches to non-candidate branches in the stream. As more candidate branches are detected and placed into the profiling table, the ratio of candidate branches to non-candidates increases. When the ratio exceeds a set threshold and the ratio is sustained for a set period of time, the detector declares that a hot spot has been identified and freezes all of the entries in the profiling table. Upon detection, the Trace Generation Unit is activated to construct traces based on the contents of the profiling table. However, if the ratio does not exceed the threshold quickly enough, the entire profiling table will be flushed and the detection process will be restarted. During a flush, a number of threshold values and parameters in the detector could be dynamically updated to alter the hot spot search criteria.

Figure 7: Process for using branch profile information to form traces.
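This candidate-ratio test can be modeled as a saturating counter driven by the retirement stream. The increment, decrement, threshold, and refresh interval below are illustrative parameters chosen for the sketch, not the values of the actual detector:

```python
class HotSpotDetectionCounter:
    """Behavioral model: candidate-branch hits push the counter toward 'hot'."""

    def __init__(self, inc=2, dec=1, threshold=64, refresh_interval=4096):
        self.inc, self.dec = inc, dec
        self.threshold = threshold
        self.refresh_interval = refresh_interval
        self.reset()

    def reset(self):
        self.counter = 0
        self.branches_seen = 0

    def retire_branch(self, is_candidate):
        """Called once per retired branch; returns 'hot', 'flush', or None."""
        self.branches_seen += 1
        if is_candidate:
            self.counter += self.inc      # execution is inside the table's branches
        else:
            self.counter = max(0, self.counter - self.dec)
        if self.counter >= self.threshold:
            return "hot"                  # freeze the table, start trace generation
        if self.branches_seen >= self.refresh_interval:
            self.reset()
            return "flush"                # ratio never sustained: restart detection
        return None
```

Because non-candidate branches decrement the counter, the threshold is only reached when candidate branches dominate the stream for a sustained stretch, which approximates the ratio-plus-duration condition described above.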

As described, the detector is designed to find, track, and profile the most frequently executed branches. Figure 7a depicts branch entries stored in the profiling table as gray circles. While profiling, the branch execution and direction weights were accumulated to provide behavior information about the branches during the current execution phase. The important, or hot, branches and directions are shaded and comprise a skeleton of the frequently executed region of code. The task of the Trace Generation Unit is to use the skeleton to determine the instruction blocks associated with the branches, as shown in Figure 7b, and to construct a new set of blocks covering the frequently executed ones, as shown in Figure 7c. While a compiler can perform control-flow analysis over an entire code region to identify and extract traces, it is impractical for a hardware mechanism to do this because of time and space requirements. Instead, the Trace Generation Unit watches the sequences of subsequently retired instructions and constructs traces from those sequences that conform to the bounds of the detected hot spot. Furthermore, the Trace Generation Unit must be able to form good quality traces even though some of the important branches may be missing from the table. Because the detector can only profile a finite number of branches, a few important branches can be missing from the profile due to contention for table entries.
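A much-simplified model of this stream-guided construction follows. The real unit walks table entries, links traces by their exit branches, and tolerates missing skeleton entries; here a trace is just the sequence of in-skeleton branch addresses observed at retirement, and the length limit is an assumed parameter:

```python
def generate_traces(retired_branches, skeleton, max_trace_len=16):
    """Build traces from the retirement stream, bounded by the hot-spot skeleton."""
    traces, current = [], []
    for branch_pc in retired_branches:
        in_spot = branch_pc in skeleton
        if in_spot and len(current) < max_trace_len:
            current.append(branch_pc)      # still inside the detected hot spot
        else:
            if current:
                traces.append(current)     # leaving the hot spot ends the trace
            current = [branch_pc] if in_spot else []
    if current:
        traces.append(current)
    return traces
```

Using the live stream this way means only paths that actually execute after detection become traces, so short-lived table entries never contribute, while the skeleton keeps every trace within the detected hot spot.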

Traces can be optimized either during or following the trace generation process. This work specifically uses the detected hot spots to guide a code straightening and layout optimization that targets high-performance instruction issue.

5 Hot Spot Detector Architecture

The Hot Spot Detector is designed to be a general mechanism for finding code regions that would derive the most benefit from run-time optimization. The region selection process is guided by three criteria. First, the region must have a small static code size to facilitate rapid optimization. Second, the instructions in the code region must be active over a minimum time interval so an opportunity exists to benefit from run-time optimization. Third, the instructions in the code region must account for a large majority of the total executed instructions during its active time interval.

The first step in the process of identifying hot spots is to detect the frequently executing blocks of code. Employment of a hardware scheme eliminates the time overhead of software profiling and allows detection to be transparent to the system. The Hot Spot Detector collects the frequently executing blocks by gathering the branches that define their boundaries, as well as their relative execution frequency and bias direction. Though not explicitly constructed, a control flow graph with edge profile weights can be inferred from the collected branch execution and direction information.

Relative to latencies within the processor core, the Hot Spot Detector can tolerate a large latency before recording information about program execution. For this reason, our proposed hardware is off the critical path and gathers the required information from the retirement stage. This serves the dual purpose of limiting adverse effects on the processor's timing and avoiding the need to handle updates from speculative instructions.

5.1 Branch Behavior Buffer

To implement hot spot detection in hardware, we use a cache structure called a Branch Behavior Buffer (BBB). The purpose of the BBB is to collect and profile frequently executed branches whose corresponding blocks account for a vast majority of the dynamically executing instructions. Depicted in Figure 8, the BBB is indexed on branch address and contains several fields (detector fields shown in white): tag (or branch address), branch execution count, branch taken count, and branch candidate flag.

When the processor retires a branch instruction, the BBB is indexed by the branch's instruction address. If the branch address is found in the BBB, its execution counter is incremented. If the branch is taken, the taken counter is also incremented. To prevent the execution counter from rolling over, the counter saturates. When

Figure 8: Hot Spot Detector hardware. Hot Spot Detector fields shown in white; additional fields for trace layout shown in gray. (Each BBB entry holds a branch tag, execution counter, taken counter, and candidate flag, plus code cache taken and fall-through addresses with valid bits, a call ID, and a touched bit. The Hot Spot Detection Counter is a saturating adder that decrements by D on candidate branches and increments by I otherwise, signaling a detected hot spot at zero. A refresh timer and a reset timer control entry invalidation.)

the execution counter saturates, the taken counter is no longer incremented in order to preserve the ratio of the taken branch counter to the executed branch counter. As long as the number of branches that reach saturation is small, the profiles will still reflect the relative importance of the branches collected.
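The BBB update performed at branch retirement can be captured in a small software model. This is a minimal sketch, not the paper's exact design: the counter widths, the candidate threshold, and the direct-mapped 64-entry organization are illustrative assumptions.

```python
# Minimal software model of Branch Behavior Buffer (BBB) updates at branch
# retirement. The counter widths, the candidate threshold, and the
# direct-mapped, 64-entry organization are illustrative assumptions.

EXEC_MAX = 2**9 - 1       # assumed 9-bit saturating execution counter
CAND_THRESHOLD = 2**3     # assumed candidate threshold (bit 3 of counter set)

class BBBEntry:
    def __init__(self, tag):
        self.tag = tag          # branch address
        self.exec_cnt = 0       # branch execution count (saturating)
        self.taken_cnt = 0      # branch taken count
        self.candidate = False  # candidate flag

class BranchBehaviorBuffer:
    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def retire_branch(self, addr, taken):
        idx = addr % self.num_entries
        entry = self.entries[idx]
        if entry is None:
            # free slot: temporarily allocate the new branch
            self.entries[idx] = entry = BBBEntry(addr)
        elif entry.tag != addr:
            # a branch that conflicts with an existing entry is discarded
            return
        if entry.exec_cnt < EXEC_MAX:
            entry.exec_cnt += 1
            if taken:
                # once the execution counter saturates, the taken counter
                # is frozen too, preserving the taken/executed ratio
                entry.taken_cnt += 1
        if entry.exec_cnt >= CAND_THRESHOLD:
            entry.candidate = True

bbb = BranchBehaviorBuffer()
for _ in range(10):
    bbb.retire_branch(0x4000, taken=True)
bbb.retire_branch(0x4040, taken=True)   # maps to the same slot: discarded
entry = bbb.entries[0x4000 % 64]
print(entry.exec_cnt, entry.taken_cnt, entry.candidate)  # 10 10 True
```

Note how the conflicting branch at the hypothetical address 0x4040 is simply dropped, mirroring the no-replacement policy described below.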

A replacement policy must exist for when a branch address is not found in the BBB and the entry to which the branch indexes is currently in use. Since the BBB must accurately record the most frequently executed branches and their profiles, rather than the most recently accessed, it would be unacceptable to implement an LRU replacement policy and allow a rare branch to replace a frequently executed branch. Instead, branches that conflict with existing BBB entries are simply discarded. The function of branch replacement is controlled by periodically invalidating some entries.

The entry of branches into the BBB proceeds as follows. When a branch is seen for the first time, its behavior is unknown, making it necessary to give the branch a trial period, gather an initial profile, and determine its likely importance. If an entry at its index is available, the branch is temporarily allocated into the BBB and profiled over a short interval called a refresh interval. Its execution counter must surpass a threshold called the candidate threshold to avoid having its BBB entry invalidated at the next refresh. A branch that surpasses the candidate threshold is called a candidate branch, for which the candidate flag in its BBB entry is set, and its entry will not be invalidated at the next refresh.

The refresh interval is implemented using a simple global counter called a refresh timer that increments each time a branch instruction is executed. When the refresh timer reaches a preset value, all BBB entries for branches that have not yet surpassed the candidate threshold are invalidated. Refreshing the BBB flushes the insignificant entries and ensures that each branch marked as a candidate accounts for at least a minimum percentage of the total dynamic branches during a fixed interval. The minimum percentage of execution required of candidate branches can be expressed as a candidate ratio. Thus,

candidate ratio = 2^b / 2^n = 2^(b - n)   (1)

given that the size of the refresh timer is n bits, and that the candidate flag is marked when bit b of the execution counter is set.
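As a concrete check of Equation 1 (with assumed parameter values, not the paper's actual configuration):

```python
# Numeric check of the candidate ratio in Equation 1, with assumed
# parameter values: the refresh timer is n bits wide, and a branch becomes
# a candidate when bit b of its execution counter is first set, i.e. after
# 2**b executions within one refresh interval of 2**n retired branches.
n = 16  # refresh timer width in bits (assumed)
b = 9   # execution-counter bit that sets the candidate flag (assumed)

candidate_threshold = 2**b     # executions needed within one interval
refresh_interval = 2**n        # branches retired per refresh interval
candidate_ratio = candidate_threshold / refresh_interval
print(candidate_ratio)  # 0.0078125, i.e. 2**(b - n)
```

With these values, a branch must account for at least 1/128 of the retired branches in an interval to survive the refresh.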

Since entries for low weight branches are invalidated at each refresh, the BBB only needs to be large enough to hold the candidate branches for a hot spot plus a number of potential candidate branches. If the BBB is too small, the initial allocation of entries to insignificant branches will delay the entrance of important branches into the BBB. Statistically, the important branches will eventually get entries after subsequent refresh intervals, but the profile accuracy could be compromised and the reporting of hot spots delayed.

A more serious conflict occurs when two important branches index into the same cache location. Many of these conflicts can be eliminated by making the BBB set-associative. However, some conflicts will still exist. As long as the remaining conflicts are relatively rare, a run-time optimizer can be designed to infer the presence of the missing branches. If required, even the profiles for these branches may be inferred.

5.2 Hot Spot Detection Counter

Once candidate branches have been identified in the Branch Behavior Buffer, they must be monitored to determine whether the corresponding blocks may be considered a hot spot and, thus, useful for optimization. We define a threshold on the minimum percentage of candidate branches that must execute over a time interval as the threshold execution percentage and define the actual percentage of candidate branches executed over an interval as the candidate execution percentage. Two criteria must be satisfied before a group of candidate blocks is declared a hot spot. First, the candidate execution percentage should equal or surpass the threshold execution percentage. Second, this high candidate execution percentage should be maintained for some minimum amount of time. When the two criteria are met, a detection occurs.

In order to minimize disruption of the system during the hot spot detection process, we track the behavior of the candidate branches in hardware using a Hot Spot Detection Counter (HDC), shown in Figure 8. The Hot Spot Detection Counter is implemented as a saturating up/down counter that is initialized to its maximum value. It counts down by D for each candidate branch executed and counts up by I for each non-candidate branch executed, where the values of D and I are determined by the desired threshold and will be discussed later. When the


candidate execution percentage exceeds the threshold execution percentage, the counter begins to move down because the total amount decremented exceeds the amount incremented. If the candidate execution percentage remains higher than the threshold for a long enough period of time, the counter will decrement to zero. At this point, a hot spot has been detected, and a software run-time optimizer may be invoked via the operating system, or a hardware run-time optimizer, such as the Trace Generation Unit described in the next section, may be enabled.
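The HDC's behavior follows directly from this description. The sketch below is a software model under assumed parameters (a 12-bit counter with D = 2 and I = 1, giving a 1/3 threshold); it is illustrative, not the evaluated hardware configuration.

```python
# Software model of the Hot Spot Detection Counter (HDC): a saturating
# up/down counter initialized to its maximum value. It moves down by D per
# retired candidate branch and up by I per non-candidate branch; reaching
# zero signals a detected hot spot. The counter size and the I, D values
# are assumed example parameters (threshold = I / (I + D) = 1/3 here).
HDC_MAX = 4095  # assumed 12-bit counter
D, I = 2, 1

def detect_hot_spot(candidate_stream):
    """Return the number of retired branches until detection, else None."""
    hdc = HDC_MAX
    for i, is_candidate in enumerate(candidate_stream):
        if is_candidate:
            hdc = max(0, hdc - D)
        else:
            hdc = min(HDC_MAX, hdc + I)
        if hdc == 0:
            return i + 1
    return None

# 60% candidate execution (above the 1/3 threshold): detected.
print(detect_hot_spot(([True] * 3 + [False] * 2) * 3000))   # 5117
# 20% candidate execution (below the threshold): never detected.
print(detect_hot_spot(([True] + [False] * 4) * 3000))       # None
```

The second stream shows the counter saturating at its maximum, so the sporadic candidates never drive it to zero.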

The difference between the candidate execution percentage and the threshold execution percentage determines the rate at which the counter decrements (i.e., the rate at which the hardware identifies the hot spot). This corresponds to our observation that hot spots become more desirable as they either account for a larger percentage of total execution or run for a longer period of time. It is assumed that hot spots which have been active over a longer period of time are less likely to be spurious in their execution and are more likely to continue to run after optimization has been completed.

Our experiments have shown this approach to work quite well and to detect all of the major hot spots in our benchmarks. There are three primary scenarios where there is no hot spot to be found, and thus the HDC will never reach zero:

1. Few branches execute with sufficient frequency to be marked as candidates, and collectively, they do not constitute a large percentage of the total execution. Thus, even if they were classified as a hot spot and optimized, only a small benefit is likely to materialize.

2. The number of branches that execute frequently enough to be considered candidates is too large to fit into the BBB. If the branches that are able to enter the BBB do not account for a large enough percentage of execution, they will not be identified as a hot spot. This may happen if the execution profile of the region is very flat. Although some benefit may be gained by optimizing all the frequent branches, the overhead of optimizing such a large region would most likely be prohibitive.

3. The execution profile is not consistent. In this case, a small set of branches may account for a large percentage of execution over a short time, but execution shifts to a different region of code before the Hot Spot Detection Counter saturates. Optimizing a region of code that only executes spuriously is unlikely to yield much benefit.

In each of these scenarios, some branches were executed frequently enough to warrant candidate status and therefore consideration for inclusion in a hot spot and further profiling. However, if the collection of candidate branches does not materialize into a hot spot after continued tracking, all branches must be cleared in order to begin a fresh detection process. In other words, some branches may have once been important, but now are cluttering the detection attempt. Therefore, the BBB will be periodically purged by the reset timer to make room for new branches. This timer is similar to the refresh timer but clears all entries in the BBB, including candidate branches. The reset interval should be large enough to allow the HDC to saturate on valid hot spots but small enough to allow quick identification of a new phase of execution.

It should be noted that the weights of the blocks and edges are only valid within a particular hot spot. Although this profile information is useful for inferring a control flow graph for a particular hot spot, profiles of different hot spots cannot be meaningfully compared. For instance, even if one hot spot has block weights twice that of another hot spot, the first hot spot is not necessarily executed twice as often as the second. These weights are primarily affected by two factors: the number of refresh periods required to detect the hot spot, and the frequency of the particular instructions within the hot spot. The greater the number of refreshes required to detect a hot spot, the longer the profiles are allowed to accumulate, and the greater the weights will be.

5.3 Hot Spot Detection Parameters

Once the threshold execution percentage required for a hot spot has been selected, the HDC increment and decrement values should be chosen. D is the decrement value applied when a candidate branch is encountered (candidate hit), and I is the increment value for a branch that is not in the table or is not yet marked as a candidate (candidate miss). Let X be the actual candidate execution percentage. For a given D and I, the counter will decrease when the candidate execution percentage multiplied by the decrement value is greater than the percentage of non-candidates multiplied by the increment value. This is represented by the equation:

D × X > I × (1 − X)   (2)

Rearranging the terms and solving for X yields the formula for the minimum percentage:

X > I / (I + D)   (3)

Equation 3 shows that the counter decreases when the percentage of candidate execution is above the threshold, as determined by I and D.

Given the increment and decrement values, the size of the HDC can be chosen to achieve a minimum detection latency. Let N be the minimum number of branches executed before a hot spot is detected. For detection to occur, the following inequality must hold:

N × (D × X − I × (1 − X)) ≥ HDC size   (4)


Thus, the latency for detecting a hot spot is determined by the following equation:

N = HDC size / ((D + I) × X − I)   (5)

As the candidate execution percentage further surpasses the threshold, the detection latency decreases. The latency can also be decreased independently of the candidate execution percentage by increasing I and D such that the threshold I / (I + D) remains constant.
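Equations 3 through 5 can be checked numerically. The parameter values below (I, D, X, and the counter size) are assumed examples, not the paper's evaluated configuration:

```python
# Numeric check of Equations 3-5 using assumed example parameters.
I, D = 1, 2           # HDC increment / decrement values (assumed)
hdc_size = 4095       # assumed initial (maximum) HDC value

threshold = I / (I + D)               # Equation 3: minimum X for detection
X = 0.5                               # assumed candidate execution fraction
N = hdc_size / ((D + I) * X - I)      # Equation 5: detection latency

print(round(threshold, 4))  # 0.3333
print(N)                    # 8190.0 branches until detection
# Consistency with Equation 4: N branches at a constant X produce a net
# decrement of exactly hdc_size.
print(N * (D * X - I * (1 - X)) == hdc_size)  # True
```

Doubling both I and D would leave the threshold at 1/3 but halve N, matching the latency observation above.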

6 Trace Generation Unit Architecture

After hot spot detection, the Branch Behavior Buffer contains a set of frequently executed branches whose blocks constitute a large fraction of the current overall instruction execution. This section will detail a hardware-driven mechanism that is capable of automatically extracting trace regions whose dynamic behavior closely matches the hot spot captured within the BBB and forming traces from that region to permit optimization. This hardware, referred to as the Trace Generation Unit (TGU), adds additional system requirements: an expanded BBB, some associated registers and control logic, and a few pages of reserved virtual address space for each process. Like the BBB, the TGU is not sensitive to latency (it may lag behind actual instruction retirement) and should have little effect on the processor's critical path.

The TGU remains idle until the hot spot detection hardware detects a suitable program hot spot. Following detection, the TGU is enabled for a short period of time, during which it constructs a set of traces from the stream of instructions retired by the processor. The TGU spends most of this time operating in a passive mode, scanning the retired instructions for branches that match those captured by the BBB during profiling. The TGU actively generates traces in short bursts (experiments show 0.005% of total execution time), writing the instructions to a code cache that resides in the virtual memory space of the process being optimized. Other than a potential slowdown during this active phase of trace generation, the TGU is non-intrusive to program execution. Because virtual memory is used to contain the optimized code, standard paging and instruction caching mechanisms allow the translations to persist across context switches. The code cache pages can be allocated by the operating system at process initialization time and marked as read-only executable code.

6.1 Code Deployment

Because of code self-checks, a criterion for our system is that the original code cannot be altered in any way. Therefore, any optimizations performed on the program are applied only to the generated hot spot traces. For the same reason, a seamless mechanism is needed for transferring execution into the code cache instead of changing the branch targets in the original code. The Branch Target Buffer (BTB) can be used to facilitate these control transfers. This structure associates a branch instruction with its taken target by storing the branch's target address. Similarly, our system utilizes the address field in the BTB to store the taken target's location in the code cache. Therefore, all entry point transitions in our system must occur along taken branch paths.

After each new trace is constructed, an entry point record for the trace is written to a list located in the first page of the code cache. The record associates the entry point branch in the original code with its target placed in the code cache. During the trace generation process, a timer signals the end of trace generation for the current hot spot. At that time, a routine is initiated to install the list of entry points into the BTB. For each entry point, the BTB target for the entry point branch is updated with the address of the entry point target in the code cache. An entry point bit is also set in the BTB to lock the entry in place until a BTB flush. After a context switch, the same routine can be invoked to reinstall the entry points into the BTB on a per-process basis. No new hardware is required, other than that needed to update BTB entries and to selectively ignore branch address calculations for branches that have the entry point bit set.
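The deployment scheme amounts to a BTB whose entries can be pinned and retargeted at the code cache. A minimal sketch follows; the dict-based organization, method names, and addresses are illustrative assumptions, not the hardware design:

```python
# Sketch of BTB-based code deployment: an entry-point branch keeps its
# original address as the tag, but the stored taken-target points into the
# code cache, and an entry-point bit pins the entry until a BTB flush.
# The dict-based organization and field names are illustrative assumptions.

class BTB:
    def __init__(self):
        self.table = {}  # branch address -> (taken target, entry-point bit)

    def install_entry_points(self, entry_point_records):
        # run after trace generation completes, and again after a context
        # switch to reinstall the entry points on a per-process basis
        for branch_addr, code_cache_target in entry_point_records:
            self.table[branch_addr] = (code_cache_target, True)

    def predict_taken_target(self, branch_addr):
        return self.table.get(branch_addr, (None, False))[0]

    def flush(self):
        self.table.clear()  # clears pinned entry points as well

btb = BTB()
# entry-point record list as written to the first page of the code cache
# (hypothetical addresses):
btb.install_entry_points([(0x400100, 0x7F0000)])
print(hex(btb.predict_taken_target(0x400100)))  # 0x7f0000
```

When the pinned entry is hit, fetch is steered into the code cache without the original code ever being modified.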

Self-modifying code (code that writes into its own code segment) presents a challenge to all dynamic optimization systems. Modifications made to the original code segment must also be reflected in the optimized traces. To prevent optimized code from becoming inconsistent with the behavior of the modified original code, either instructions that have been modified must be updated or traces that contain such instructions must be flushed. Since it is not generally possible to locate all modified instruction locations in the optimized code, systems like Dynamo or those that employ a trace cache flush their entire contents when self-modification is detected. Our system must also immediately flush the contents of the code cache and return to unoptimized code.

6.2 Trace Generation Overview

As the TGU writes instructions into the code cache, it performs two important functions. First, it creates connected regions of code that embody the detected hot spot and defines entry points to those regions. It does this in such a way that if program control enters a hot region at a selected entry point, control will likely remain inside the region for a significant length of time. Second, the TGU automatically performs code straightening along the most frequently executed paths.

The process of copying an instruction into the code cache is referred to as remapping the instruction. If the copied instruction is a branch, this process can involve changing the target and possibly the sense of the copy within the code cache. Thus remapped instruction or remapped branch refers to an instruction within the code cache. If a branch has been remapped in its taken direction, then an instance of the same static branch has been placed into the code cache and the instance has had its taken target changed to point to a location in the code cache.

Figure 9: Trace generation modes with rules for mode-altering transitions. (a) The state machine over the Profile, Scan, Fill, and Pending modes. (b) The transition rules: (1) executed a suitable entry-point branch (New Entry Point); (2) both paths from the executed candidate branch already placed (End Trace); (3) executed a non-candidate branch and the maximum allowed off-path branches was exceeded (End Trace); (4) executed return target mismatches the call site (End Trace); (5) recursive context (End Trace); (6) executed direction of the candidate branch already placed, but the other direction not placed (Merge); (7) executed candidate branch matches the pending target (Continue); (8) executed a non-candidate branch (Cold Branch).

The collection of traces created for each hot spot is called a trace set. Although an individual trace may contain internal branches as well as branches to other traces in the same set, it never transfers control directly to code in a different trace set. Therefore, traces generated from a hot spot form a self-contained region of code that can be independently optimized, deployed, and removed if necessary.

To assist on-the-fly trace formation, additional fields have been added to the BBB, as shown shaded in gray in Figure 8. The code cache taken address and fall-through address fields are used to hold offsets into the code cache at which code following the corresponding branch direction has been placed. Storing the code cache addresses for the branch targets in the BBB allows the TGU to link important branches from a trace directly to the target of a previously remapped instruction in the same trace set. Valid bits for each target indicate whether or not that path has already been generated. By marking the paths that have already been generated, the TGU avoids creating redundant traces. A callID field also aids trace generation by tagging the target fields to a particular calling context. This prevents linking code from different contexts together, a problem that is discussed in Section 6.7. Finally, a touched bit is added to support the backtracking operation described in Section 6.6.

6.3 Trace Generation Control Logic

This section describes the operation of the state machine in Figure 9a that controls trace generation. More precise treatment of the trace generation algorithm is provided in [2]. Figure 9b illustrates the decision rules used for each transition arc in the state machine. The trace formation process consists of the following four modes of operation:

- Profile Mode: Search for and detect a hot spot.

- Scan Mode: Search for a trace entry point. This is the initial mode following hot spot detection.

- Fill Mode: Construct a trace by writing each retired instruction into the code cache.

- Pending Mode: Pause trace construction until a new path is executed.

Table 1: TGU actions based upon state and BBB entry contents. In Scan Mode, a taken candidate branch with neither direction remapped triggers Rule 1 (record the entry point location in the BBB; transition to Fill Mode); by default the TGU remains in Scan Mode. In Fill Mode: a non-candidate branch that exceeds the maximum allowed off-path branches ends the trace by branching to the original code (Rule 3, to Scan Mode); a candidate branch in a recursive context likewise ends the trace (Rule 5, to Scan Mode); a candidate branch from a different calling context is placed and its code cache target location in the BBB overwritten (Rule 9, stay in Fill Mode); a candidate branch from the same context whose executed direction is not yet remapped is placed and its target location recorded (Rule 10, stay in Fill Mode); if the executed direction is remapped but the other is not, the trace is linked to the code cache target and the pending target is set to the other direction (Rule 6, to Pending Mode); if both directions are remapped, the trace ends by linking to the code cache targets (Rule 2, to Scan Mode); a mismatched return ends the trace with a return (Rule 4, to Scan Mode); by default, the instruction is placed and the TGU stays in Fill Mode. In Pending Mode, a candidate branch that matches the pending target resumes filling (Rule 7, to Fill Mode); a non-candidate branch ends the trace by branching to the original code (Rule 8, to Scan Mode); by default the TGU remains in Pending Mode.
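The mode transitions can be expressed as a small state machine. The sketch below is a simplified software model covering Rules 1-3 and 6-8 only; the BBB is reduced to per-branch remap state, instruction placement is abstracted away, and all names and parameters are assumptions.

```python
# Simplified model of the TGU mode state machine (Scan, Fill, Pending).
# Only a subset of the transition rules is modeled; actual instruction
# placement into the code cache is abstracted away.

MAX_OFF_PATH = 1  # maximum sequential off-path (non-candidate) branches

class TGU:
    def __init__(self, candidates):
        self.candidates = candidates   # set of candidate branch tags
        self.remapped = {}             # tag -> set of remapped directions
        self.mode = "Scan"             # initial mode after hot spot detection
        self.off_path = 0
        self.pending_target = None

    def retire_branch(self, tag, taken, taken_tgt, fall_tgt):
        executed_tgt = taken_tgt if taken else fall_tgt
        if self.mode == "Scan":
            # Rule 1: taken candidate branch not yet used in a trace
            if taken and tag in self.candidates and not self.remapped.get(tag):
                self.remapped.setdefault(tag, set()).add(True)
                self.mode = "Fill"
        elif self.mode == "Fill":
            if tag not in self.candidates:
                self.off_path += 1
                if self.off_path > MAX_OFF_PATH:    # Rule 3: end trace
                    self.mode, self.off_path = "Scan", 0
                return
            self.off_path = 0
            dirs = self.remapped.setdefault(tag, set())
            if taken in dirs and (not taken) in dirs:
                self.mode = "Scan"                  # Rule 2: end trace
            elif taken in dirs:
                # Rule 6 (Merge): pend on the un-remapped direction's target
                self.pending_target = fall_tgt if taken else taken_tgt
                self.mode = "Pending"
            else:
                dirs.add(taken)                     # fill along executed path
        elif self.mode == "Pending":
            if tag in self.candidates and executed_tgt == self.pending_target:
                self.mode = "Fill"                  # Rule 7 (Continue)
            elif tag not in self.candidates:
                self.mode = "Scan"                  # Rule 8 (Cold Branch)

tgu = TGU(candidates={"C"})
tgu.retire_branch("C", True, "blk_A", "blk_D")   # Rule 1: Scan -> Fill
tgu.retire_branch("C", True, "blk_A", "blk_D")   # Rule 6: Fill -> Pending
tgu.retire_branch("C", False, "blk_A", "blk_D")  # Rule 7: Pending -> Fill
print(tgu.mode)  # Fill
```

The three calls mimic a loop back edge followed by a loop exit: the second taken execution of C pends, and the fall-through resumes filling.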

When the Hot Spot Detector signals that a new hot spot is available, the TGU transitions from its idle Profile Mode to Scan Mode. In Scan Mode, the TGU performs a BBB lookup for each retired taken branch instruction. If the retired branch is listed in the BBB as a candidate branch that has not been used in a trace, the TGU initiates a new trace. This is accomplished by setting the current trace entry point to the next available code cache offset, storing this offset in the code cache taken field in the BBB, and transitioning to Fill Mode (transition Rule 1). Table 1 depicts these actions and transitions taken based on the BBB field contents. For example, the first entry represents the transition to Fill Mode when a suitable entry point is found (Rule 1).


During Fill Mode, the TGU writes each retired instruction sequentially into the code cache. All non-branching instructions are written without modification (the default fill rule). However, branching instructions require further treatment. The TGU ignores unconditional jump instructions because the block at a jump target will be filled into sequential locations immediately following the copy of the jump's predecessor. Although conditional branch instructions are written to the code cache, the TGU may invert the branch sense if execution proceeds along the taken path. Conditional branches cause the TGU to perform a BBB lookup to determine how trace generation should proceed.

If the BBB lookup fails to locate a candidate branch entry, the TGU increments a small counter that indicates the number of sequential off-path branches that have been retired. When this counter exceeds a preset threshold (Rule 3), the TGU signals an End Trace condition and transitions back to Scan Mode. Otherwise, the TGU continues to fill instructions along the executed path. For a reasonably sized BBB, a maximum off-path branch threshold of one allows for an occasional missing branch entry while preventing excessive trace generation down cold paths.

When the BBB lookup returnsa candidatebranchentry, the TGU checkswhetherthe branchhasbeen

remappedin the currentdirection. In the casethat the retired branchwas taken, the TGU checksthe code

cachetaken field in theBBB entry; otherwise,it checksthe fall-throughfield. If the retiredbranchhasnot yet

beenremappedin thecurrentdirection,theTGU continuesto fill instructionsfrom theexecutedpath(Rule9).

Beforeproceeding,however, theTGU markstheremappedtargetfield valid andstoresthecurrenttraceaddress

into theremappedaddressfield. Whenemittingtheconditionalbranch,theTGU usestheremappedaddressfrom

theBBB for theoppositedirectionif it is valid. This reducesthenumberof exit pointsfrom thecodecacheto

theoriginal code.

If the BBB lookup reveals that the retired branch has already been remapped in the current direction, the TGU stops filling the trace. If the branch has also been remapped in the opposite direction (Rule 2), then the TGU signals an End Trace condition. At an End Trace condition, the TGU writes the trace's entry point to the code cache for future insertion into the BTB and transitions back to Scan Mode. To close the trace, the TGU emits an unconditional jump using the remapped address for the direction opposite the retired branch direction as the jump target.

If the retired branch's executed direction has been remapped, but not its opposite direction (Rule 6), the TGU signals a Merge condition. At a Merge condition, the current branch is placed in the code cache so that its taken direction links to the already remapped code. Then, the TGU sets the pending target register to the target address of the branch direction that has not yet been remapped and transitions to Pending Mode. In Pending Mode, the


TGU monitors retired conditional branches. If the TGU encounters a retired branch whose target has not been remapped, it compares the branch's target address with the pending target register. If both target addresses are equal (Rule 7), the TGU signals a Continue condition and transitions back to Fill Mode.

By operating in Pending Mode rather than ending a trace, the TGU forms longer, more complete traces that extend beyond loops. Consider forming a trace that contains an inner loop. When the TGU reaches the loop back edge, it enters Pending Mode, because the executed direction of the branch already has a valid remapped target to the top of the loop. When control finally exits the loop, the TGU continues filling the trace where it left off.

While the TGU operates in Pending Mode, it is possible for program execution to leave the code that has already been encountered without reaching the pending target address. The TGU easily recovers from this situation by exiting Pending Mode if it encounters a non-candidate, or cold, branch (Rule 8).

6.4 Trace Consistency

The traces generated by the TGU are identical to the original code with the exception of control flow instructions. The TGU ensures the correctness of all modified branch instructions because it never emits a branch without providing a valid target address. For each branch in a trace, the TGU either uses the original target address of the branch, or it uses a remapped address that points to a target previously placed into the code cache, which is equivalent to the original code.

The TGU also guarantees that all traces are well-formed before the processor can begin executing them. A trace is well-formed if all possible paths starting at the trace entry point are valid. Recall that no entry points for a trace set are installed into the BTB until all traces in the set have been generated. Furthermore, traces from one set are never linked to traces in another set. Therefore, as long as the current trace is closed before trace generation ends, all traces in the set will be well-formed before they are installed. Note that trace generation can be terminated by an asynchronous event such as a context switch without detriment. The TGU simply emits a single closing jump instruction if it is interrupted while filling a trace.

During Fill Mode or Pending Mode, it is desirable for execution to remain in the original code without jumping into the code cache. This prevents new traces from containing copies of code cache instructions and ensures that all exit branches return to the original code. Therefore, while operating in these modes, the processor must ignore BTB entries with the entry point bit set. The time spent in Fill Mode and Pending Mode is small enough that this restriction is inconsequential.


Figure 10: Trace generation example. (a) Execution order during trace generation: C1 A1 B1 A2 C2 A3 B2 C3 D1. (b) Original code layout, with blocks terminated by static branches A, B, C, and D. (c) Code cache layout, with entrance C1 and trace branches CC-A1, CC-B1, CC-A2, CC-C2, and CC-D1. (d) Optimized code cache layout, with entrance C1 and trace branches CC-A1, CC-B1, CC-A2, CC-C2, CC-A3, CC-B2, CC-C3, and CC-D1. (e) BBB contents after trace formation, including the execution and taken counters, candidate flags, code cache taken and fall-through target fields with valid bits, call IDs, and touched bits.

6.5 Trace Generation Example

This section provides an example of the trace generation process. Figure 10b shows the original code layout. The label at the end of each block of instructions represents the static branch instruction that terminates the block. Figure 10a lists the execution sequence seen after entering Scan Mode. The number following each branch label


signifies a dynamic occurrence of that branch. The basic trace generation mechanism described in the previous section generates the trace shown in Figure 10c. The static branches in the code cache are denoted by "CC" followed by the label for the dynamic branch that caused the trace branch to be emitted. The application of two fetch optimizations, patching and branch replication, results in the trace shown in Figure 10d. Patching reduces premature trace exits while branch replication performs more aggressive code straightening, unrolling loops in the process. Figure 10e depicts the final contents of the BBB after trace generation.

The branches A, B, and C are all candidate branches and, therefore, potential entry points. However, C is most likely to be selected as the entry point because it is the most frequently taken branch. After detecting a new trace entry point at C1, the TGU remains in Fill Mode until it reaches dynamic branch C2. Because the taken target of C2 has already been remapped, the TGU transitions to Pending Mode. During this time, the processor executes one loop iteration before reaching C3. Execution of the fall-through path of C signals a Continue condition, and the TGU reenters Fill Mode. The TGU continues generating a trace until the number of sequential non-candidate branches (such as D) exceeds the off-path threshold (Rule 3). The TGU then closes the trace and returns to Scan Mode to search for another trace.

6.6 Enhancements

To further improve the issue bandwidth achieved while executing instructions from the code cache, the TGU was extended with the additional optimizations explained in this section. Implementation details are contained in [2].

Notice from the trace in Figure 10c that executing the taken path of branch CC-A1 would cause control to exit the code cache. When this happens, the system executes original code until an installed entry point is encountered. The TGU employs a simple optimization called patching to prevent such premature trace exits. When the TGU emits branch CC-A1, it places the address of CC-A1 itself in the taken address field of the BBB entry for branch A, but leaves the field marked invalid. When the TGU encounters branch A2, it reads the address of CC-A1 from the BBB and patches CC-A1 so that its branch target points directly to the fall-through path of CC-A2, as shown by the dotted line in Figure 10d. Similarly, the TGU patches branch CC-B1 to the fall-through path of CC-B2. In general, patching can be performed in Fill Mode only when the trace encounters a branch whose target has been remapped in the direction opposite from the direction currently executed.
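The patching mechanism can be sketched in software. This is a simplified model under stated assumptions: the data structures, the offsets, and the post-patch bookkeeping on the BBB field are illustrative, not the paper's exact design.

```python
# Sketch of the patching optimization. When a trace branch is first
# emitted, its off-trace direction exits to original code, and the code
# cache offset of the emitted branch itself is parked in the BBB target
# field with the valid bit clear. When a later instance of the same static
# branch is filled in the opposite direction, the parked branch is patched
# to point at the newly placed path. All structures are illustrative.

code_cache = {}       # code cache offset -> (kind, branch target)
bbb_taken_field = {}  # branch tag -> {"addr": ..., "valid": ...}

def emit_exit_branch(tag, offset, original_target):
    # the branch initially exits the code cache to the original code
    code_cache[offset] = ("branch", original_target)
    bbb_taken_field[tag] = {"addr": offset, "valid": False}  # parked

def fill_opposite_direction(tag, new_offset):
    rec = bbb_taken_field.get(tag)
    if rec and not rec["valid"]:
        kind, _ = code_cache[rec["addr"]]
        # patch the parked branch to target the newly placed path, then
        # record that path as the valid code cache target (an assumption)
        code_cache[rec["addr"]] = (kind, new_offset)
        rec["addr"], rec["valid"] = new_offset, True

emit_exit_branch("A", offset=0x10, original_target="orig_taken_path_of_A")
fill_opposite_direction("A", new_offset=0x40)
print(code_cache[0x10])  # ('branch', 64)
```

After the second call, the parked exit branch at offset 0x10 targets the newly placed code at 0x40 instead of leaving the code cache.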

To improve instruction issue bandwidth, it is desirable to eliminate as many taken branches as possible without reducing instruction cache performance. Branch replication is a general optimization that has the dual effect of both unrolling small loops and tail-duplicating blocks so they exist in multiple traces. Without branch replication, a trace is filled past a particular branch in the same direction only once. Any subsequent copies of that branch in the same trace set are inverted with respect to the first copy such that their target addresses point to the fall-through address of the first copy. Branch replication, on the other hand, allows traces to continue past the branch multiple times without linking back to the first copy of the branch. Application of this optimization can be seen in Figure 10d. Instead of merging the already remapped taken path of CC-C2 to Block(CC-A1), CC-C2 is inverted so that its fall-through path now points to Block(CC-A3). C's BBB code cache taken target field is not updated since it already contains a valid target. Because the fall-through path of C still has not been seen, the taken path from CC-C2 must point back to Block(D), which is the fall-through address in the original code.

Occasionally, during Fill Mode, actual execution does not follow the most frequently executed path through the hot spot. To avoid creation of suboptimal traces, an optimization called backtracking is used to discard the current trace when execution follows a cold path.

High instruction issue rates are often limited by the number of branches that can be predicted in a single cycle. One method for overcoming this limitation is to mark an instruction with a static prediction via a technique called branch promotion. Because mispredictions are expensive, we chose to promote only instructions whose BBB profile indicates 100% execution in one direction.
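The promotion test itself reduces to a simple comparison of the BBB's execution and taken counters. A minimal sketch, with invented names:

```python
def should_promote(exec_count, taken_count):
    """Promote a branch to a static prediction only when its BBB profile
    shows 100% execution in one direction; mispredicting a promoted
    branch is expensive, so any bias short of 100% stays dynamic."""
    if exec_count == 0:
        return None
    if taken_count == exec_count:
        return "taken"       # always taken: promote with a static taken hint
    if taken_count == 0:
        return "not-taken"   # never taken: promote with a static not-taken hint
    return None              # biased but not 100%: leave dynamically predicted
```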

The Trace Generation Unit is also capable of forming traces across indirect jumps. The new JMP INDIRECT INLINE instruction, described in Section 6.9, positions one of its indirect targets immediately following the remapped copy of the jump. Currently, the selection of the target is based upon the target that was first encountered during the trace formation process. One typical use of indirect jumps is in the calling sequence of a shared library function, where there is only a single target of the indirect instruction. For multi-target jumps, a target profiling mechanism could be employed to select the most frequent target.

6.7 Automatic Inlining

Call and return instructions account for a significant portion of control transfers, and benefit can be gained from inlining frequently executed calls. During trace formation, the TGU inlines subroutines that are part of the current hot spot. Inlining an individual call site requires only minor additions to the trace generation logic. However, the TGU must avoid linking different inlined copies of the same original subroutine to each other. To do this, the TGU attaches a call ID to each inlined call site that is unique within the current trace set. The TGU maintains the next available call ID in a register, which can be reset upon detection of a new hot spot. During trace generation, the TGU also maintains a call ID stack that represents the current calling context.

In addition to branch instructions, the TGU identifies call and return instructions during Fill Mode. When the TGU encounters a retired call instruction, it emits a CALL INLINE instruction, described in Section 6.9. It then pushes a new call ID onto the call ID stack along with the return address of the call. Similarly, when the TGU encounters a return instruction, it pops the top entry from the call ID stack. If the stack is empty, or if the return address does not match the next retired instruction address, then the TGU signals an End Trace condition (Rule 4). This is usually a result of executing a return instruction without having inlined its corresponding call.

A slight extension to the BBB access performed by the TGU upon retirement of a conditional branch has been made to support inlining. When the TGU updates the remapped target field in the BBB, it also stores the current call ID. Additionally, when the TGU performs a BBB lookup, it compares the recorded call ID with the current call ID. A remapped target is only considered valid if the two call IDs match. This effectively prevents an inlined subroutine from being linked to another copy of the same subroutine in a different calling context. The BBB lookup also compares the call ID field with those on the call ID stack. If the branch entry's call ID is found in a previous stack entry (Rule 5), then a recursive call has been made. In this case, the TGU signals an End Trace condition to avoid recursive inlining.
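The call ID bookkeeping described in Section 6.7 can be modeled in a few lines. The sketch below is our software approximation of Rules 4 and 5; the class and method names are illustrative, not from the paper:

```python
class CallIDTracker:
    """Software model of the TGU's call ID stack for inlining."""

    def __init__(self):
        self.next_id = 1
        self.stack = []            # entries of (call_id, expected_return_addr)

    def on_call(self, return_addr):
        """Inline a call: assign a fresh call ID, unique within the trace set."""
        self.stack.append((self.next_id, return_addr))
        self.next_id += 1

    def on_return(self, next_retired_addr):
        """Pop the call ID stack; signal End Trace (Rule 4) if the stack is
        empty or the recorded return address does not match."""
        if not self.stack:
            return "end_trace"
        _, expected = self.stack.pop()
        return "continue" if expected == next_retired_addr else "end_trace"

    def valid_remap(self, recorded_call_id):
        """A remapped BBB target is valid only in the same calling context."""
        current = self.stack[-1][0] if self.stack else 0
        return recorded_call_id == current

    def is_recursive(self, recorded_call_id):
        """Rule 5: a call ID found below the top of the stack indicates a
        recursive call, so the trace must end."""
        return any(cid == recorded_call_id for cid, _ in self.stack[:-1])
```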

6.8 Automatic Inlining Example

Consider the example shown in Figure 11. The caller consists of a single-block loop that makes two serial calls to the same callee. The trace generation process begins as normal, placing the block terminated by callA (Block(callA)) into the code cache. To inline callA, the TGU pushes the next available call ID (1) onto the call ID stack along with the expected return address (Block(callB)). Next, the TGU inserts a CALL INLINE instruction in place of the original call. When executed, this instruction pushes the return address of the original call site, Block(callB), onto the process stack, but allows execution to fall through to the next instruction in the trace. Pushing the original return address is a safety mechanism that guarantees return of control to the correct function if execution takes an early exit from the trace. This also hides the existence of the code cache from programs that read the return address value directly.

As the TGU fills the body of the inlined subroutine (D, E, and ret), the BBB entries for D and E are tagged with call ID 1. When the TGU reaches the return instruction, it pops the top entry from the call ID stack and verifies that the expected return address (Block(callB)) matches the address of the next retired instruction. The TGU then emits a RETURN INLINE instruction and continues filling the trace. Although the RETURN INLINE normally allows control to fall through to the next instruction in the trace, it still must compare the return address found on the stack with the original address of the next trace instruction. This check is necessary in the event that a program directly modifies its return address. If the comparison fails, the RETURN INLINE behaves exactly as a normal return instruction.


[Figure 11: Trace example with inlining. The figure shows the original code layout (blocks C, callA, D, E, ret, callB, callF, jmp) and the code cache layout (CC-callA1, CC-D1, CC-E1, CC-ret, CC-callB1, CC-D2, CC-callF1, CC-D3, CC-jmp), along with the call ID stack contents at each stage. Execution order during trace generation: C1 callA1 D1 E1 ret1 callB1 D2 callF1 D3.]

Following the RETURN INLINE instruction, a trace is formed through Block(callB), and inlining of callB begins. While inlining the second copy of the subroutine, the call ID stack contains call ID 2. This prevents the TGU from patching the taken target of CC-D1 to the fall-through target of CC-D2.

After Block(callF) is filled, another call site to the same function is encountered. CallF is inlined as before, and the next call ID (3) is pushed onto the call ID stack. Block(CC-D3) is then filled into the trace. When a BBB lookup for D is performed, the BBB call ID field value (2) matches one of the call ID stack entries below the current stack pointer. Therefore, the condition for Rule 5 is satisfied, and the TGU ends the trace by emitting a conditional jump to Block(callF) and an unconditional jump to Block(E).

6.9 New Instructions

The following are new instructions that are used within the code cache. These instructions are designed to support run-time optimization in hardware but are not visible to the programmer. Although it might be possible to emulate these operations with traditional instructions, they are proposed to be implemented in the microarchitecture for maximum efficiency.

- CALL INLINE(return addr to original code): Unlike a normal function call, the program counter is set to the next sequential instruction, and the return address is set to the operand value rather than the next PC. The process stack and the branch prediction Return Address Stack (RAS) are properly maintained in case a normal return is executed later.

- RETURN INLINE(expected return addr): Execution speculatively continues with the instructions immediately following the RETURN INLINE. Meanwhile, the operand, which is the expected return address, is compared to the return address on the stack. If the values do not match, then a misprediction occurs and a normal return is executed.

- CALL INDIRECT INLINE(indirect addr, inlined addr, return addr to original code): The actual indirect target is calculated normally and compared against the inlined address operand. If there is a mismatch between the actual target and the inlined target, a normal call is made to the calculated target. Otherwise, a CALL INLINE is performed. In both cases, the original return address provided by an operand is pushed onto the stack.

- JMP INDIRECT INLINE(indirect addr, inlined addr): The actual indirect target is calculated normally and compared against the inlined address operand. If there is a mismatch between the actual target and the inlined target, a normal jump is made to the calculated target. Otherwise, control passes to the instruction immediately following the inlined jump. This instruction is particularly useful for optimizing across DLL boundaries.
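The fall-through-or-recover behavior of these instructions can be illustrated with a small behavioral sketch. This is a software model only; the function and variable names are ours, and the paper proposes these as microarchitectural operations:

```python
def return_inline(expected_return_addr, process_stack, next_trace_pc):
    """RETURN INLINE: fall through inside the trace if the return address
    on the stack matches the expected one; otherwise behave as a normal
    return to the popped address (a misprediction)."""
    actual = process_stack.pop()
    if actual == expected_return_addr:
        return next_trace_pc        # continue with the next trace instruction
    return actual                   # mismatch: execute a normal return

def jmp_indirect_inline(compute_target, inlined_addr, next_trace_pc):
    """JMP INDIRECT INLINE: fall through when the computed indirect target
    matches the inlined target, else perform a normal indirect jump."""
    actual = compute_target()
    if actual == inlined_addr:
        return next_trace_pc        # inlined target chosen correctly
    return actual                   # mismatch: jump to the calculated target
```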


Benchmark             Num. Traced Insts.   Actions
099.go                89.5M                2stone9.in training input
124.m88ksim           120M                 ctl.in training input
126.gcc               1.18B                amptjp.i training input
129.compress          2.88B                test.in training input, count enlarged to 800k
130.li                151M                 train.lsp training input (6 queens)
132.ijpeg             1.56B                vigo.ppm training input
134.perl              2.34B                jumble.pl training input
147.vortex            2.19B                vortex.in training input
MSWord(A)             325M                 open 16.0MB .doc file, search, then close
MSWord(B)             911M                 load 25-page .doc, repaginate, word count, select entire doc, change font, undo, close
MSExcel               168M                 VB script generates Si diffusion graphs
Adobe PhotoDeluxe(A)  390M                 load detailed tiff image, brighten, increase contrast, and save
Adobe PhotoDeluxe(B)  108M                 export detailed tiff image to encapsulated postscript
Ghostview             1.00B                load gsview and 9-page ps file, view, zoom, and perform text extraction

Table 2: Benchmarks for detection and trace generation experiments.

7 Experimental Evaluation

Trace-driven simulations were performed on a number of applications in order to explore the effectiveness of the Hot Spot Detector and the performance of the Trace Generation Unit. The experiments for the detector were designed to maximize the program coverage of the collected hot spots while minimizing both the number of spurious branches in the hot spots and the latency of detection. Additional experiments examine the ability of the trace generator to produce frequently executed traces optimized for improved instruction fetch performance. Both SPECINT95 and common Windows NT applications were simulated to provide a broad spectrum of typical programs. These benchmarks are summarized in Table 2. The eight applications from the SPECINT95 benchmark suite were compiled from source code using the Microsoft VC++ 6.0 compiler with the optimize-for-speed and inline-where-suitable settings. Several Windows NT applications executing a variety of tasks were also simulated. These applications are the general distribution versions, and thus were compiled by their respective independent software vendors.

The experiments were performed using the inputs shown in Table 2. In order to extract complete execution traces of these applications (all user code, including statically- and dynamically-linked libraries), we used SpeedTracer, special hardware capable of capturing dynamic instruction traces on an AMD K6 platform. Since the traced instructions are from the x86 ISA, variable-length instructions are used throughout simulation. To ensure examination of all executed user instructions, sampling was not used during trace acquisition or simulation.


Parameter                  Detection Experiments   Trace Generation Experiments
Num BBB entries            2048                    1024
BBB associativity          2-way                   4-way
Exec and taken cntr size   9 bits                  9 bits
Candidate branch thresh    16                      16
Refresh timer interval     4096 branches           4096 branches
Clear timer interval       65535 branches          32768 branches
Hot spot detect cntr size  13 bits                 13 bits
Hot spot detect cntr inc   2                       2
Hot spot detect cntr dec   1                       1

Table 3: Hardware parameter settings.

7.1 Hot Spot Detector Evaluation

In order to evaluate the performance of the Hot Spot Detector, a number of experiments were conducted to examine the quality of the hot spots produced. Since the detected hot spots provide the basis for trace generation and optimization, maximizing coverage of the dynamic execution is critical. While the trace generation process utilizes heuristics to account for an occasional missing important branch, neglecting branches will often preclude optimization of paths containing those branches. However, providing concise hot spots as swiftly as possible ensures minimal optimization overhead and maximum available time to spend in optimized code.

7.1.1 Hot Spot Detection Experimental Setup

Because the design space is large, experimentally evaluating the individual effect of each hardware parameter was infeasible. We therefore selected initial parameters that attempted to match the observed hot spot behavior and then further refined them, resulting in parameters that exhibit desirable hot spot collection behavior. These parameters were used in the experiments presented in this section and are shown in Table 3. The BBB hardware was configured to allow branches with a dynamic execution percentage of 0.4% (16 executions / 4096 branches) or higher to become candidates (the candidate ratio). The HDC was configured with a threshold execution percentage which required that the candidate branches total more than 66% (2:1) of the execution to become a hot spot. The BBB was allowed 16 refreshes (totaling 65535 branches) to detect a hot spot before it was reset. For the trace generation experiments, a table with fewer entries was chosen to mitigate the cost of adding additional fields to each BBB entry.
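The threshold arithmetic quoted above can be checked directly. The sketch below models the Hot Spot Detection Counter using the Table 3 parameters, under our reading that each candidate branch decrements the counter and each non-candidate branch increments it, so saturation at zero occurs only when candidate branches exceed roughly two-thirds of retired branches:

```python
def hdc_step(counter, is_candidate, inc=2, dec=1, size_bits=13):
    """One retirement-time update of the Hot Spot Detection Counter:
    candidate branches pull the counter toward zero (dec = 1), while
    non-candidate branches push it up (inc = 2), saturating at both
    ends of the 13-bit range. Saturation at zero signals a hot spot."""
    max_val = (1 << size_bits) - 1
    if is_candidate:
        return max(0, counter - dec)
    return min(max_val, counter + inc)

# Net drift per branch with candidate fraction f is (1 - f) * 2 - f * 1,
# which is negative exactly when f > 2/3 -- the 66% (2:1) threshold above.
```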


Benchmark      # hot   # static     % static      % total     % total       Dyn. insts.
               spots   insts. in    executed      exec. in    exec. in      in hot spots
                       hot spots    insts. in     hot spots   detected      after
                                    hot spots                 hot spots     detection
099.go           6       2398         3.46          37.84       35.39         31.7M
124.m88ksim      4       1576         2.78          93.03       92.30         110M
126.gcc         47      17665         8.90          58.42       52.12         617M
129.compress     7        918         2.12          99.93       99.81         2.87B
130.li           8       1447         3.00          91.28       90.88         137M
132.ijpeg        8       2556         3.48          91.07       91.00         1.42B
134.perl         5       1738         2.13          88.43       85.99         2.01B
147.vortex       5       2161         1.76          72.30       71.93         1.58B
MSWord(A)        5       3151         1.17          91.36       91.08         296M
MSWord(B)       21      12541         2.40          69.13       62.04         566M
MSExcel         25      18936         2.94          60.01       54.85         88.2M
PhotoD.(A)      20       5485         1.68          94.31       90.97         354M
PhotoD.(B)      14       4192         1.78          94.24       90.81         98.5M
Ghostview       33       8938         2.82          73.39       72.55         2.30B

Table 4: Summary of the hot spots found in the benchmarks.

7.1.2 Hot Spot Detection Coverage

Table 4 summarizes the effectiveness of our proposed hardware at detecting run-time optimization opportunities for each benchmark. The number of hot spots column lists the number of times that the Hot Spot Detection Counter saturated at zero, indicating the detection of a new hot spot. The number of static instructions in hot spots is a close estimate of the total number of instructions that will be delivered to the optimizer collectively over the entire execution of the program. The next column, percent static executed instructions in hot spots, relates the number of static instructions in hot spots to the total number of static instructions executed. The results show that, out of all the static instructions executed by the microprocessor, only a small percentage lie within hot spots. Note that some instructions may be present in more than one hot spot and are counted multiple times accordingly. The portion of total dynamic instructions represented by these hot spots is shown in the next column, percent total execution in hot spots. Because this hardware cannot detect hot spots instantly, some time that could be spent executing in optimized hot spots is spent executing original code during detection. The time spent in hot spots after they are detected is shown in the percent total execution in detected hot spots, and the time lost to detection can be found by taking the difference between this column and the previous column. Finally, the last column, dynamic instructions in hot spots after detection, shows the number of dynamic instructions that could benefit from run-time optimization. This number reflects any subsequent reuses of detected hot spots.

Analysis of the results shows that only a small percentage, usually less than 3%, of the static code seen by the microprocessor executes intensively enough to become hot spots. Since a large percentage of the dynamic execution is represented by a small set of instructions, often nearly 90% of the program's execution, a run-time optimizer can easily focus on this small set with the potential for significant performance increase. In addition, only about 1% of the possible time spent in optimized hot spots is lost due to detection. For example, in 130.li, the hot spot static instructions comprise only 3% of the total static instructions, yielding a total hot spot code size of 1447 instructions. Furthermore, 90.88% of the entire execution is spent in detected hot spots. Our analysis shows that ideally the hot spots account for 91.28% of execution, and thus only 0.40% is lost during the detection process. This indicates that our Hot Spot Detector makes the identification so swiftly that the execution of hot spot regions falls almost entirely within potentially run-time optimized code.

[Figure 12: Detailed hot spot statistics. (a) 134.perl. (b) PhotoDeluxe(A). For each hot spot, bars show the static hot spot code size (instructions), and a line shows the percentage of dynamic execution in the detected hot spot.]

Examining individual hot spots reveals interesting characteristics of program behavior. Figure 12a details the detected hot spots from the 134.perl benchmark. For each hot spot, the bar graph shows the static code size of the hot spot, while the line graph shows the percentage of execution spent in that hot spot after detection. This benchmark consists of three primary hot spots: hot spots 1, 4, and 5 on the graph. These correspond to the three hot spots of Figure 2 in Section 3 (note that in the histogram, 134.perl was compiled for the IMPACT [32] architecture without inlining). From this histogram, code can be seen executing between the first and second primary hot spots, namely hot spots 2 and 3. In hot spot 3, the cmd_exec function loops 117k times, calling str_free in each iteration. The 9 blocks, totaling 43 static instructions, contribute 7.3M dynamic instructions to the program's total execution. Because of the intense nature of these blocks, they serve as good candidates for run-time optimization. Analysis of this benchmark also shows that one hot spot is much more dominant than the others in terms of dynamic execution. In this case, optimizing only hot spot 4 could benefit over 58% of the dynamic instructions executed.


[Figure 13: Absolute difference between detected taken percentage and actual taken percentage for all significant detected hot spots. Histograms for 134.perl and MSWord(B) bin branches by the absolute change between detected and actual taken percentage (2% through 100%, plus NP for branches never executed after detection).]

Similar characteristics were observed in the other benchmarks. Figure 12b shows an example from one of the precompiled Windows NT applications, PhotoDeluxe(A). For this benchmark, there are several hot spots that each represent at least 8% of the total execution and together represent more than 50%. We also see quite a few hot spots with small static code sizes, indicating tight, intensely executed code. In fact, for this benchmark, the smaller-sized hot spots are also those with high total execution percentages, indicating excellent opportunities for run-time optimization.

The benchmark 099.go is a notable example of a benchmark without obvious hot spots. While this game simulation repetitively executes players' moves, each move touches a large amount of static code with little temporal repetition. The hardware was still able to detect six hot spots representing 35% of the execution. There is one primary hot spot that represents 28% of the execution with a static code size of 1170 instructions. Our data has shown that the static sizes of the detected hot spots vary significantly, from tens of instructions to the low thousands.

7.1.3 Hot Spot Detection Accuracy

To evaluate the accuracy of the hot spot profiler, the branch taken ratios collected at hot spot detection time were compared to their average taken ratios at the completion of the application. The completion-time statistics were gathered by attributing each dynamic branch execution to the appropriate hot spot. Figure 13 shows the comparison between the two taken ratios for two typical benchmarks. In these graphs, only hot spots that represented 1% or more of the program execution were considered. For each hot spot branch, the difference between its detected taken percentage and its accumulated counterpart is computed and tallied. The figure shows that a vast majority of the branches have taken ratios at detection time within 2% of their accumulated values. These results indicate accurate profiling and therefore the potential for using detection profile weights for optimization. However, a few branches also change their behavior dramatically, as can be seen by the larger changes. The last category of the graph, NP, represents branches that were detected as part of the hot spot but never executed after detection. Including these branches when performing optimization may introduce obstacles to aggressive optimization.

7.2 Trace Generation Unit Evaluation

In order to evaluate the performance of the Trace Generation Unit, the quality of the generated traces is examined. Because the generated traces will form the unit of compilation for the optimizer, it is critical that they cover the frequently executed paths in the hot spot. Furthermore, program performance will benefit from the formation of longer traces because they provide the optimizer with a larger window on which to operate. Optimizations such as partial inlining, loop unrolling, and branch promotion can be utilized to assist in the formation of longer traces. In addition, a trace optimizer may be employed that performs rescheduling and other optimizations on the traces. For this section's results, optimizations that improve instruction fetch performance on a traditional fetch mechanism were employed, and the resulting fetch performance compared to that of a trace cache fetch mechanism.

Figure 14 depicts an optimized trace taken from the 130.li benchmark, as shown in Figure 5. This particular trace highlights the use of partial inlining to form a long trace through a complicated calling sequence. The TGU forms a single trace that begins prior to the call of evform, continues following the hot branches while inlining both calls to xlygetvalue, and returns to evform, where the TGU terminates the trace because the maximum number of allowed off-path branches has been exceeded. This trace contains 284 instructions and has 10 inlined function calls. Its execution accounts for 10% of the optimized fetch cycles of the entire application and achieves 15.1 fetch instructions per cycle (FIPC) on a 16-instruction-issue architecture. Our simulations profiled the execution of the trace, counting the number of trace exits taken. These exits are shown between the blocks in the figure, while a large number of other side exits that are never taken have been eliminated. Clearly, a large number of trace executions proceed at least through the third block, providing evidence that optimizations to reduce the dependence height of that path would increase execution performance. Considering the third branch, the trace appears to have been generated along the less frequently executed path. However, at the time of hot spot detection, that branch flowed 100% of the time to the fourth block. This branch provides evidence that individual branches may also exhibit phased behavior.


[Figure 14: Example generated trace from 130.li optimized for instruction fetch performance. The trace and its inlined call graph span xleval, evform, xlsave, xlevarg, xlgetvalue, xlygetvalue, xlobgetvalue, and xcond; blocks of 31 to 79 instructions are shown with the trace-exit counts taken between them.]

7.2.1 Trace Generation Unit Experimental Setup

Instruction-by-instruction simulations of a sequential-block instruction fetch unit were performed featuring a 64KB, 4-way set associative, 128-byte line, split-block, 10-cycle miss penalty L1 ICache. The L2 ICache consists of a 512KB, 2-way set associative, 256-byte line, split-block, 100-cycle miss penalty cache. Some fetch units were also coupled with trace caches featuring either 128 (8KB) or 2048 (128KB) 4-way set associative, 64-byte lines. Table 5 summarizes the various configurations. The trace caches are allowed to form traces containing instructions from the code cache.

The simulated ICache model has a split-block configuration such that each line is divided into two banks. If a request falls into the second bank, the first bank of the subsequent cache line is also returned, if present. The instruction buffer is capable of delivering up to sixteen instructions per cycle to the decoders, but will not issue instructions past a taken branch. Up to three branches may be issued per cycle, and any instructions in the fill buffer that fall after the third branch will not be used until they are verified to be on the predicted path. The ICache assumes predecode information to identify instruction boundaries and branches.

Model         L1 ICache        Trace Cache               HSD and
              (size, way,      (size, way, block,        TGU
              block)           bias table)
Traditional   64KB, 4, 128B    none                      no
with TGU      64KB, 4, 128B    none                      yes
Trace Cache   64KB, 4, 128B    8KB, 4, 64B, 4KB          no
TC with TGU   64KB, 4, 128B    8KB, 4, 64B, 4KB          yes
Trace Cache   64KB, 4, 128B    128KB, 4, 64B, 16KB       no
TC with TGU   64KB, 4, 128B    128KB, 4, 64B, 16KB       yes
Traditional   128KB, 4, 128B   none                      no

Table 5: Fetch mechanism models.

A 14-bit-history gshare branch predictor [33] is modeled with a pattern history table consisting of entries with seven 2-bit counters, together capable of three predictions per cycle. In addition to the conditional branch predictor, a 32-entry return address stack and a 1024-entry indirect address predictor are provided. We model an ideal BTB to isolate the effect of storing entry points in the BTB. The entry point replacement policy has been deferred for future work.
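The gshare predictor referenced above follows the standard scheme of XORing branch history with the PC. The following is a minimal sketch of the indexing; the grouping of the PHT into seven-counter entries and the 2-bit counters themselves are abstracted away:

```python
HISTORY_BITS = 14  # matches the 14-bit history described above

def gshare_index(pc, global_history):
    """XOR the (word-aligned) branch PC with the global branch history
    to index the pattern history table of 2-bit saturating counters."""
    mask = (1 << HISTORY_BITS) - 1
    return ((pc >> 2) ^ global_history) & mask

def update_history(global_history, taken):
    """Shift the latest branch outcome into the global history register."""
    mask = (1 << HISTORY_BITS) - 1
    return ((global_history << 1) | (1 if taken else 0)) & mask
```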

We also model a trace cache that is indexed on the trace's starting address and allows partial matches (it has the ability to fetch the beginning of a trace up to a prediction mismatch). Both trace cache models (8KB, 128KB) are coupled with an ICache and use the same branch predictor as the ICache. When a fetch request is made, both units are accessed in parallel; a trace cache hit always takes precedence over an ICache hit, and only when both caches miss is the L2 ICache accessed. The trace cache is block-based and is modeled after the design in [34]. Each cache line is 64 bytes wide with slots for 16 instructions and up to 3 branches. Four target addresses are stored in the line to provide the next fetch address in case of partial matching. Traces end when the limit on instructions or branches is reached, or when an indirect branch instruction is encountered. The traces are built at basic-block granularity unless more than half of the line would be wasted, in which case partial blocks may be filled. The trace cache also utilizes a Branch Bias Table (BBT) of 1024 or 4096 entries (approximately 4 bytes each) to facilitate branch promotion within traces. Including the additional target addresses and tag stored in each cache line, the combined size for the 8KB trace cache and 1024-entry BBT is approximately 15KB.


[Figure 15: Reduction in taken control-flow instructions in optimized code compared to original code. For each benchmark, stacked bars break the taken control flow into returns, indirect calls, calls, indirect jumps, unconditional jumps, and taken branches, normalized to the original code.]

7.2.2 Performance of Optimized Traces

Figure 15 summarizes the reduction in taken control-transferring instructions due to the code straightening and fetch optimization. Each pair of bars for a benchmark is normalized to 100% of the taken control transfers in the original code. The bars for the optimized applications include taken control transfers from both the code cache and the original code. On average, a 45% reduction is seen across the benchmarks, verifying the effectiveness of the code-straightening techniques. Notice that call and return inlining is particularly effective, removing 25% of the taken control transfers in 147.vortex, and sizeable amounts in the other benchmarks. Code straightening techniques for the conditional branches yield an average 24% reduction in taken control transfers, and as much as 40% for MSWord(A), PhotoDeluxe(A), and PhotoDeluxe(B). These results show a dramatic reduction in the number of taken branches.

Table 6 presents the results of the hot spot detection and trace generation system with fetch optimizations. A large percentage of dynamic execution occurs in instructions from the code cache. Typically less than one percent of execution is spent looking for traces to form within a hot spot (Scan/Pending Modes), and a very small percentage is spent actually writing to the code cache (Fill Mode), often less than 0.005%. Even if the writing process requires several cycles per code cache instruction, the total overhead would be well under 0.1%.

Benchmark   % Insts. from   % Scan/   % Fill   Code Size   Entry
            Code Cache      Pending   Mode     (KB)        Points
go            10.51          1.02     0.0051     14.8        60
m88ksim       68.91          4.98     0.0027      8.6        49
gcc           32.01          1.07     0.0063    135.1       715
compress      87.05          0.84     0.0001      6.4        30
li            74.32          0.61     0.0032     13.6        59
ijpeg         84.44          0.09     0.0005     22.2        57
perl          72.34          0.04     0.0002     12.4        69
vortex        34.08          0.12     0.0006     26.4       103
Word(A)       78.46          0.08     0.0014     10.2        37
Word(B)       45.66          0.29     0.0040     73.9       330
Excel         30.69          3.12     0.0271     87.6       352
PD(A)         86.38          0.58     0.0030     18.9       105
PD(B)         81.15          1.25     0.0107     19.1       101
Gsview        60.15          0.35     0.0027     61.0       336
Average       60.44          1.03     0.0048     36.4       172

Table 6: Benchmark detection and trace generation results.

To evaluate the effectiveness of the layout optimizations, each benchmark was simulated with several different fetch unit configurations. Figure 16 shows the performance of the various fetch mechanisms. As our optimizations were targeted toward the fetch unit, the Fetched Instructions Per Cycle (FIPC) metric was selected as an appropriate gauge of effectiveness. The bars of the graph compare the FIPC of various processor models to a baseline configuration of an aggressive multiple-block fetch unit operating on the original code. The first bar depicts the FIPC for a processor with hot spot detection and trace generation hardware, which averages 22% improvement over the base case. The improvement achieved by a comparably sized trace cache is 18%. Adding the trace cache in addition to our proposed hardware yields a benefit of 25% over the base case. With a much larger trace cache, approximately 15 times larger in size than the hot spot hardware, the FIPC is improved to 32% over base, and 39% if hot spot hardware is also included. Doubling the size of the traditional instruction cache has a significantly higher hardware cost than the hot spot hardware, and realizes only a 1% improvement. Despite the large reduction in taken control transfers shown in Figure 15, FIPC does not necessarily scale accordingly. This is primarily because branch mispredictions cause a dramatic number of stall cycles, which lessens the effect of improving throughput during useful cycles.

One advantage of our system over a trace cache is its ability to inline returns and indirect branches, as was seen in Figure 14. Our traces may also include loops, as was shown in Figure 10. Conveying loop structure potentially allows better optimization than could be performed with simpler traces.


[Figure: bar chart of the ratio of fetched IPC (optimized / 64KB IC baseline) for each benchmark under six fetch configurations: IC:64KB with TGU; IC:64KB TC:8KB; IC:64KB TC:8KB with TGU; IC:64KB TC:128KB; IC:64KB TC:128KB with TGU; and IC:128KB. The base fetch IPC of each benchmark (ranging from 1.71 to 8.27) is also annotated.]

Figure 16: Fetched IPC for various fetch mechanisms.

7.3 Benefits of Hardware Support

The allocation of dynamic optimization components to hardware or software impacts the overhead, granularity, and flexibility of the resulting system. The larger the overhead, the longer an optimization must persist to amortize its cost and claim a benefit. If the optimization granularity is kept small, more of a program can be improved. Flexibility in what and when to optimize allows the system to adapt and vary over time and between applications.
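The amortization argument can be made concrete with a back-of-the-envelope calculation; the function and the numbers below are illustrative, not measurements from this paper.

```python
import math

def break_even_executions(overhead_cycles, cycles_saved_per_execution):
    """Number of times an optimized region must execute before the
    one-time optimization overhead is paid back."""
    return math.ceil(overhead_cycles / cycles_saved_per_execution)

# A hypothetical 100,000-cycle optimization cost and 50 cycles saved
# per pass through the region requires 2,000 executions to break even;
# a region that dies sooner than that is a net loss.
```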

Hardware mechanisms promise to control the run-time overheads of dynamic optimization systems by transparently applying profiling and optimization techniques. By using dedicated components, hardware mechanisms typically allow for greater parallelism, often processing in parallel with the running application. Software systems typically implement detailed profiling by utilizing instruction interpretation, or by inserting profiling probes through just-in-time compilation. This leads to frequent transitions between the application and the optimizer/operating system, potentially adding significant overhead.

Experiments have shown minimal overhead due to the hot spot detection and trace generation process. Enhancing the Trace Generation Unit to employ optimization techniques is likely to add minimal overhead as well, since the hardware optimizer will operate in parallel with trace generation and native program execution.

The Hot Spot Detector can continuously monitor program execution at low cost, conducting profiling without degrading the performance of the program until a hot spot is detected. In essence, our mechanism enables full-speed native execution of the application with minimal, decisive, and surgical optimizations to the code. This combination of continuous profiling and precision allows for a smaller optimization granularity than a software system.

One primary benefit of software reoptimization approaches is their flexibility. Within any profiling and code deployment system, a number of strategies can be employed to dynamically decide when and what to optimize. While the Hot Spot Detector and Trace Generation Unit are hardware structures, they too contain a number of parameters that can be adjusted dynamically. For example, Hot Spot Detector thresholds and timers can be adjusted to vary the size of the code regions detected. The detector could be configured to identify and optimize critical code first, later broadening its scope to the remaining code.
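A simplified software model of such a tunable detection policy is sketched below: per-branch counters accumulate over a sampling window (the "timer"), and branches whose share of retirements meets a candidate threshold are reported as hot. Both knobs are adjustable, mirroring the tunable thresholds and timers described above; the function itself and its defaults are our own illustration, not the paper's hardware design.

```python
from collections import Counter

def detect_hot_branches(retired_pcs, window=1000, threshold=0.05):
    """Report PCs whose share of retirements within any single
    window of `window` instructions meets `threshold`."""
    hot = set()
    for i in range(0, len(retired_pcs), window):
        counts = Counter(retired_pcs[i:i + window])
        total = sum(counts.values())
        hot.update(pc for pc, c in counts.items() if c / total >= threshold)
    return hot
```

Raising the threshold or shortening the window narrows detection to only the hottest code; lowering them broadens the regions considered, which is exactly the kind of policy adjustment the hardware parameters permit.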

Hardware mechanisms can also be constructed with accurate knowledge of the underlying microarchitecture. For example, when rescheduling code, the exact latencies of processor instructions can be known. This provides a means for optimizing code for new microarchitectures, since the information required for performing quality optimization is present in the microarchitecture itself.
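The value of exact latency knowledge can be seen in even a toy scheduler. The following greedy list-scheduling sketch (our own illustration, not the paper's optimizer) issues an instruction only once its operands' known latencies have elapsed; with wrong latency assumptions, the resulting issue cycles would be wrong too. Instructions are (name, dests, srcs, latency) tuples, and each register is assumed to be written at most once for simplicity.

```python
def schedule(instrs):
    """Greedily assign issue cycles using exact operand latencies.

    instrs: list of (name, dests, srcs, latency) tuples, each register
    written at most once. Returns [(issue_cycle, name), ...].
    """
    INF = float('inf')
    ready_at = {}                       # register -> cycle value is available
    for _, dsts, _, _ in instrs:
        for d in dsts:
            ready_at[d] = INF           # not available until its writer issues
    cycle = 0
    order = []
    pending = list(instrs)
    while pending:
        issued = False
        for ins in pending:
            name, dsts, srcs, lat = ins
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                for d in dsts:
                    ready_at[d] = cycle + lat
                order.append((cycle, name))
                pending.remove(ins)
                issued = True
                break
        if not issued:
            cycle += 1                  # nothing ready: advance one cycle
    return order
```

With a 3-cycle load feeding a 1-cycle add feeding a multiply, the scheduler issues them at cycles 0, 3, and 4; if the optimizer assumed the wrong load latency, dependent instructions would be placed too early or too late.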

8 Conclusion

Recent innovations in microprocessor design have given the processor more control over how to execute code optimally. Our system advances the state of the art by allowing the processor to detect the most frequently executed code, to perform code straightening, partial function inlining, and loop unrolling optimizations, and to deploy the code for immediate use, in a manner transparent to the user application. The Hot Spot Detector monitors the retired instruction stream, providing a relative profile of the most frequently executed instructions in the stream. Unlike other hardware profilers, the detector utilizes an on-line analysis algorithm to determine when a suitable region for run-time optimization has been found. The Trace Generation Unit extracts traces from the instruction stream, filtering out infrequent paths by using the profiles stored in the detector. The optimized trace sets are written into memory so that they may utilize the traditional fetch mechanism. The detection and extraction of frequently executed code is done at the retirement stage of the processor, off the timing-critical paths. Preliminary results show that the optimizations applied by our system achieve significant fetch performance improvement at little extra hardware cost. The performance improvement is due primarily to the mechanism's ability to identify hot spots early in their lifetime, so that the vast majority of the execution of these hot spots is spent in optimized code. In addition, because the generated code consists of important, persistent traces, our mechanism creates opportunities for more aggressive optimizations. Areas of future investigation include issues involved in performing code optimization in the presence of exceptions and in multiprocessor environments, the application of more aggressive run-time optimizations within our framework, and the application of hot spot profiling to static compilers and binary reoptimizers.

9 Acknowledgments

The authors would like to thank John Sias, Chris Shannon, Joe Matarazzo, Hillery Hunter, and all of the other members of the IMPACT research group, whose comments and suggestions have helped to improve the quality of this research. We also thank Prof. Sanjay Patel for his assistance in understanding the trace cache, and the anonymous referees for their constructive comments. This research has been supported by Advanced Micro Devices, Hewlett-Packard, Intel, and Microsoft.
