Compiling for IA-64

Carol Thompson, Optimization Architect, Hewlett Packard


Post on 06-Jan-2016


Page 1: Compiling for IA-64

Compiling for IA-64

Carol Thompson
Optimization Architect
Hewlett Packard

Page 2: Compiling for IA-64

History of ILP Compilers

• CISC era: no significant ILP
  – Compiler is merely a tool to enable use of high-level language, at some performance cost
• RISC era: advent of ILP
  – Compiler-influenced architecture
  – Instruction scheduling becomes important
• EPIC era: ILP as driving force
  – Compiler-specified ILP

Page 3: Compiling for IA-64

Increasing Scope for ILP Compilation

• Early RISC Compilers
  – Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW Compilers
  – Trace scope (single entry, single path)
  – Superblocks & Hyperblocks (single entry, multiple path)
• EPIC Compilers
  – Composite regions: multiple entry, multiple path

[Figure: scheduling scopes labeled Basic Blocks, Traces, Superblock, Composite Regions]
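As an illustration of basic block scope, here is a minimal Python sketch that splits a linear instruction list into basic blocks at branches and branch targets. The instruction encoding (an opcode paired with an optional target index) and the opcode names are assumptions made for this example; they do not come from the slides.

```python
def find_leaders(instrs):
    """Return the sorted indices of instructions that start a basic block.

    instrs is a list of (opcode, target_index_or_None) tuples.
    """
    leaders = {0} if instrs else set()
    for i, (op, target) in enumerate(instrs):
        if op in ("br", "br.cond"):
            if target is not None:
                leaders.add(target)      # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # so does the instruction after a branch
    return sorted(leaders)


def basic_blocks(instrs):
    """Group instruction indices into basic blocks."""
    leaders = find_leaders(instrs)
    ends = leaders[1:] + [len(instrs)]
    return [list(range(start, end)) for start, end in zip(leaders, ends)]


# Hypothetical if/else shape: 0-1 test, 2-3 "then", 4 "else", 5 join.
program = [
    ("cmp", None),     # 0
    ("br.cond", 4),    # 1: jump to the else arm
    ("add", None),     # 2
    ("br", 5),         # 3: skip the else arm
    ("sub", None),     # 4
    ("ret", None),     # 5
]
blocks = basic_blocks(program)
print(blocks)
```

Even this tiny program yields four blocks of one or two instructions each, which is the limitation the wider scopes are meant to address.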

Page 4: Compiling for IA-64

Unbalanced and Unbiased Control Flow

• Most code is not well balanced
  – Many very small blocks
  – Some very large
  – Then and else clause frequently unbalanced
      – Number of instructions
      – Pathlength
• Many branches are highly biased
  – But some are not
  – Compiler can obtain frequency information from profiling or derive heuristically

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

Page 5: Compiling for IA-64

Basic Blocks

• Basic Blocks are simple
  – No issues with executing unnecessary instructions
  – No speculation or predication support required
• But, very limited ILP
  – Short blocks offer very little opportunity for parallelism
  – Long latency code is unable to take advantage of issue bandwidth in an earlier block

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

Page 6: Compiling for IA-64

Traces

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

• Traces allow scheduling of multiple blocks together
  – Increases available ILP
  – Long latency operations can be moved up, as long as they are on the same trace
• But, unbiased branches are a problem
  – Long latency code in slightly less frequent paths can't move up
  – Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
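The single-path nature of a trace can be sketched with a simple greedy selector: seed a trace at the hottest unvisited block and keep following the most frequent outgoing edge. The CFG shape, block names, and frequencies below are hypothetical, and this forward-only heuristic is an assumption for illustration rather than the algorithm used by any particular compiler.

```python
def select_traces(succs, block_freq):
    """Greedily partition a CFG into traces.

    succs maps a block to a list of (successor, edge_frequency) pairs;
    block_freq maps a block to its execution frequency.
    """
    visited = set()
    traces = []
    # Seeds are tried from hottest to coldest block (ties keep dict order).
    for seed in sorted(block_freq, key=block_freq.get, reverse=True):
        if seed in visited:
            continue
        trace = [seed]
        visited.add(seed)
        cur = seed
        while True:
            # Only successors not already claimed by a trace are candidates.
            candidates = [(f, s) for s, f in succs.get(cur, []) if s not in visited]
            if not candidates:
                break
            _, nxt = max(candidates)   # follow the most frequent edge
            trace.append(nxt)
            visited.add(nxt)
            cur = nxt
        traces.append(trace)
    return traces


# Hypothetical diamond with a weakly biased 60/40 branch.
succs = {
    "A": [("B", 60), ("C", 40)],
    "B": [("D", 60)],
    "C": [("D", 40)],
}
block_freq = {"A": 100, "B": 60, "C": 40, "D": 100}
traces = select_traces(succs, block_freq)
print(traces)
```

In this example block C runs 40% of the time yet ends up alone in its own trace, so any long latency code in C cannot be hoisted into A. That is the unbiased-branch problem the slide describes.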

Page 7: Compiling for IA-64

Superblocks and Hyperblocks

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

• Superblocks and Hyperblocks allow inclusion of multiple important paths
  – Long latency code may migrate up from multiple paths
  – Hyperblocks may be fully predicated
  – More effective utilization of issue bandwidth
• But, requires code duplication
• Wholesale predication may lengthen important paths

Page 8: Compiling for IA-64

Composite Regions

• Allow rejoin from non-Region code
  – Wholesale code duplication is not required
  – Support full code motion across region
  – Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
  – Compensation code, as needed

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

Page 9: Compiling for IA-64

Predication Approaches

• Full Predication of entire Region
  – Penalizes short paths

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

Page 10: Compiling for IA-64

On-Demand Predication

• Predicate (and Speculate) as needed
  – reduce critical path(s)
  – fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths

[Figure: control-flow graph with frequency labels 60, 40, 55, 5, 0]

Page 11: Compiling for IA-64

Predicate Analysis

• Instruction scheduler requires knowledge of predicate relationships
  – For dependence analysis
  – For code motion
  – …
• Predicate Query System
  – Graphical representation of predicate relationships
  – Superset, subset, disjoint, …
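The kinds of questions a predicate query system answers can be sketched as follows. The slide describes a graphical representation; this sketch swaps in a simpler model, where each predicate is the set of control-flow paths on which it is true, purely for illustration. The class name, method names, and path labels are all assumptions.

```python
class PredicateQuery:
    """Toy predicate query system: a predicate is the set of paths where it holds."""

    def __init__(self):
        self._paths = {}

    def define(self, name, paths):
        self._paths[name] = frozenset(paths)

    def disjoint(self, a, b):
        """True if a and b can never both be true."""
        return self._paths[a].isdisjoint(self._paths[b])

    def subset(self, a, b):
        """True if b is true whenever a is true."""
        return self._paths[a] <= self._paths[b]

    def superset(self, a, b):
        """True if a is true whenever b is true."""
        return self._paths[a] >= self._paths[b]


# Hypothetical if/else: p0 is always true, p1 guards "then", p2 guards "else".
pqs = PredicateQuery()
pqs.define("p0", {"then", "else"})
pqs.define("p1", {"then"})
pqs.define("p2", {"else"})

print(pqs.disjoint("p1", "p2"))   # instructions under p1 and p2 never both execute
print(pqs.subset("p1", "p0"))
```

A scheduler could use `disjoint` to drop a dependence between two instructions that write the same register under complementary predicates.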

Page 12: Compiling for IA-64

Predicate Computation

• Compute all predicates possibly needed
• Optimize
  – to share predicates where possible
  – to utilize parallel compares
  – to fully utilize dual-targets

Page 13: Compiling for IA-64

Predication and Branch Counts

• Predication reduces branches
  – at both moderate and aggressive optimization levels

[Figure: bar chart "Normalized Dynamic Branch Counts" per benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Page 14: Compiling for IA-64

Predication & Branch Prediction

• Comparable misprediction rate with predication
  – despite significantly fewer branches
  – increased mean time between mispredicted branches

[Figure: bar chart "Normalized Mispredict Rates" per benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Page 15: Compiling for IA-64

Register Allocation

• Modeled as a graph-coloring problem.
  – Nodes in the graph represent live ranges of variables
  – Edges represent a temporal overlap of the live ranges
  – Nodes sharing an edge must be assigned different colors (registers)

x = ...
y = ...
  = ... x
z = ...
  = ... y
  = ... z

[Figure: live ranges and interference graph for x, y, z]

Requires Two Colors
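The graph-coloring formulation can be sketched in a few lines of Python. The live ranges below are read off the six-instruction example on this slide (definition index to last-use index); the interval encoding and the greedy, highest-degree-first coloring order are assumptions for illustration. Greedy coloring is a heuristic and does not always find the minimum number of colors.

```python
def build_interference(ranges):
    """ranges maps a variable to (def_index, last_use_index).

    Two variables interfere when their live ranges overlap in time.
    """
    names = list(ranges)
    graph = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = ranges[a], ranges[b]
            if s1 < e2 and s2 < e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_color(graph):
    """Assign each node the lowest color not used by an already-colored neighbor."""
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors


# x = ...(0)  y = ...(1)  = x(2)  z = ...(3)  = y(4)  = z(5)
ranges = {"x": (0, 2), "y": (1, 4), "z": (3, 5)}
graph = build_interference(ranges)
colors = greedy_color(graph)
print(graph)
print(colors)
```

Here x and z never overlap, so they can share a register, and the example needs two colors.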

Page 16: Compiling for IA-64

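Register Allocation

With Control Flow

[Figure: control-flow diagram with code fragments "x = ...; y = ...", "z = ...; = ... z", "= ... y; x = ...", and "= ... x", plus live ranges and interference graph for x, y, z]

Requires Two Colors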

Page 17: Compiling for IA-64

Register Allocation

With Predication

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

[Figure: live ranges and interference graph for x, y, z]

Now Requires Three Colors

Page 18: Compiling for IA-64

Predicate Analysis

[Figure: predicates p0, p1, p2 and the live ranges of x, y, z over the predicated code]

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

p1 and p2 are disjoint: if p1 is TRUE, p2 is FALSE, and vice versa.

Page 19: Compiling for IA-64

Register Allocation

With Predicate Analysis

x = ...
y = ...
z = ...
  = ... y
x = ...
  = ... z
  = ... x

[Figure: live ranges and interference graph for x, y, z]

Now Back to Two Colors
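A predicate-aware interference test can be sketched by tagging each live range with the predicate under which its value is needed and ignoring overlaps between disjoint predicates. The slide does not say which predicate guards which instruction, so the assignment below (z under p1, y under p2, x under p0) is an assumption, as are the function names and the greedy coloring.

```python
def interferes(a, b, disjoint_pairs):
    """a and b are (start, end, predicate) live ranges."""
    (s1, e1, p1), (s2, e2, p2) = a, b
    overlap = s1 < e2 and s2 < e1
    never_both = frozenset((p1, p2)) in disjoint_pairs
    return overlap and not never_both


def count_colors(ranges, disjoint_pairs):
    """Build the interference graph, color it greedily, return colors used."""
    names = list(ranges)
    graph = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if interferes(ranges[a], ranges[b], disjoint_pairs):
                graph[a].add(b)
                graph[b].add(a)
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return len(set(colors.values()))


# Predicated sequence: x=(0) y=(1) z=(2) =y(3) x=(4) =z(5) =x(6)
ranges = {
    "x": (0, 6, "p0"),   # needed on every path
    "y": (1, 3, "p2"),   # assumed: only used under p2
    "z": (2, 5, "p1"),   # assumed: only defined and used under p1
}
without_analysis = count_colors(ranges, set())
with_analysis = count_colors(ranges, {frozenset(("p1", "p2"))})
print(without_analysis, with_analysis)
```

Without the disjointness fact, y and z appear to overlap and the graph is a triangle; with it, the y–z edge disappears and two registers suffice.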

Page 20: Compiling for IA-64

Effect of Predicate-Aware Register Allocation

• Reduces register requirements for individual procedures by 0% to 75%
  – Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%

Page 21: Compiling for IA-64

Object-Oriented Code

• Challenges
  – Small procedures, many indirect (virtual)
      – Limits size of regions, scope for ILP
  – Exception Handling
  – Bounds Checking (Java)
      – Inherently serial: must check before executing load or store
• Solutions
  – Inlining
      – for non-virtual functions or provably unique virtual functions
      – Speculative inlining for most common variant
  – Liveness analysis of handlers
      – Architectural support for speculation ensures recoverability
  – Speculative execution
      – Guarantees correct exception behavior
  – Dynamic optimization (e.g. Java)
      – Make use of dynamic profile

Page 22: Compiling for IA-64

Method Calls

• Barrier between execution streams
• Often, location of called method must be determined at runtime
  – Costly "identity check" on object must complete before method may begin
  – Even if the call nearly always goes to the same place
  – Little ILP

[Figure: "Resolve target method" leads to one of several "Possible target" methods, followed by "Call-dependent code"]

Page 23: Compiling for IA-64

Speculating Across Method Calls

• Compiler predicts target method
  – Profiling
  – Current state of class hierarchy
• Predicted method is inlined
  – Full or partial
• Speculative execution of called method begins while actual target is determined
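The shape of the transformed code can be sketched at source level: the predicted method body is inlined behind a cheap type guard, and every other receiver falls back to the normal virtual call. The class names and the choice of `Square` as the dominant target are hypothetical. Python also cannot show the overlap itself; on IA-64 the inlined body would start executing speculatively while the guard is still being resolved.

```python
class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect(Shape):
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h


def total_area_dispatch(shapes):
    """Baseline: every call resolves its target at run time."""
    return sum(s.area() for s in shapes)


def total_area_speculative(shapes):
    """Profile (assumed) says Square.area dominates, so its body is inlined."""
    total = 0
    for s in shapes:
        if type(s) is Square:            # guard: was the prediction right?
            total += s.side * s.side     # inlined body of Square.area
        else:
            total += s.area()            # call other method if needed
    return total


shapes = [Square(2), Square(3), Rect(2, 3)]
print(total_area_dispatch(shapes), total_area_speculative(shapes))
```

Both versions must compute the same result; the guard only changes which path is fast.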

Page 24: Compiling for IA-64

Speculation Across Method Calls

[Figure: two call diagrams. Labels in the first: "Resolve target method", "call method", "Dominant called method", "Other target method" (twice). Labels in the second: "Dominant called method", "Resolve target method", "call other method if needed", "Other target method" (twice)]

Page 25: Compiling for IA-64

Bounds & Null Checks

• Checks inhibit code motion
• Null checks

    x = y.foo;

  is compiled as

    if( y == null ) throw NullPointerException;
    x = y.foo;

• Bounds checks

    x = a[i];

  is compiled as

    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length)
        throw ArrayIndexOutOfBoundsException;
    x = a[i];

Page 26: Compiling for IA-64

Speculating Across Bounds Checks

• Bounds checks rarely fail

    x = a[i];

  can be compiled as

    ld.s t = a[i];
    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length)
        throw ArrayIndexOutOfBoundsException;
    chk.s t
    x = t;

• Long latency load can begin before checks

Page 27: Compiling for IA-64

Exception Handling

• Exception handling inhibits motion of subsequent code

    if( y.foo ) throw MyException;
    x = y.bar + z.baz;

Page 28: Compiling for IA-64

Speculation in the Presence of Exception Handling

• Execution of subsequent instructions may begin before exception is resolved

    if( y.foo ) throw MyException;
    x = y.bar + z.baz;

  can be scheduled as

    ld    t1 = y.foo
    ld.s  t2 = y.bar
    ld.s  t3 = z.baz
    add   x = t2 + t3
    if( t1 ) throw MyException;
    chk.s x
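The behavior of `ld.s` and `chk.s` in this schedule can be mimicked with a small Python simulation. A speculative load that would fault returns a deferred-exception token instead (IA-64 calls it NaT, "Not a Thing"), the token propagates through arithmetic, and the check re-executes the work non-speculatively only if the token reaches it. The helper names, the dictionary-based objects, and the recovery lambda are assumptions made for this sketch.

```python
class MyException(Exception):
    pass


NAT = object()   # stands in for the NaT bit on a register


def ld_s(obj, field):
    """Speculative load: defer a fault instead of raising it."""
    if obj is None:
        return NAT
    return obj[field]


def add(a, b):
    """Arithmetic propagates a deferred exception to its result."""
    return NAT if a is NAT or b is NAT else a + b


def chk_s(value, recovery):
    """If speculation failed, run recovery code non-speculatively."""
    return recovery() if value is NAT else value


def compute(y, z):
    t1 = y["foo"]              # ld: ordinary, non-speculative load
    t2 = ld_s(y, "bar")        # ld.s
    t3 = ld_s(z, "baz")        # ld.s, hoisted above the throw
    x = add(t2, t3)
    if t1:
        raise MyException()
    # Recovery redoes the loads for real, so a genuine fault surfaces here.
    return chk_s(x, lambda: y["bar"] + z["baz"])


print(compute({"foo": False, "bar": 2}, {"baz": 3}))
```

When `y.foo` is true and `z` is null, the hoisted load of `z.baz` does not fault early: `MyException` is raised exactly as in the original program order. When `y.foo` is false and `z` is null, the fault appears at the check, in Python as a `TypeError` from the recovery code.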

Page 29: Compiling for IA-64

Dependence Graph for Instruction Scheduling

add t1 = 8,p

(p1) ld4 t2 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

if( n < p->count ) {
    (*log)++;
    return p->x[n];
} else {
    return 0;
}

Page 30: Compiling for IA-64

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t2 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

chk.a t4

chk.a p

• During dependence graph construction, potentially control and data speculative edges and nodes are identified

• Check nodes are added where possibly needed (note that only data speculation checks are shown here)

Page 31: Compiling for IA-64

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t2 = [log]

(p1) add t2 = 1,t2

(p2) mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

chk.a t4

chk.a p

• Speculative edges may be violated. Here the graph is re-drawn to show the enhanced parallelism

• Note that the speculation of both writes to the out0 register would require insertion of a copy. The scheduler must consider this in its scheduling

• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
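Slack can be made concrete with a short calculation over a dependence graph: a node's slack is the difference between the latest and earliest cycle at which it can start without stretching the critical path. The node names, edges, and latencies below are hypothetical (loosely modeled on the example above), and the sketch relies on `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def slack(latency, deps):
    """Return the slack of each node in a dependence DAG.

    latency maps node -> cycles; deps maps node -> set of predecessors.
    """
    order = list(TopologicalSorter(deps).static_order())

    # Earliest start: longest latency-weighted path from any source.
    earliest = {}
    for n in order:
        earliest[n] = max(
            (earliest[p] + latency[p] for p in deps.get(n, ())), default=0
        )
    length = max(earliest[n] + latency[n] for n in order)

    succs = {n: [] for n in order}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)

    # Latest start: must finish before the latest start of every successor.
    latest = {}
    for n in reversed(order):
        finish_by = min((latest[s] for s in succs[n]), default=length)
        latest[n] = finish_by - latency[n]

    return {n: latest[n] - earliest[n] for n in order}


# Hypothetical latencies in cycles.
latency = {"ld_count": 2, "cmp": 1, "ld_x": 2, "mov": 1, "ret": 1}
deps = {
    "cmp": {"ld_count"},
    "ld_x": {"cmp"},
    "mov": {"cmp"},
    "ret": {"ld_x", "mov"},
}
result = slack(latency, deps)
print(result)
```

In this example only `mov` has nonzero slack: it can start a cycle late without delaying `ret`, so speculating it above the compare would gain nothing, while the zero-slack load on the critical path is the natural candidate.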

Page 32: Compiling for IA-64

Conclusions

• IA-64 compilers push the complexity of the compiler
  – However, the technology is a logical progression from today's
  – Today's RISC compilers
      – are more complex
      – are more reliable
      – and deliver more performance
    than those of the early days
  – Complexity trend is mirrored in both hardware and applications
  – Need a balance to maximize benefits from each