
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Raj Parihar

Advisor: Prof. Michael C. Huang

Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY


Motivation

Despite the proliferation of multi-core, multi-threaded systems

High single-thread performance is still an important CPU design goal

Modern programs do not lack instruction level parallelism

[Figure: IPC (1 to 50) of bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean for the ideal:128, ideal:512, ideal:2K, real:128, real:512, and real:2K configurations]

Real challenge: exploit implicit parallelism without undue cost

One effective approach: Decoupled look-ahead architecture


Motivation

Decoupled look-ahead architecture targets

Performance hurdles: branch mispredictions, cache misses, etc.
Exploration of parallelization opportunities, dependence information
Microarchitectural complexity, energy inefficiency through decoupling

The look-ahead thread can often become a new bottleneck

Lack of correctness constraint allows many optimizations

Weak dependence: instructions that contribute marginally to the outcome can be removed without affecting the quality of look-ahead
Do-It-Yourself branches: side-effect-free, "easy-to-predict" branches can be skipped in the look-ahead thread


Outline

Motivation

Baseline decoupled look-ahead

Look-ahead: a new bottleneck

Look-ahead thread acceleration

Weak dependences/instructions

Do-It-Yourself branches & skeleton tuning

Experimental analysis

Additional insights and summary


Baseline Decoupled Look-ahead Architecture

Skeleton generated just for the look-ahead purposes
The skeleton runs on a separate core and:
Speculative state is completely contained within the look-ahead context
Sends branch outcomes through a FIFO queue; also helps prefetching

[Figure: baseline decoupled look-ahead. The look-ahead core (with its own L0$) executes the look-ahead skeleton; the main core (with its L1$) executes the program binary; both share the L2$ and main memory. The look-ahead core feeds branch predictions to the main core through a branch queue, provides prefetching hints, and synchronizes register state with it.]

Program binary:
  addq v0, v0, v0
  subq v0, t0, a2
  cmovge a2, a2, v0
  addq v0, v0, v0
  subq v0, t0, a2
  cmovge a2, a2, v0
  subq a1, 0x2, a1
  addq v0, v0, v0
  bgt a1, 0x12001f9a0
  subq v0, t0, a2

Skeleton:
  addq v0, v0, v0
  nop
  ...
  ...
  bgt a1, 0x12001f9a0
  subq v0, t0, a2

A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO-08
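To make the skeleton idea concrete, here is a minimal sketch, assuming (as the example above suggests) that the skeleton keeps only the instructions that feed branch outcomes and memory addresses and turns the rest into nops. It illustrates the concept only; it is not the ALTO-based tool flow used in the thesis, and the toy instruction format is invented for the example.

```python
# Minimal sketch of skeleton construction (illustration only, not the thesis's
# ALTO-based tool): keep branches, memory operations, and the instructions that
# feed them; replace everything else with a nop so the code layout is preserved.
from collections import namedtuple

Inst = namedtuple("Inst", "op dst srcs")    # toy instruction format (assumed)

def build_skeleton(code):
    keep = set()
    needed = set()                           # registers whose producers must stay
    for i in range(len(code) - 1, -1, -1):   # walk backwards to follow the slice
        inst = code[i]
        is_branch = inst.op.startswith("b")           # e.g. bgt, beq
        is_memory = inst.op in ("ldq", "stq")         # loads/stores drive prefetching
        if is_branch or is_memory or inst.dst in needed:
            keep.add(i)
            needed.discard(inst.dst)
            needed.update(inst.srcs)
    return [inst if i in keep else Inst("nop", None, ())
            for i, inst in enumerate(code)]

# Toy fragment loosely modeled on the slide's Alpha code.
program = [
    Inst("addq", "v0", ("v0", "v0")),
    Inst("subq", "v0", ("t0", "a2")),
    Inst("cmovge", "a2", ("a2", "v0")),
    Inst("subq", "a1", ("a1",)),     # loop-counter update feeds the branch
    Inst("bgt", None, ("a1",)),      # branch outcome must reach the main core
]

for inst in build_skeleton(program):
    print(inst.op, inst.dst, inst.srcs)
```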


Look-ahead: A New Bottleneck

Comparing four systems to discover new bottlenecks

Single-thread, decoupled look-ahead, ideal, and look-ahead limit

Application categories:
Bottleneck removed, or speed of look-ahead is not an issue (left half)
Look-ahead thread is the new bottleneck (right half)

[Figure: IPC (0 to 4) of aplu, msa, wup, mgri, six, swim, facr, gal, gcc, gap, eon, fma3, gzip, craf, vrtx, apsi, vpr, bzp2, equk, amp, luc, art, perl, mcf, and two under the look-ahead limit, single-thread, decoupled look-ahead, and ideal (cache, branch) configurations]


Weak Dependences/Instructions

Not all instructions are equally important and critical

Example of weak instructions:

Inconsequential adjustments
Load and store instructions that are (mostly) silent
Dynamic NOP instructions

Plenty of weak instructions are present in programs (hundreds of them)

Weak instructions can be experimentally defined and their impact quantified in isolation
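A minimal sketch of what "experimentally defined in isolation" could look like in practice; simulate_ipc is a hypothetical stand-in for running the decoupled look-ahead pair in the cycle-level simulator with the listed skeleton instructions removed, and the example PCs and numbers are made up.

```python
# Minimal sketch of quantifying weakness in isolation. simulate_ipc() is a
# hypothetical stand-in for running the decoupled look-ahead pair with the
# given static instructions removed from the skeleton and reporting IPC.

def find_weak_in_isolation(candidates, simulate_ipc, threshold=0.0):
    """Return (pc, gain) pairs whose individual removal does not hurt look-ahead."""
    base_ipc = simulate_ipc(removed=frozenset())
    weak = []
    for pc in candidates:
        ipc = simulate_ipc(removed=frozenset([pc]))
        gain = (ipc - base_ipc) / base_ipc
        if gain >= threshold:          # harmless (or helpful) when removed alone
            weak.append((pc, gain))
    return sorted(weak, key=lambda x: -x[1])

# Toy stand-in with made-up PCs: two removable instructions, one costly one.
def fake_simulate_ipc(removed):
    effect = {0x12001f9a8: 0.02, 0x12001f9b0: 0.01, 0x12001f9c0: -0.06}
    return 1.50 * (1.0 + sum(effect.get(pc, 0.0) for pc in removed))

print(find_weak_in_isolation([0x12001f9a8, 0x12001f9b0, 0x12001f9c0],
                             fake_simulate_ipc))
```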


Challenge # 1: Weak insts do not look different

After-the-fact analysis based on static attributes of instructions reveals:

Static attributes of weak and regular instructions are remarkably similar
Correlation coefficient of the two distributions is very high (0.96)

Weakness has very poor correlation with static attributes

Hard to identify the weak instructions through static heuristics

[Figure: number of inputs (0 to 2) by instruction type (addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, muls/mult, s4addq, sll, stq, subq, zapnot/zap) for weak instructions vs. strong instructions]


Challenge # 2: False positives are extremely costly

After-the-fact analysis and close inspection also reveal:

Some instructions are more likely to be weak than others
Even then, a single false positive can negate all the gains

Case in point: zapnot in gap

zapnot Ra Rb Rc

84% of the zapnot instructions are weak in isolation: 3.4% speedup
A single false-positive zapnot instruction: 6% slowdown
More than one false positive can cause up to a 13% slowdown


Challenge # 3: Neither absolute nor additive

Weakness is context dependent and non-linear, much like Jenga
All weak instructions combined together are not weak!

Example: weak instruction combining in perlbmk
About 300 weak instructions when tested in isolation
All combined together can result in up to a 40% slowdown

[Figure: performance impact over baseline look-ahead (-40% to +20%) vs. cumulative number of weak instructions removed (0 to 300) for perlbmk]


Metaheuristic-Based Trial-and-Error Approach

Recap: Challenges in identifying weak instructions

Weak instructions look very similar to regular instructions
False positives are extremely costly and can negate all the gains
Weakness is context dependent: neither absolute nor additive

Our approach: metaheuristic-based self-tuning
Experimentally identify/verify weakness
Search for profitable combinations via a metaheuristic

Metaheuristic: completely agnostic of the meaning of a solution
Derive new solutions from current solutions through modifications
Examples: genetic algorithms, simulated annealing, etc.

R. Parihar, M. Huang, “Accelerating Decoupled Look-ahead via Weak Dependence Removal”, HPCA’14


Genetic Algorithm based Framework

The problem naturally maps to a genetic algorithm

The skeleton is represented by a bit vector
Natural mapping: weak instruction → gene, collection → chromosome
Objective: find the optimal combination (chromosome)

Genetic evolution: Procreation, mutation, fitness-based selection

[Figure: GA-based framework. A binary parser extracts single-instruction genes from the program binary; chromosome creation builds single-gene chromosomes, multi-instruction genes, superposition chromosomes, and orthogonal chromosomes to seed the initial chromosome population; GA evolution then iterates roulette-wheel parent selection from the parents pool, reproduction with crossover and mutation, de-duplication, and fitness tests with elitism over the children pool; the result drives construction of the look-ahead (skeleton) binary.]
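A minimal, self-contained sketch of the evolution loop in the figure: chromosomes are bit vectors over candidate weak instructions, parents are drawn by roulette wheel, children go through crossover, mutation, and de-duplication, and elitism preserves the best skeletons found so far. The fitness function, population size, and rates below are illustrative stand-ins; a real fitness test runs the candidate skeleton and measures speedup.

```python
# Minimal GA sketch of the slide's mapping: gene = candidate weak instruction,
# chromosome = bit vector over candidates (1 = remove from the skeleton).
# fitness() is a hypothetical stand-in for running the resulting skeleton and
# reporting speedup over the baseline look-ahead.
import random

N_GENES, POP, GENERATIONS, ELITE = 32, 20, 7, 2
random.seed(1)

def fitness(chrom):
    # Stand-in: reward removing "even" genes, punish removing "odd" ones,
    # mimicking the fact that false positives are costly.
    return max(0.01, 1.0 + sum(0.01 if i % 2 == 0 else -0.02
                               for i, bit in enumerate(chrom) if bit))

def roulette(pop, scores):
    pick, acc = random.uniform(0, sum(scores)), 0.0
    for chrom, s in zip(pop, scores):
        acc += s
        if acc >= pick:
            return chrom
    return pop[-1]

def crossover(a, b):
    cut = random.randrange(1, N_GENES)
    return a[:cut] + b[cut:]

def mutate(chrom, rate=1.0 / N_GENES):
    return [bit ^ (random.random() < rate) for bit in chrom]

# Initial population: single-gene chromosomes (one candidate removed each).
population = [[1 if i == g else 0 for i in range(N_GENES)] for g in range(POP)]

for gen in range(GENERATIONS):
    scores = [fitness(c) for c in population]
    elite = sorted(population, key=fitness, reverse=True)[:ELITE]  # elitism
    children, seen = list(elite), {tuple(c) for c in elite}
    while len(children) < POP:
        child = mutate(crossover(roulette(population, scores),
                                 roulette(population, scores)))
        if tuple(child) not in seen:            # de-duplication
            seen.add(tuple(child))
            children.append(child)
    population = children
    print(f"gen {gen + 1}: best speedup {max(map(fitness, population)):.3f}x")
```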


Speedup of Weak Dependence Removal

Applications in which the look-ahead thread is a bottleneck

Self-tuned, genetic algorithm based decoupled look-ahead

Speedup over baseline decoupled look-ahead: 1.11x (geomean)
Overall speedup over single-thread baseline: 1.48x

[Figure: speedup over single-thread (1 to 6) for craf, eon, gap, gzip, mcf, pbmk, two, vrtx, vpr, amp, art, eqk, fma3, luc, and Gmean, comparing baseline look-ahead with GA-based look-ahead]


Progress of Genetic Evolution Process

Per-generation progress compared to the final best solution
After 2 generations, more than half of the benefits are achieved
After 5 generations, significant performance benefits are achieved

GA evolution, helped by hybridization, shows good progress

[Figure: progress relative to the best GA solution (0% to 100%) over generations 1-7 for eon, mcf, pbmk, twolf, vpr, art, eqk, fma, amp, and lucas]


Evolution can be Online or Offline

Offline evolution: one-time tuning (e.g., at install time)

Fitness tests need not take long (2-20 s on the target machine)
Different inputs and configurations do not invalidate the result

Online evolution: takes longer but has little overhead

Additional work is minimal: bookkeeping, bit-vector manipulation
Main source of slowdown: testing bad configurations

[Figure: accumulated IPC (1 to 3) vs. number of instructions executed (in millions, up to about 4700) for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead]


A Locomotive and Cargo Analogy

Skeleton payload: look-ahead tasks and associated housekeeping
Locomotive: look-ahead thread; cargo: skeleton payload

Dilemma: heavy cargo (slower locomotive) vs. lighter cargo (underutilization of the locomotive's capability)

[Figure: locomotive-and-cargo illustration of the skeleton payload, with L1 prefetches and L2 prefetches as the cargo]


Idea of Do-It-Yourself (DIY) Branches

Extends the idea of weak instructions to easy-to-predict branches

To accelerate the look-ahead thread, DIY branches are either skipped completely or partially executed in the skeleton

[Figure: DIY branch transformations. (A) Forward conditional branches (if-then, if-then-else): (1) DIY [C], (2) DIY [BR -> A -> C], (3) DIY [BR -> B -> C], with ZAP/LEFT/FALL/RIGHT variants. (B) Backward conditional branches (loops): (4) DIY [A -> C], (5) DIY [A -> B -> BR -> C].]

Tune skeleton via selectively including/excluding prefetches

R. Parihar, M. Huang, “Load Balancing in Decoupled Look-ahead via DIY Branches and Payload Tuning”, (in draft)


Hardware Support for DIY Branches

Hardware support needed to synchronize after DIY regions

Additional BOQ bit to indicate the beginning of a DIY region
Main thread has its own branch predictor for DIY regions
DIY call-depth register to keep track of nesting/recursion

[Figure: the look-ahead thread executes the skeleton (entries such as i1: add [1, 0], i2: call [1, 1, 25], i3: ldq [1, 2], i4: stq [1, 0], encoded as [mask, diy, duty]) and sends direction + DIY info through the branch queue; the main thread adds a DIY mode bit, a DIY call-depth register, and its own branch predictor for DIY regions]
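A behavioral sketch of one plausible reading of this mechanism (the queue format, field names, and region-end condition are assumptions for illustration, not the exact hardware encoding): branch-queue entries carry a direction plus a DIY flag; when the main core sees a flagged entry it stops consuming the queue, predicts with its own predictor, and uses a call-depth counter to detect the end of the DIY region before resynchronizing.

```python
# Behavioral sketch (one plausible reading of the slide, not the exact
# hardware): the look-ahead core enqueues (direction, diy) pairs; a diy=1 entry
# marks a region the skeleton skipped, so the main core predicts those branches
# itself and uses a call-depth counter to find the end of the region.
from collections import deque

class MainCore:
    def __init__(self, branch_queue, own_predictor):
        self.boq = branch_queue          # FIFO filled by the look-ahead core
        self.predict = own_predictor     # main core's local branch predictor
        self.diy_mode = False
        self.diy_depth = 0               # DIY call-depth register

    def next_branch(self, pc, is_call=False, is_return=False):
        if self.diy_mode:
            # Inside a DIY region: do not consume the queue, predict locally.
            if is_call:
                self.diy_depth += 1
            elif is_return:
                self.diy_depth -= 1
                if self.diy_depth == 0:      # region closed, resynchronize
                    self.diy_mode = False
            return self.predict(pc)
        direction, diy = self.boq.popleft()
        if diy:                              # BOQ bit: DIY region starts here
            self.diy_mode, self.diy_depth = True, 1
            return self.predict(pc)
        return direction                     # normal case: use look-ahead outcome

# Tiny usage example with a trivial always-taken local predictor.
core = MainCore(deque([(1, 0), (0, 1), (1, 0)]), own_predictor=lambda pc: 1)
print(core.next_branch(0x100))                      # from queue
print(core.next_branch(0x104, is_call=True))        # enters DIY region
print(core.next_branch(0x108))                      # predicted locally
print(core.next_branch(0x10c, is_return=True))      # closes DIY region
print(core.next_branch(0x110))                      # back to the queue
```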


Experimental Setup

Program/binary analysis tool: ALTO

Simulator: detailed out-of-order, cycle-level, in-house

SMT, look-ahead, and speculative parallelization support
True execution-driven simulation (faithful value modeling)

Genetic algorithm framework

Modeled as offline and online extensions to the simulator

Microarchitectural configurations:

Baseline core (similar to POWER5)
Fetch/Decode/Issue/Commit: 8 / 4 / 6 / 6
ROB: 128
Functional units: INT 2 + 1 mul + 1 div; FP 2 + 1 mul + 1 div
Fetch Q / Issue Q / Reg. (int, fp): (32, 32) / (32, 32) / (80, 80)
LSQ (LQ, SQ): 64 (32, 32), 2 search ports
Branch predictor: Gshare, 8K entries, 13-bit history
Branch misprediction penalty: at least 7 cycles
L1 data cache (private): 32 KB, 4-way, 64 B line, 2 cycles, 2 ports
L1 instruction cache (private): 64 KB, 2-way, 128 B line, 2 cycles
L2 cache (shared): 1 MB, 8-way, 128 B line, 15 cycles
Memory access latency: 200 cycles

Look-ahead core: baseline core with only an LQ, no SQ
L0 cache: 32 KB, 4-way, 64 B line, 2 cycles
Round-trip latency to L1: 6 cycles

Communication:
Branch Output Queue: 512 entries
Register copy latency (recovery): 64 cycles

Table 1: Microarchitectural configurations.


Individual Performance Gains

Speedup of DIY branches over baseline look-ahead: 1.08x

Speedup of Skeleton Payload Tuning: 1.12x

Combined speedup (DIY + Payload Tuning): 1.15x

[Figure: speedup over single-thread (1.0 to 2.2) for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrid, art, eqk, face, ammp, lucas, fma3d, and Gmean, comparing baseline decoupled look-ahead, DIY-branch-based look-ahead, skeleton-payload-tuned look-ahead, and DIY + skeleton-payload-tuned look-ahead; annotations mark the 15% combined gain and 16.2% for weak-instruction removal]


Overall Performance Gain

Final decoupled look-ahead system

Skeleton payload tuning + Weak dependence + DIY branches

Performance speedup over:

Baseline look-ahead: 1.20x
Single-thread: 1.61x

[Figure: speedup over baseline DLA (1.0 to 2.0) for gcc, mcf, eon, pbm, bzp, two, wup, mgri, art, eqk, face, amp, luc, fm3, and gmean, comparing weak-dependence-removed DLA with weak dep + DIY + payload-tuned DLA]


Salient Features of Decoupled Look-ahead

Hard-to-predict, data-dependent branches

Conventional predictors do not capture data-dependent behaviors

Prefetching for cold misses

Conventional prefetchers take time to learn access/miss patterns

Other potential advantages:

Can pass dependence information for thread-level speculation
Can assist in value prediction (close to 90% correctness)

Potential hurdles and showstoppers:

If no distillation is possible: the look-ahead thread can run at a higher clock
JIT: a fixed mask is an issue, but the mask can be evolved dynamically


Future Explorations

Effective look-ahead to improve L1 prefetching performance

Speeding up critical threads and serial bottlenecks via a shared look-ahead agent in multi-threaded applications

Cost-effective SMT implementation of decoupled look-ahead

Role of look-ahead in promoting parallelization, value prediction, and acceleration of interpreted programs

Backward strawman: integrate non-speculative look-ahead computations directly in the main thread


Details in the Thesis and Papers

Decoupled Look-ahead Architecture:

Weak dependence removal in decoupled look-ahead [HPCA’14]

Load balancing in look-ahead via DIY branches [PACT-SRC’15]

Speculative parallelization in decoupled look-ahead [PACT’11]

DIY branches and payload tuning [in prep. for HPCA’17]

Shared Cache Management:

Hardware support for protective and collaborative caches [ISMM’16]

Protection and utilization in shared cache via rationing [PACT’14]

A coldness metric for cache optimization [MSPC’13]


Motivation: Cache Rationing

Compute systems with shared resources are prevalent today
Multi-core clusters, cloud computing, data centers, server farms
Programs often compete for shared caches and other resources

Significant performance loss due to co-run interference: >25%

[Figure: IPC normalized to a solo run with 512 KB L2$ (0.7 to 1.3, one bar reaching 2.32) under equal partitioning, no partitioning, rationing, and PIPP-equal; SPEC 2000 benchmarks 1-26 co-run with equake (2 cores, 1 MB L2 cache)]


Idea of Cache Rationing

Achieve both resource protection and utilization simultaneously

Rationing policy:

Initial ration: every program is assigned an initial portion of the cache
Non-intrusive sharing: a program can exceed its allocated ration only if another program is not using its ration
Entitlement: if a program is using its ration, it cannot be taken away by peer programs

Conservative sharing: provides a safety net for less aggressive programs in the presence of non-cooperative programs

R. Parihar, J. Brock, C. Ding, M. Huang, “Hardware support for protective and collaborative cache sharing”, ISMM’16


Hardware Support

Ration Accounting: ration counter-register pairs

To track the current usage of a program, maintained per core per set

Usage Tracking: access-bit and block owner

To detect unused ration and ensure entitlement; one of each per cache block

[Figure: rationing hardware organization — data array (s sets x w ways) with a per-block access bit, tag array with a per-block status bit and block owner, and a per-set ration tracker holding p ration counter-register pairs; the owner is recorded at allocation]

Additional storage overhead: <1% of total cache storage
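A rough, illustrative accounting of that overhead (field widths assumed; 1 MB, 16-way, 64 B-line L2 shared by two cores, i.e. 16384 blocks in 1024 sets):

/* Illustrative metadata layout; widths assumed, not taken from the thesis. */
struct block_meta {            /* one per cache block                       */
    unsigned access_bit : 1;   /* block touched by its owner recently?      */
    unsigned owner      : 1;   /* owning core (2 cores assumed)             */
};                             /* ~2 bits x 16384 blocks = 4 KB             */

struct set_meta {              /* p counter-register pairs per set          */
    unsigned char ration[2];   /* allocated ways per core                   */
    unsigned char usage[2];    /* ways currently held per core              */
};                             /* 4 B x 1024 sets = 4 KB                    */

/* Roughly 8 KB of extra state for a 1 MB cache, i.e. on the order of 1%
 * of the data array, broadly consistent with the "<1%" figure above.       */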


Resource Protection Co-Run

Co-run with a high-pressure peer (mcf)

Rationing: achieves good resource protection, similar to partitioning
No partitioning: almost every co-run is unhealthy, with high damage

[Figure: IPC normalized to solo run with 512 KB L2$ for SPEC 2000 programs (1-26) co-running with mcf on 2 cores sharing a 1 MB L2 cache; bars for Equal partitioning, No partitioning, Rationing, and PIPP-equal; y-axis 0.7-1.3, one bar clipped at 1.52]

Benchmark index — INT: 1-gzip 2-vpr 3-gcc 4-mcf 5-crafty 6-parser 7-eon 8-perlbmk 9-gap 10-vortex 11-bzip2 12-twolf; FP: 13-wupwise 14-swim 15-mgrid 16-applu 17-mesa 18-galgel 19-art 20-equake 21-facerec 22-ammp 23-lucas 24-fma3d 25-sixtrack 26-apsi


Capacity Utilization Co-Run

Co-run with a low-pressure peer (eon): cache demand <128 KB

Rationing: utilizes the cache well and speeds up 14 applications without slowing down any co-running program
No partitioning: also speeds up 13 applications, but at the cost of slowing down 11 co-running programs

[Figure: IPC normalized to solo run with 512 KB L2$ for SPEC 2000 programs (1-26) co-running with eon on 2 cores sharing a 1 MB L2 cache; bars for Equal partitioning, No partitioning, Rationing, and PIPP-equal; y-axis 0.8-1.4, one bar clipped at 1.744]


Summary

Decoupled look-ahead can uncover significant implicit parallelism
However, the look-ahead thread often becomes a new bottleneck

Fortunately, thanks to the lack of correctness constraints, look-ahead lends itself to various optimizations

Weak instructions can be removed w/o affecting look-ahead quality
Side-effect-free, “easy-to-predict” DIY branches can be skipped
Skeleton payload can be tuned w/o incurring extra recoveries

Metaheuristic-based self-tuning is simple and robust

Improves single-thread performance by 1.61x
Much better than conventional turbo boost and frequency scaling

Multi-threaded workloads can benefit from speeding up serial sections and bottleneck threads in critical regions


Acknowledgments

Funding agencies: NSF, NSFC

Prof. Michael C. Huang, Alok Garg

Prof. Chen Ding and his research group at URCS

Past & current members of the Advanced Computer Architecture Lab at the University of Rochester


Backup Slides


Summary of Distillation Techniques

Convert biased branches to unconditional “taken” or “not taken”
Eliminate stores from long-distance store-load pairs

Stores would have been committed by main thread

Selective value (zero) substitution for the L2 misses

If the look-ahead distance drops below a threshold

Speculative parallelization of skeleton w/o any rollback support

Weak dependence/instruction removal from skeleton

Consecutive loop iterations accessing the same cache line

A wide variety of library calls and reduction operations

Selective payload: eliminate payload if it slows look-ahead
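As one concrete illustration of the first technique listed above, a profile pass might convert a branch only when its bias clears a threshold; this is a hypothetical sketch, not the thesis tool chain:

/* Hypothetical skeleton-construction helper: decide, from edge profiles,
 * whether a conditional branch can become unconditional in the skeleton. */
typedef enum { KEEP, FORCE_TAKEN, FORCE_NOT_TAKEN } BrAction;

BrAction classify_branch(unsigned long taken, unsigned long not_taken,
                         double bias_threshold /* e.g. 0.99, assumed */)
{
    unsigned long total = taken + not_taken;
    if (total == 0)
        return KEEP;                    /* never executed in the profile   */
    {
        double bias = (double)taken / (double)total;
        if (bias >= bias_threshold)
            return FORCE_TAKEN;         /* rewrite as an unconditional jump */
        if (1.0 - bias >= bias_threshold)
            return FORCE_NOT_TAKEN;     /* drop the branch entirely         */
    }
    return KEEP;                        /* still needed for look-ahead quality */
}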


Practical Advantages of Decoupled Look-ahead

Micro helper thread based approach:
Targets top cache misses and branch mispredictions (low coverage)
Requires support for quick spawning and register communication (not trivial)

Decoupled look-ahead approach:
Easy to disable, low management overhead on the main thread
Natural throttling to prevent run-away prefetching and cache pollution

[Figure: speedup over single-thread for SPEC 2000 benchmarks (gzp ... apsi, Gmean); y-axis 1.0-2.4; series: Speculative slice limit (ideal) and Decoupled look-ahead, with one bar clipped at 4.0]


Practical Advantages of Decoupled Look-ahead

Look-ahead thread is a self-reliant agent, completely independent of the main thread

No need for quick spawning and register communication support
Low management overhead on the main thread
Easier for run-time control to disable

Natural throttling mechanism to prevent run-away prefetching and cache pollution

Look-ahead thread size is comparable to the aggregation of short helper threads

Cache misses:
            90%             95%
          DI      SI      DI      SI
bzip2    1.86     17     3.15     27
crafty   0.73     23     1.04     38
eon      2.28     50     3.34    159
gap      1.35     15     1.44     23
gcc      8.49    153     8.84    320
gzip     0.1       6     0.1       6
mcf     13.1      13    14.7      16
parser   1.31     41     1.59     57
pbmk     1.87     35     2.11     52
twolf    2.69     23     3.28     28
vortex   1.96     42     2        67
vpr      7.47     16    11.6      22
Avg      3.60%    36     4.44%    68

Branch mispredictions:
            90%             95%
          DI      SI      DI      SI
bzip2    3.9      52     4.49     64
crafty   5.33    235     6.14    309
eon      2.02     19     2.31     23
gap      2.02     77     2.64    130
gcc      8.08   1103     8.41   1700
gzip     8.41     40     8.66     52
mcf      9.99     14    10.2      18
parser   6.81    130     7.3     183
pbmk     2.88     92     3.21    127
twolf    5.75     41     6.48     56
vortex   1.24    114     1.97    167
vpr      4.8       6     4.88      7
Avg      5.10%   160     5.56%   236


Correlation with RTL/FPGA Accurate Simulator

Reported performance improvement results are very pessimistic

Optimistic branch misprediction latency: 7 vs. 15 cycles
Fixed memory latency, no queuing delays in the L1/L2 interfaces

RTL accurate simulator: shows 2x more performance potential

[Figure: speedup over single-thread (1.0-2.2) under SimpleScalar vs. IMG-psim for Perfect BP, Perfect L2, Perfect L2+BP, Perfect L1, Perfect L1+BP, and DLA; DLA projected at 1.56x]


Simplified Look-ahead Core

Baseline skeleton: 71%; after distillation: 57%
2-wide look-ahead core (front end is still 8-wide):
2x power savings for the RAT and other traditional hotspots
Reduces the overall power overhead of the look-ahead system by 10%

[Figure: power of look-ahead core components (Rename, ROB, Int IQ, Fp IQ, Decode, RAT-dec, RAT-wl, RAT-bl, DCL-cmp, LSQ, Total), normalized to a 4-wide DLA core (0-100%), for 3-wide and 2-wide look-ahead cores]


Baseline vs. Tuned Skeleton

Distilled skeleton enables simplification of the look-ahead core
Better power and energy efficiency w/o compromising speed

Energy efficiency: 17% better compared to single-thread
Power overhead: 1.38x over single-thread, down from 1.53x for the baseline decoupled look-ahead

[Figure: speedup over single-thread (1.0-1.5) for 4-wide, 3-wide, and 2-wide look-ahead cores, shown separately for INT and FP; bars for Baseline Decoupled Look-ahead and DIY+Skeleton Payload Look-ahead]


Hybridization: Heuristically Designed Initial Solutions

Genetic evolution could be a slow and lengthy process

Heuristic-based solutions are helpful to jump-start the evolution

Heuristically designed solutions in our system:

Superposition chromosome; Orthogonal subroutine chromosome

[Figure: heuristically designed initial chromosomes — (a) single-gene chromosomes with single-instruction genes, (b) superposition chromosomes built from multi-instruction genes, (c) orthogonal chromosomes partitioned by subroutines A, B, C]
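A rough sketch of how the seed chromosomes in panels (b) and (c) could be built, assuming a chromosome is simply a bit vector over candidate skeleton instructions (1 = gene on, i.e. instruction removed); all names are illustrative:

#include <stdlib.h>

typedef struct { unsigned char *gene; int len; } Chrom;

static Chrom chrom_new(int len)
{
    Chrom c;
    c.gene = calloc((size_t)len, 1);   /* all genes start off */
    c.len  = len;
    return c;
}

/* (b) Superposition: OR together single-gene chromosomes that each passed
 *     an individual fitness test, producing one aggressive seed.          */
static Chrom superpose(const Chrom *singles, int n, int len)
{
    Chrom c = chrom_new(len);
    int i, g;
    for (i = 0; i < n; i++)
        for (g = 0; g < len; g++)
            c.gene[g] |= singles[i].gene[g];
    return c;
}

/* (c) Orthogonal: one seed per subroutine; flip only the genes whose
 *     instruction belongs to that subroutine (subr[g] = subroutine id). */
static Chrom orthogonal(const int *subr, int len, int target)
{
    Chrom c = chrom_new(len);
    int g;
    for (g = 0; g < len; g++)
        if (subr[g] == target)
            c.gene[g] = 1;
    return c;
}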


Online Genetic Evolution: equake

Primary overhead comes from testing bad skeleton configs

Break-even point: 1.8 billion insts (1-2 sec of native execution)
By 4.6 billion insts: overall cumulative speed is already 10% faster

[Figure: accumulated IPC (top) and distributed IPC (bottom) over the instruction count for equake (1 epoch = 1 million instructions, up to ~4.7 billion instructions); series: single-thread baseline, baseline decoupled look-ahead, online self-tuned look-ahead]


Comparison with Other Proposals

Speculative slices [Zilles and Sohi: ISCA’00, ISCA’01]

Speculative slices achieve only 57% of their ideal speedup of 13%

Dual-core execution (DCE) [Zhou: PACT'05]

DCE achieves about 16% speedup over single-thread
For integer codes the speedup is substantially lower (<10%)

[Figure: speedup over single-thread for SPEC 2000 benchmarks (gzp ... apsi, Gmean); y-axis 1.0-2.4; series: Speculative slice limit, Dual-core execution (DCE_64), Self-tuned decoupled look-ahead; one bar clipped at 5.94]


Potential DIY Modules

Loop iterations accessing same cache line, reduction operations

Library function calls: printf, OtsMove, OtsFill etc.

A case in point: mark_modified_reg() from 176.gcc

Dynamic contribution: 3%; performance speedup: 10%

static void mark_modified_reg (dest, x)
     rtx dest;
     rtx x;
{
  int regno, i;

  if (GET_CODE (dest) == SUBREG)
    dest = SUBREG_REG (dest);
  if (GET_CODE (dest) == MEM)
    modified_mem = 1;
  if (GET_CODE (dest) != REG)
    return;

  regno = REGNO (dest);
  if (regno >= FIRST_PSEUDO_REGISTER)
    modified_regs[regno] = 1;
  else
    for (i = 0; i < HARD_REGNO_NREGS (regno, GET_MODE (dest)); i++)
      modified_regs[regno + i] = 1;
}
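For the first class of candidates above (consecutive loop iterations that walk the same cache line), a minimal sketch of what a skeleton-side DIY stand-in could look like; the 64 B line size and the helper name are assumptions for illustration, not the thesis implementation:

/* Hypothetical DIY module: instead of executing the original element-by-
 * element loop, the skeleton touches each cache line of the array once,
 * preserving the prefetching effect at a fraction of the work.          */
#include <stddef.h>

static void diy_touch_lines(const char *base, size_t nbytes)
{
    volatile char sink;
    size_t off;
    for (off = 0; off < nbytes; off += 64)   /* one access per 64 B line */
        sink = base[off];
    (void)sink;
}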


Skeleton Payload Distribution

Baseline skeleton payload: biased branches turned unconditional + L2 prefetches + L1 prefetches + software prefetches

Optimal only 30% of the time

For the remaining 70% of the time, other payloads are optimal

Performance potential of customized payloads: 1.21x

[Figure: fraction of best epochs (%) per skeleton payload (DLA, bB, bB+L2, bB+Sf, bB+L1, bB+L2+Sf, bB+L1+L2, bB+L1+Sf, B, B+L2, B+Sf, B+L1, B+L2+Sf, B+L1+L2, B+L1+Sf, B+L1+L2+Sf, All-inst, ST; epoch = 10k insts) for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrd, art, eqk, face, amp, luc, fma]


Skeleton Payload Tuning Framework

Collects performance of various payloads in regular epochs

Associates each static code region with the best payload

[Figure: payload tuning flow — (A) per-epoch payload performance tables (StartPC, InstCount, Cycles) for candidate payloads such as B+L1+L2, bB+L1, bB+L2; (B) best payload per epoch; (C) per-PC <payload: count> tuples; (D) best payload per PC; (E) final skeleton with per-region payloads, e.g. 0xAA: bB+L2, 0xBB: B+L1+L2, 0xZZ: bB+L1]
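A compact sketch of the bookkeeping behind steps (C) and (D); the table sizes and the payload encoding are assumptions made for illustration:

/* Hypothetical per-PC payload tuning bookkeeping. */
#include <stdint.h>

#define NPC      1024   /* tracked static regions (size assumed)      */
#define NPAYLOAD 16     /* bB/B x {L1, L2, Sf} payload combinations   */

static uint32_t wins[NPC][NPAYLOAD];   /* (C) per-PC <payload: count> tuples */

/* Called once per epoch: 'winner' is the payload with the fewest cycles
 * among the candidates measured for the epoch starting at pc_idx.      */
void credit_epoch_winner(int pc_idx, int winner)
{
    wins[pc_idx][winner]++;
}

/* (D) Best payload per PC, to be written into the final skeleton (E). */
int final_payload(int pc_idx)
{
    int best = 0, p;
    for (p = 1; p < NPAYLOAD; p++)
        if (wins[pc_idx][p] > wins[pc_idx][best])
            best = p;
    return best;
}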


Performance Impact of Duty Cycle

One DIY call example from 179.art

[Figure: performance gain over DLA (%) vs. duty cycle (5-100%, not to scale) for the WeightAdj() DIY call in 179.art; y-axis 0-8%]
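One way to read "duty cycle" here is the fraction of dynamic calls for which the skeleton runs the DIY stand-in rather than the full routine; a hypothetical gate along those lines (names and percentages assumed):

static unsigned call_no;

/* Returns nonzero for roughly duty_pct percent of the calls. */
int diy_this_call(int duty_pct)    /* duty_pct in [0, 100] */
{
    return (int)(++call_no % 100u) < duty_pct;
}

/* Illustrative skeleton call site:
 *     if (diy_this_call(25)) weight_adj_diy(net);   // cheap stand-in
 *     else                   weight_adj(net);       // original routine
 */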


Weak Dependence: Insights and Findings

The evolution process is remarkably robust

Different inputs and configurations do not invalidate the results
Sampling can be used to accelerate the fitness test w/o appreciable impact on the quality of the solution found (sketched below)

Energy reduction: due to less activity and stalling

About 10% of dynamic instructions are removed from the skeleton
11% energy saving over the baseline decoupled look-ahead

Impact of weak insts removal on look-ahead quality is very small

Similar prefetch and branch hint accuracy
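A minimal sketch of the sampled fitness test mentioned above; simulate_epoch() stands in for the detailed timing model and is not an actual interface from the thesis:

/* Score a chromosome on every k-th epoch instead of the whole program. */
extern double simulate_epoch(const unsigned char *chrom, long epoch);

double sampled_fitness(const unsigned char *chrom, long n_epochs, int k)
{
    double ipc_sum = 0.0;
    long   n = 0, e;
    for (e = 0; e < n_epochs; e += k) {   /* 1-in-k sampling */
        ipc_sum += simulate_epoch(chrom, e);
        n++;
    }
    return n ? ipc_sum / n : 0.0;         /* mean IPC as fitness */
}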


Comparison with Speculative Parallel Look-ahead

Self-tuned skeleton is used in the speculative parallel look-ahead

In some cases, self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)


Unique Opportunities for Speculative Parallelization

Skeleton code offers more parallelism

Certain dependences are removed during slicing for the skeleton
Short-distance dependence chains become long-distance chains, suitable for TLP exploitation

Look-ahead is inherently error-tolerant

Can ignore dependence violations
Little to no support needed, unlike in conventional TLS

[Figure: example dependence graph over a skeleton slice (Alpha code); numbered nodes include 1. 0x120011490 ldt $f0,0(a0), 2. 0x12000dac0 ldl t7,0(a5), 3. 0x12000daec lda a5,4(a5), 4. 0x120011984 ldq a0,80(sp), 7. 0x1200119f8 bis 0,t8,t11, 8. 0x120011b04 lda a0,8(a0), along with 0x1200119a0 ldt $f12,32(sp), 0x1200119ac lda t8,168(sp), 0x1200114bc stt $f0,0(a2), 0x12000da84 lda a5,744(sp); edges labeled with dependence distances (1, 2, 13, 18)]

A. Garg, R. Parihar, M. Huang, “Speculative Parallelization in Decoupled Look-ahead”, PACT’11


Software Support

Dependence analysis

Profile-guided, coarse-grain, at basic-block level

Spawn and Target points

Basic blocks with a consistent dependence distance of more than the threshold DMIN
Spawned thread executes from the target point

Loop-level parallelism is also exploited (a selection sketch follows below)
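A small sketch of the spawn/target selection rule described above, with the consistency threshold an assumption for illustration:

/* Hypothetical profile pass: a (spawn, target) basic-block pair qualifies
 * when the observed dependence distance between the blocks is consistently
 * above DMIN (measured in basic blocks).                                  */
#define DMIN            15      /* minimum dependence distance, in BBs */
#define MIN_CONSISTENCY 0.95    /* fraction of profiled instances (assumed) */

struct bb_pair_profile {
    int  spawn_bb, target_bb;   /* candidate spawn and target blocks   */
    long instances;             /* dynamic occurrences in the profile  */
    long far_enough;            /* instances with distance > DMIN      */
};

int qualifies(const struct bb_pair_profile *p)
{
    double consistent;
    if (p->instances == 0)
        return 0;
    consistent = (double)p->far_enough / (double)p->instances;
    return consistent >= MIN_CONSISTENCY;   /* mark spawn at spawn_bb;
                                               the new thread starts at target_bb */
}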


Parallelism Potential in Look-ahead Binary

Available parallelism for a 2-core/2-context system; DMIN = 15 BBs

Skeleton exhibits significantly more BB-level parallelism (17%)
Loop-based FP applications exhibit more BB-level parallelism

[Figure: approximate parallelism (1.0-2.0) of the original binary vs. the skeleton for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean]


Hardware and Runtime Support

Thread spawning and merging are very similar to regular thread spawning, except:

Spawned thread shares the same register and memory state
Spawning thread terminates at the target PC

Value communication

Register-based: naturally through shared registers in SMT
Memory-based: can be supported at different levels
Partial versioning in the cache at line level

[Figure: spawn/merge timeline for look-ahead threads 0 and 1 — at the spawn point the rename table is duplicated and a context is set up; the two threads run in parallel until the merge point, where the duplicated state is cleaned up]
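A simplified model of the line-level partial versioning mentioned above; the structure and field names are illustrative, not the actual design:

/* A store by the spawned (speculative) look-ahead thread marks its copy of
 * the line; its loads prefer that copy and otherwise read the shared one.
 * No rollback state is kept, matching look-ahead's error tolerance.      */
struct line {
    unsigned long tag;
    int           valid;
    int           spec_version;   /* 1 = written by the spawned thread */
    unsigned char data[64];
};

/* Returns the line a load should read, or 0 on a miss. */
struct line *lookup(struct line *shared, struct line *spec,
                    unsigned long tag, int is_spec_thread)
{
    if (is_spec_thread && spec->valid && spec->tag == tag && spec->spec_version)
        return spec;          /* the thread's own speculative version */
    if (shared->valid && shared->tag == tag)
        return shared;        /* fall back to the committed copy      */
    return 0;                 /* miss: fetch from the next level      */
}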


Speedup of Speculative Parallelization

Applications in which the look-ahead thread is a bottleneck

Speculative look-ahead over decoupled look-ahead: 1.13x

[Figure: speedup over single-thread (1-5) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean; bars for baseline look-ahead and speculatively parallel look-ahead]


Speculative Look-ahead vs Conventional TLS

Skeleton provides more opportunities for parallelization

Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
Speculative main thread over the single-thread baseline: 1.07x

[Figure: speedup over the respective baseline (0.8-1.6) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean; bars for speculatively parallel main thread and speculatively parallel look-ahead; one bar clipped at 1.65]


Baseline Cache Partitioning

Baseline (naive) cache partitioning/sharing policies:

Hard partition: every program gets an equal cache share
No partition: programs can use any portion of the shared caches

Two extremes: Resource protection vs. Capacity utilization

Unrelated program co-runs: individual slowdowns may not be justifiable if the programs come from different users

Unlike slowing down a thread occasionally to improve throughput

Cache rationing: achieves good cache protection and cache utilization without slowing down individual programs


Microthreads vs Decoupled Look-ahead

[Figure: side-by-side comparison of lightweight microthreads and decoupled look-ahead]


Look-ahead Skeleton Construction

Raj Parihar Advanced Computer Architecture Lab University of Rochester 65

Under-clocked Dual-core Speedup

Typically, a dual-core can be clocked only up to 90% of the clock frequency of a single-core system

After adjusting the single-core frequency accordingly:

Single-core IPC: 1.80 (INT), 2.28 (FP), 2.05 (Combined)

Baseline look-ahead over the 10% over-clocked single-thread:

Speedup: 1.13x (INT), 1.34x (FP), 1.24x (Combined)

Self-tuned look-ahead over single-thread (14 applications):

Speedup: 1.20x (INT), 1.96x (FP), 1.43x (Combined)
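As a hedged illustration of the arithmetic behind these frequency-adjusted numbers, the sketch below compares performance as IPC multiplied by clock frequency, using the slide's 90% dual-core clock ratio and the quoted single-core IPC; the look-ahead IPC value in the example is hypothetical.

```python
# Worked example of frequency-adjusted speedup. Only the 0.9 clock ratio
# and the 2.05 single-core IPC come from the slide; the look-ahead IPC
# used below is hypothetical.

DUAL_CORE_CLOCK_RATIO = 0.9   # dual-core runs at 90% of the single-core clock

def adjusted_speedup(ipc_lookahead_pair, ipc_single_core):
    """Speedup of the look-ahead core pair (at the lower clock) over a
    single core running at the full clock frequency."""
    perf_lookahead = ipc_lookahead_pair * DUAL_CORE_CLOCK_RATIO
    perf_single = ipc_single_core * 1.0
    return perf_lookahead / perf_single

# Example: if the main thread of the look-ahead pair sustained an IPC of
# 2.26 on the 90%-clocked dual-core, its speedup over the 2.05-IPC single
# core would be about 0.99x; i.e., look-ahead must raise IPC by more than
# roughly 11% (1/0.9) before it beats the faster-clocked single core.
print(round(adjusted_speedup(2.26, 2.05), 2))
```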

Raj Parihar Advanced Computer Architecture Lab University of Rochester 66

Self-tuned Look-ahead: SPEC 2006

Self-tuned look-ahead achieves a 1.10x speedup over baseline look-ahead for SPEC CPU 2006 applications

[Figure: speedup over single-thread (y-axis, 1x to 8x) for perl, bzp, gcc, mcf, go, hmer, sjen, libq, h264, omn, astr, xaln, milc, deal, splx, and Gmean; two series: baseline look-ahead and GA-based look-ahead]

Raj Parihar Advanced Computer Architecture Lab University of Rochester 67

Self-tuned Look-ahead: Speedup Analysis

A larger code base (with more genes) takes slightly longer to evolve

[Figure: relative performance gain (y-axis, log scale from 1 to 100) vs. number of static instructions (x-axis, 0 to 30,000) for SPEC 2000 and SPEC 2006 applications, with a linear regression line (r = -0.46)]

Relative Performance Gain = (Ideal - DLA) / (GA - DLA)

Raj Parihar Advanced Computer Architecture Lab University of Rochester 68

Self-tuned Look-ahead: Speedup Analysis

Performance gain has strong correlation with # of generations

[Figure: saturation generation (y-axis, 2 to 10) vs. number of static instructions (x-axis, 0 to 30,000) for SPEC 2000 and SPEC 2006 applications, with a linear regression line (r = 0.56)]

Saturation generation: the generation at which the solution reaches >= 90% of the best GA solution
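Given this definition, the saturation generation can be read directly from a per-generation fitness trace; below is a minimal sketch with a made-up trace (the helper name and the numbers are illustrative only).

```python
# Minimal sketch: computing the "saturation generation" of a GA run, i.e.,
# the first generation whose best fitness reaches >= 90% of the best
# fitness ever achieved. The fitness trace below is made up.

def saturation_generation(best_fitness_per_gen, threshold=0.9):
    target = threshold * max(best_fitness_per_gen)
    for gen, fit in enumerate(best_fitness_per_gen, start=1):
        if fit >= target:
            return gen
    return None

# Hypothetical per-generation best speedups of the evolved skeleton:
trace = [1.02, 1.05, 1.11, 1.18, 1.19, 1.20, 1.20]
print(saturation_generation(trace))   # -> 3 (1.11 >= 0.9 * 1.20 = 1.08)
```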

Raj Parihar Advanced Computer Architecture Lab University of Rochester 69

Partial Recovery in Speculative Parallelization

Raj Parihar Advanced Computer Architecture Lab University of Rochester 70

Flexibility in Look-ahead Hardware Design

Comparison of the regular (partial versioning) cache support with two other alternatives:

No cache versioning support

Dependence violation detection and squash
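Neither alternative is specified in detail on this slide; purely as an illustration of the second one, the toy model below tracks the cache lines a speculative look-ahead thread has read and squashes it when a logically earlier store hits one of those lines. The class, the line size, and the squash policy are assumptions, not the hardware described in the thesis.

```python
# Toy model (assumptions only) of dependence-violation detection for a
# speculatively parallel look-ahead thread: remember the cache lines the
# speculative thread read; if the logically earlier thread later writes
# one of them, the speculative work is squashed.

LINE_BITS = 6  # assume 64-byte cache lines

def line_of(addr):
    return addr >> LINE_BITS

class SpeculativeThread:
    def __init__(self):
        self.read_set = set()   # lines read speculatively
        self.squashed = False

    def spec_load(self, addr):
        self.read_set.add(line_of(addr))

    def earlier_store(self, addr):
        """Called for every store of the logically earlier thread."""
        if line_of(addr) in self.read_set:
            self.squashed = True    # violation: discard speculative state
            self.read_set.clear()

spec = SpeculativeThread()
spec.spec_load(0x1000)      # speculative thread reads line 0x40
spec.earlier_store(0x1008)  # earlier thread writes the same line
print(spec.squashed)        # -> True
```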

Raj Parihar Advanced Computer Architecture Lab University of Rochester 71

Genetic Algorithm Evolution

Raj Parihar Advanced Computer Architecture Lab University of Rochester 72

Multi-instruction Gene Examples

Raj Parihar Advanced Computer Architecture Lab University of Rochester 73

Superposition based Chromosomes

Raj Parihar Advanced Computer Architecture Lab University of Rochester 74

Recovery based Early Termination of Fitness Test

Raj Parihar Advanced Computer Architecture Lab University of Rochester 75

Optimizations to Implementation

Fitness test optimizations

Sampling based fitness

Multi-instruction genes

Early termination of tests

GA framework optimizations

Hybridization of solutions

Adaptive mutation rate

Unique chromosomes

Fusion crossover operator

Elitism policy (a schematic loop showing where these hooks apply is sketched below)
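To make the role of these knobs concrete, here is a schematic genetic-algorithm loop with placeholder hooks named after the bullets above (sampling-based fitness, elitism, unique chromosomes, fusion-style crossover, adaptive mutation). It is a generic sketch, not the thesis framework; the fitness function in particular is a stand-in for the real sampled simulation.

```python
# Schematic GA loop for evolving a look-ahead skeleton; every hook below
# is a placeholder that only marks where the listed optimizations apply.
import random

def fitness(chromosome, sample_only=True):
    # Sampling-based fitness hook: in the real framework this would
    # simulate only sampled windows of the program. Placeholder score here.
    return random.random()

def evolve(pop_size=32, genes=64, generations=20):
    population = [[random.randint(0, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    mutation_rate = 0.05                        # adaptive mutation rate hook
    seen = set()                                # unique-chromosomes hook
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[:2]                      # elitism policy hook
        nxt = [list(c) for c in elite]
        while len(nxt) < pop_size:
            a, b = random.sample(scored[:pop_size // 2], 2)
            child = [ga if random.random() < 0.5 else gb   # fusion-style crossover
                     for ga, gb in zip(a, b)]
            child = [g ^ (random.random() < mutation_rate) for g in child]
            key = tuple(child)
            if key not in seen:                 # keep only unique chromosomes
                seen.add(key)
                nxt.append(child)
        population = nxt
        mutation_rate = max(0.01, mutation_rate * 0.95)  # adapt over time
    return max(population, key=fitness)

best = evolve()
print(sum(best), "genes kept in the best skeleton")
```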

Raj Parihar Advanced Computer Architecture Lab University of Rochester 76

Sampling based Fitness Test

Raj Parihar Advanced Computer Architecture Lab University of Rochester 77

L2 Cache Sensitivity Study

Speedup for various L2 cache sizes is quite stable:

1.139x (1 MB), 1.133x (2 MB), and 1.131x (4 MB) L2 caches

Average speedups (shown in the figure) are relative to single-threaded execution with a 1 MB L2 cache

Raj Parihar Advanced Computer Architecture Lab University of Rochester 78

Approximable Program Paradigm

Weak dependence removal and speculative parallelization techniques can be applied to any approximate program

A few real-life examples of approximate computing:

Google search: does not work with a coherent, up-to-date database

Map-Reduce paradigm: ignores consistently failing records

Media applications: photo, audio, and video have some tolerance

Algorithm- and application-level approximations:

Modern benchmarks, e.g., PARSEC, are fundamentally approximate

Application space: clustering, prediction, optimization, etc.

BenchNN: a neural network based alternative to PARSEC

Raj Parihar Advanced Computer Architecture Lab University of Rochester 79

References (Partial)

Decoupled Access/Execute Computer Architectures. J. Smith, ACM TC'84

A Study of Slipstream Processors. Z. Purser, K. Sundaramoorthy, E. Rotenberg, MICRO'00

Master/Slave Speculative Parallelization. C. Zilles, G. Sohi, MICRO'02

A Performance-Correctness Explicitly Decoupled Architecture. A. Garg, M. Huang, MICRO'08

Speculative Parallelization in Decoupled Look-ahead. A. Garg, R. Parihar, M. Huang, PACT'11

Accelerating Decoupled Look-ahead via Weak Dependence Removal: A Metaheuristic Approach. R. Parihar, M. Huang, HPCA'14

Raj Parihar Advanced Computer Architecture Lab University of Rochester 80