
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Raj Parihar

Advisor: Prof. Michael C. Huang

Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY


Motivation

Despite the proliferation of multi-core, multi-threaded systems

High single-thread performance is still an important CPU design goal

Modern programs do not lack instruction level parallelism

[Figure: IPC (1 to 50) of bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean for the ideal:128, ideal:512, ideal:2K, real:128, real:512, and real:2K configurations]

Real challenge: exploit implicit parallelism without undue cost

One effective approach: Decoupled look-ahead architecture


Motivation

Decoupled look-ahead architecture targets

Performance hurdles: branch mispredictions, cache misses, etc.
Exploration of parallelization opportunities, dependence information
Microarchitectural complexity, energy inefficiency through decoupling

The look-ahead thread can often become a new bottleneck

Lack of correctness constraint allows many optimizations

Weak dependence: instructions that contribute marginally to the outcome can be removed without affecting the quality of look-ahead
Do-It-Yourself branches: side-effect-free, "easy-to-predict" branches can be skipped in the look-ahead thread


Outline

Motivation

Baseline decoupled look-ahead

Look-ahead: a new bottleneck

Look-ahead thread acceleration

Weak dependences/instructions

Do-It-Yourself branches & skeleton tuning

Experimental analysis

Additional insights and summary


Baseline Decoupled Look-ahead Architecture

Skeleton generated just for the look-ahead purposes
The skeleton runs on a separate core and:
Speculative state is completely contained within the look-ahead context
Sends branch outcomes through a FIFO queue; also helps prefetching

[Figure: baseline decoupled look-ahead. The look-ahead core (with its own L0$) executes the look-ahead skeleton; the main core (with its L1$) executes the program binary; both share the L2$ and main memory. The look-ahead core feeds branch predictions to the main core through a branch queue, provides prefetching hints, and synchronizes register state with it.]

Program binary:
  addq v0, v0, v0
  subq v0, t0, a2
  cmovge a2, a2, v0
  addq v0, v0, v0
  subq v0, t0, a2
  cmovge a2, a2, v0
  subq a1, 0x2, a1
  addq v0, v0, v0
  bgt a1, 0x12001f9a0
  subq v0, t0, a2

Skeleton:
  addq v0, v0, v0
  nop
  ...
  ...
  bgt a1, 0x12001f9a0
  subq v0, t0, a2

A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO-08
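To make the skeleton idea concrete, here is a minimal sketch, assuming (as the example above suggests) that the skeleton keeps only the instructions that feed branch outcomes and memory addresses and turns the rest into nops. It illustrates the concept only; it is not the ALTO-based tool flow used in the thesis, and the toy instruction format is invented for the example.

```python
# Minimal sketch of skeleton construction (illustration only, not the thesis's
# ALTO-based tool): keep branches, memory operations, and the instructions that
# feed them; replace everything else with a nop so the code layout is preserved.
from collections import namedtuple

Inst = namedtuple("Inst", "op dst srcs")    # toy instruction format (assumed)

def build_skeleton(code):
    keep = set()
    needed = set()                           # registers whose producers must stay
    for i in range(len(code) - 1, -1, -1):   # walk backwards to follow the slice
        inst = code[i]
        is_branch = inst.op.startswith("b")           # e.g. bgt, beq
        is_memory = inst.op in ("ldq", "stq")         # loads/stores drive prefetching
        if is_branch or is_memory or inst.dst in needed:
            keep.add(i)
            needed.discard(inst.dst)
            needed.update(inst.srcs)
    return [inst if i in keep else Inst("nop", None, ())
            for i, inst in enumerate(code)]

# Toy fragment loosely modeled on the slide's Alpha code.
program = [
    Inst("addq", "v0", ("v0", "v0")),
    Inst("subq", "v0", ("t0", "a2")),
    Inst("cmovge", "a2", ("a2", "v0")),
    Inst("subq", "a1", ("a1",)),     # loop-counter update feeds the branch
    Inst("bgt", None, ("a1",)),      # branch outcome must reach the main core
]

for inst in build_skeleton(program):
    print(inst.op, inst.dst, inst.srcs)
```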


Look-ahead: A New Bottleneck

Comparing four systems to discover new bottlenecks

Single-thread, decoupled look-ahead, ideal, and look-ahead limit

Application categories:
Bottleneck removed, or speed of look-ahead is not an issue (left half)
Look-ahead thread is the new bottleneck (right half)

[Figure: IPC (0 to 4) of aplu, msa, wup, mgri, six, swim, facr, gal, gcc, gap, eon, fma3, gzip, craf, vrtx, apsi, vpr, bzp2, equk, amp, luc, art, perl, mcf, and two under the look-ahead limit, single-thread, decoupled look-ahead, and ideal (cache, branch) configurations]


Weak Dependences/Instructions

Not all instructions are equally important and critical

Example of weak instructions:

Inconsequential adjustments
Load and store instructions that are (mostly) silent
Dynamic NOP instructions

Plenty of weak instructions are present in programs (hundreds of them)

Weak instructions can be experimentally defined and their impact quantified in isolation
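A minimal sketch of what "experimentally defined in isolation" could look like in practice; simulate_ipc is a hypothetical stand-in for running the decoupled look-ahead pair in the cycle-level simulator with the listed skeleton instructions removed, and the example PCs and numbers are made up.

```python
# Minimal sketch of quantifying weakness in isolation. simulate_ipc() is a
# hypothetical stand-in for running the decoupled look-ahead pair with the
# given static instructions removed from the skeleton and reporting IPC.

def find_weak_in_isolation(candidates, simulate_ipc, threshold=0.0):
    """Return (pc, gain) pairs whose individual removal does not hurt look-ahead."""
    base_ipc = simulate_ipc(removed=frozenset())
    weak = []
    for pc in candidates:
        ipc = simulate_ipc(removed=frozenset([pc]))
        gain = (ipc - base_ipc) / base_ipc
        if gain >= threshold:          # harmless (or helpful) when removed alone
            weak.append((pc, gain))
    return sorted(weak, key=lambda x: -x[1])

# Toy stand-in with made-up PCs: two removable instructions, one costly one.
def fake_simulate_ipc(removed):
    effect = {0x12001f9a8: 0.02, 0x12001f9b0: 0.01, 0x12001f9c0: -0.06}
    return 1.50 * (1.0 + sum(effect.get(pc, 0.0) for pc in removed))

print(find_weak_in_isolation([0x12001f9a8, 0x12001f9b0, 0x12001f9c0],
                             fake_simulate_ipc))
```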


Challenge # 1: Weak insts do not look different

After-the-fact analysis based on static attributes of instructions reveals:

Static attributes of weak and regular instructions are remarkably similar
Correlation coefficient of the two distributions is very high (0.96)

Weakness has very poor correlation with static attributes

Hard to identify the weak instructions through static heuristics

[Figure: number of inputs (0 to 2) by instruction type (addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, muls/mult, s4addq, sll, stq, subq, zapnot/zap) for weak instructions vs. strong instructions]


Challenge # 2: False positives are extremely costly

After-the-fact analysis and close inspection also reveal:

Some instructions are more likely to be weak than others
Even then, a single false positive can negate all the gains

Case in point: zapnot in gap

zapnot Ra Rb Rc

84% of the zapnot instructions are weak in isolation: 3.4% speedup
A single false-positive zapnot instruction: 6% slowdown
More than one false positive can cause up to a 13% slowdown


Challenge # 3: Neither absolute nor additive

Weakness is context dependent and non-linear, much like Jenga
All weak instructions combined together are not weak!

Example: weak instruction combining in perlbmk
About 300 weak instructions when tested in isolation
All combined together can result in up to a 40% slowdown

[Figure: performance impact over baseline look-ahead (-40% to +20%) vs. cumulative number of weak instructions removed (0 to 300) for perlbmk]


Metaheuristic-Based Trial-and-Error Approach

Recap: Challenges in identifying weak instructions

Weak instructions look very similar to regular instructions
False positives are extremely costly and can negate all the gains
Weakness is context dependent: neither absolute nor additive

Our approach: metaheuristic-based self-tuning
Experimentally identify/verify weakness
Search for profitable combinations via a metaheuristic

Metaheuristic: completely agnostic of the meaning of a solution
Derive new solutions from current solutions through modifications
Examples: genetic algorithms, simulated annealing, etc.

R. Parihar, M. Huang, “Accelerating Decoupled Look-ahead via Weak Dependence Removal”, HPCA’14


Genetic Algorithm based Framework

The problem naturally maps to a genetic algorithm

The skeleton is represented by a bit vector
Natural mapping: weak instruction → gene, collection → chromosome
Objective: find the optimal combination (chromosome)

Genetic evolution: Procreation, mutation, fitness-based selection

[Figure: GA-based framework. A binary parser extracts single-instruction genes from the program binary; chromosome creation builds single-gene chromosomes, multi-instruction genes, superposition chromosomes, and orthogonal chromosomes to seed the initial chromosome population; GA evolution then iterates roulette-wheel parent selection from the parents pool, reproduction with crossover and mutation, de-duplication, and fitness tests with elitism over the children pool; the result drives construction of the look-ahead (skeleton) binary.]
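A minimal, self-contained sketch of the evolution loop in the figure: chromosomes are bit vectors over candidate weak instructions, parents are drawn by roulette wheel, children go through crossover, mutation, and de-duplication, and elitism preserves the best skeletons found so far. The fitness function, population size, and rates below are illustrative stand-ins; a real fitness test runs the candidate skeleton and measures speedup.

```python
# Minimal GA sketch of the slide's mapping: gene = candidate weak instruction,
# chromosome = bit vector over candidates (1 = remove from the skeleton).
# fitness() is a hypothetical stand-in for running the resulting skeleton and
# reporting speedup over the baseline look-ahead.
import random

N_GENES, POP, GENERATIONS, ELITE = 32, 20, 7, 2
random.seed(1)

def fitness(chrom):
    # Stand-in: reward removing "even" genes, punish removing "odd" ones,
    # mimicking the fact that false positives are costly.
    return max(0.01, 1.0 + sum(0.01 if i % 2 == 0 else -0.02
                               for i, bit in enumerate(chrom) if bit))

def roulette(pop, scores):
    pick, acc = random.uniform(0, sum(scores)), 0.0
    for chrom, s in zip(pop, scores):
        acc += s
        if acc >= pick:
            return chrom
    return pop[-1]

def crossover(a, b):
    cut = random.randrange(1, N_GENES)
    return a[:cut] + b[cut:]

def mutate(chrom, rate=1.0 / N_GENES):
    return [bit ^ (random.random() < rate) for bit in chrom]

# Initial population: single-gene chromosomes (one candidate removed each).
population = [[1 if i == g else 0 for i in range(N_GENES)] for g in range(POP)]

for gen in range(GENERATIONS):
    scores = [fitness(c) for c in population]
    elite = sorted(population, key=fitness, reverse=True)[:ELITE]  # elitism
    children, seen = list(elite), {tuple(c) for c in elite}
    while len(children) < POP:
        child = mutate(crossover(roulette(population, scores),
                                 roulette(population, scores)))
        if tuple(child) not in seen:            # de-duplication
            seen.add(tuple(child))
            children.append(child)
    population = children
    print(f"gen {gen + 1}: best speedup {max(map(fitness, population)):.3f}x")
```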


Speedup of Weak Dependence Removal

Applications in which the look-ahead thread is a bottleneck

Self-tuned, genetic algorithm based decoupled look-ahead

Speedup over baseline decoupled look-ahead: 1.11x (geomean)
Overall speedup over single-thread baseline: 1.48x

[Figure: speedup over single-thread (1 to 6) for craf, eon, gap, gzip, mcf, pbmk, two, vrtx, vpr, amp, art, eqk, fma3, luc, and Gmean, comparing baseline look-ahead with GA-based look-ahead]


Progress of Genetic Evolution Process

Per-generation progress compared to the final best solution
After 2 generations, more than half of the benefits are achieved
After 5 generations, significant performance benefits are achieved

GA evolution, helped by hybridization, shows good progress

[Figure: progress relative to the best GA solution (0% to 100%) over generations 1-7 for eon, mcf, pbmk, twolf, vpr, art, eqk, fma, amp, and lucas]


Evolution can be Online or Offline

Offline evolution: one-time tuning (e.g., at install time)

Fitness tests need not take long (2-20 s on the target machine)
Different inputs and configurations do not invalidate the result

Online evolution: takes longer but has little overhead

Additional work is minimal: bookkeeping, bit-vector manipulation
Main source of slowdown: testing bad configurations

[Figure: accumulated IPC (1 to 3) vs. number of instructions executed (in millions, up to about 4700) for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead]


A Locomotive and Cargo Analogy

Skeleton payload: look-ahead tasks and associated housekeeping
Locomotive: look-ahead thread; cargo: skeleton payload

Dilemma: heavy cargo (slower locomotive) vs. lighter cargo (underutilization of the locomotive's capability)

[Figure: locomotive-and-cargo illustration of the skeleton payload, with L1 prefetches and L2 prefetches as the cargo]


Idea of Do-It-Yourself (DIY) Branches

Extends the idea of weak instructions to easy-to-predict branches

To accelerate the look-ahead thread, DIY branches are either skipped completely or partially executed in the skeleton

[Figure: DIY branch transformations. (A) Forward conditional branches (if-then, if-then-else): (1) DIY [C], (2) DIY [BR -> A -> C], (3) DIY [BR -> B -> C], with ZAP/LEFT/FALL/RIGHT variants. (B) Backward conditional branches (loops): (4) DIY [A -> C], (5) DIY [A -> B -> BR -> C].]

Tune skeleton via selectively including/excluding prefetches

R. Parihar, M. Huang, “Load Balancing in Decoupled Look-ahead via DIY Branches and Payload Tuning”, (in draft)


Hardware Support for DIY Branches

Hardware support needed to synchronize after DIY regions

Additional BOQ bit to indicate the beginning of a DIY region
Main thread has its own branch predictor for DIY regions
DIY call-depth register to keep track of nesting/recursion

[Figure: the look-ahead thread executes the skeleton (entries such as i1: add [1, 0], i2: call [1, 1, 25], i3: ldq [1, 2], i4: stq [1, 0], encoded as [mask, diy, duty]) and sends direction + DIY info through the branch queue; the main thread adds a DIY mode bit, a DIY call-depth register, and its own branch predictor for DIY regions]
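A behavioral sketch of one plausible reading of this mechanism (the queue format, field names, and region-end condition are assumptions for illustration, not the exact hardware encoding): branch-queue entries carry a direction plus a DIY flag; when the main core sees a flagged entry it stops consuming the queue, predicts with its own predictor, and uses a call-depth counter to detect the end of the DIY region before resynchronizing.

```python
# Behavioral sketch (one plausible reading of the slide, not the exact
# hardware): the look-ahead core enqueues (direction, diy) pairs; a diy=1 entry
# marks a region the skeleton skipped, so the main core predicts those branches
# itself and uses a call-depth counter to find the end of the region.
from collections import deque

class MainCore:
    def __init__(self, branch_queue, own_predictor):
        self.boq = branch_queue          # FIFO filled by the look-ahead core
        self.predict = own_predictor     # main core's local branch predictor
        self.diy_mode = False
        self.diy_depth = 0               # DIY call-depth register

    def next_branch(self, pc, is_call=False, is_return=False):
        if self.diy_mode:
            # Inside a DIY region: do not consume the queue, predict locally.
            if is_call:
                self.diy_depth += 1
            elif is_return:
                self.diy_depth -= 1
                if self.diy_depth == 0:      # region closed, resynchronize
                    self.diy_mode = False
            return self.predict(pc)
        direction, diy = self.boq.popleft()
        if diy:                              # BOQ bit: DIY region starts here
            self.diy_mode, self.diy_depth = True, 1
            return self.predict(pc)
        return direction                     # normal case: use look-ahead outcome

# Tiny usage example with a trivial always-taken local predictor.
core = MainCore(deque([(1, 0), (0, 1), (1, 0)]), own_predictor=lambda pc: 1)
print(core.next_branch(0x100))                      # from queue
print(core.next_branch(0x104, is_call=True))        # enters DIY region
print(core.next_branch(0x108))                      # predicted locally
print(core.next_branch(0x10c, is_return=True))      # closes DIY region
print(core.next_branch(0x110))                      # back to the queue
```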


Experimental Setup

Program/binary analysis tool: ALTO

Simulator: detailed out-of-order, cycle-level, in-house

SMT, look-ahead, and speculative parallelization support
True execution-driven simulation (faithful value modeling)

Genetic algorithm framework

Modeled as offline and online extensions to the simulator

Microarchitectural configurations:

Baseline core (similar to POWER5)
Fetch/Decode/Issue/Commit: 8 / 4 / 6 / 6
ROB: 128
Functional units: INT 2 + 1 mul + 1 div; FP 2 + 1 mul + 1 div
Fetch Q / Issue Q / Reg. (int, fp): (32, 32) / (32, 32) / (80, 80)
LSQ (LQ, SQ): 64 (32, 32), 2 search ports
Branch predictor: Gshare, 8K entries, 13-bit history
Branch misprediction penalty: at least 7 cycles
L1 data cache (private): 32 KB, 4-way, 64 B line, 2 cycles, 2 ports
L1 instruction cache (private): 64 KB, 2-way, 128 B line, 2 cycles
L2 cache (shared): 1 MB, 8-way, 128 B line, 15 cycles
Memory access latency: 200 cycles

Look-ahead core: baseline core with only an LQ, no SQ
L0 cache: 32 KB, 4-way, 64 B line, 2 cycles
Round-trip latency to L1: 6 cycles

Communication:
Branch Output Queue: 512 entries
Register copy latency (recovery): 64 cycles

Table 1: Microarchitectural configurations.


Individual Performance Gains

Speedup of DIY branches over baseline look-ahead: 1.08x

Speedup of Skeleton Payload Tuning: 1.12x

Combined speedup (DIY + Payload Tuning): 1.15x

[Figure: speedup over single-thread (1.0 to 2.2) for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrid, art, eqk, face, ammp, lucas, fma3d, and Gmean, comparing baseline decoupled look-ahead, DIY-branch-based look-ahead, skeleton-payload-tuned look-ahead, and DIY + skeleton-payload-tuned look-ahead; annotations mark the 15% combined gain and 16.2% for weak-instruction removal]


Overall Performance Gain

Final decoupled look-ahead system

Skeleton payload tuning + Weak dependence + DIY branches

Performance speedup over:

Baseline look-ahead: 1.20x
Single-thread: 1.61x

[Figure: speedup over baseline DLA (1.0 to 2.0) for gcc, mcf, eon, pbm, bzp, two, wup, mgri, art, eqk, face, amp, luc, fm3, and gmean, comparing weak-dependence-removed DLA with weak dep + DIY + payload-tuned DLA]


Salient Features of Decoupled Look-ahead

Hard-to-predict, data-dependent branches

Conventional predictors do not capture data-dependent behaviors

Prefetching for cold misses

Conventional prefetchers take time to learn access/miss patterns

Other potential advantages:

Can pass dependence information for thread-level speculation
Can assist in value prediction (close to 90% correctness)

Potential hurdles and showstoppers:

If no distillation is possible: the look-ahead thread can run at a higher clock
JIT: a fixed mask is an issue, but the mask can be evolved dynamically


Future Explorations

Effective look-ahead to improve L1 prefetching performance

Speeding up critical threads and serial bottlenecks via a shared look-ahead agent in multi-threaded applications

Cost-effective SMT implementation of decoupled look-ahead

Role of look-ahead in promoting parallelization, value prediction, and acceleration of interpreted programs

Backward strawman: integrate non-speculative look-ahead computations directly in the main thread


Details in the Thesis and Papers

Decoupled Look-ahead Architecture:

Weak dependence removal in decoupled look-ahead [HPCA’14]

Load balancing in look-ahead via DIY branches [PACT-SRC’15]

Speculative parallelization in decoupled look-ahead [PACT’11]

DIY branches and payload tuning [in prep. for HPCA’17]

Shared Cache Management:

Hardware support for protective and collaborative caches [ISMM’16]

Protection and utilization in shared cache via rationing [PACT’14]

A coldness metric for cache optimization [MSPC’13]


Motivation: Cache Rationing

Compute systems with shared resources are prevalent today
Multi-core clusters, cloud computing, data centers, server farms
Programs often compete for shared caches and other resources

Significant performance loss due to co-run interference: >25%

[Figure: IPC normalized to a solo run with 512 KB L2$ (0.7 to 1.3, one bar reaching 2.32) under equal partitioning, no partitioning, rationing, and PIPP-equal; SPEC 2000 benchmarks 1-26 co-run with equake (2 cores, 1 MB L2 cache)]


Idea of Cache Rationing

Achieve both resource protection and utilization simultaneously

Rationing policy:

Initial ration: every program is assigned an initial portion of the cache
Non-intrusive sharing: a program can exceed its allocated ration only if another program is not using its ration
Entitlement: if a program is using its ration, it cannot be taken away by peer programs

Conservative sharing: provides a safety net for less aggressive programs in the presence of non-cooperative programs

R. Parihar, J. Brock, C. Ding, M. Huang, “Hardware support for protective and collaborative cache sharing”, ISMM’16


Hardware Support

Ration Accounting: ration counter-register pairs

To track the current usage of a program, maintained per core per set

Usage Tracking: access-bit and block owner

To detect unused ration and ensure entitlement; one of each per cache block

[Figure: rationing hardware organization — data array (s sets x w ways) with a per-block access bit, tag array with a per-block status bit and block owner, and a per-set ration tracker holding p ration counter-register pairs; the owner is recorded at allocation]

Additional storage overhead: <1% of total cache storage
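A rough, illustrative accounting of that overhead (field widths assumed; 1 MB, 16-way, 64 B-line L2 shared by two cores, i.e. 16384 blocks in 1024 sets):

/* Illustrative metadata layout; widths assumed, not taken from the thesis. */
struct block_meta {            /* one per cache block                       */
    unsigned access_bit : 1;   /* block touched by its owner recently?      */
    unsigned owner      : 1;   /* owning core (2 cores assumed)             */
};                             /* ~2 bits x 16384 blocks = 4 KB             */

struct set_meta {              /* p counter-register pairs per set          */
    unsigned char ration[2];   /* allocated ways per core                   */
    unsigned char usage[2];    /* ways currently held per core              */
};                             /* 4 B x 1024 sets = 4 KB                    */

/* Roughly 8 KB of extra state for a 1 MB cache, i.e. on the order of 1%
 * of the data array, broadly consistent with the "<1%" figure above.       */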


Resource Protection Co-Run

Co-run with a high-pressure peer (mcf)

Rationing: achieves good resource protection, similar to partitioning
No partitioning: almost every co-run is unhealthy, with high damage

[Figure: IPC normalized to solo run with 512 KB L2$ for SPEC 2000 programs (1-26) co-running with mcf on 2 cores sharing a 1 MB L2 cache; bars for Equal partitioning, No partitioning, Rationing, and PIPP-equal; y-axis 0.7-1.3, one bar clipped at 1.52]

Benchmark index — INT: 1-gzip 2-vpr 3-gcc 4-mcf 5-crafty 6-parser 7-eon 8-perlbmk 9-gap 10-vortex 11-bzip2 12-twolf; FP: 13-wupwise 14-swim 15-mgrid 16-applu 17-mesa 18-galgel 19-art 20-equake 21-facerec 22-ammp 23-lucas 24-fma3d 25-sixtrack 26-apsi


Capacity Utilization Co-Run

Co-run with a low-pressure peer (eon): cache demand <128 KB

Rationing: utilizes the cache well and speeds up 14 applications without slowing down any co-running program
No partitioning: also speeds up 13 applications, but at the cost of slowing down 11 co-running programs

[Figure: IPC normalized to solo run with 512 KB L2$ for SPEC 2000 programs (1-26) co-running with eon on 2 cores sharing a 1 MB L2 cache; bars for Equal partitioning, No partitioning, Rationing, and PIPP-equal; y-axis 0.8-1.4, one bar clipped at 1.744]


Summary

Decoupled look-ahead can uncover significant implicit parallelism
However, the look-ahead thread often becomes a new bottleneck

Fortunately, thanks to the lack of correctness constraints, look-ahead lends itself to various optimizations

Weak instructions can be removed w/o affecting look-ahead quality
Side-effect-free, “easy-to-predict” DIY branches can be skipped
Skeleton payload can be tuned w/o incurring extra recoveries

Metaheuristic-based self-tuning is simple and robust

Improves single-thread performance by 1.61x
Much better than conventional turbo boost and frequency scaling

Multi-threaded workloads can benefit from speeding up serial sections and bottleneck threads in critical regions


Acknowledgments

Funding agencies: NSF, NSFC

Prof. Michael C. Huang, Alok Garg

Prof. Chen Ding and his research group at URCS

Past & current members of the Advanced Computer Architecture Lab at the University of Rochester


Backup Slides


Summary of Distillation Techniques

Convert biased branches to unconditional “taken” or “not taken”
Eliminate stores from long-distance store-load pairs

Stores would have been committed by main thread

Selective value (zero) substitution for the L2 misses

If the look-ahead distance drops below a threshold

Speculative parallelization of skeleton w/o any rollback support

Weak dependence/instruction removal from skeleton

Consecutive loop iterations accessing the same cache line

A wide variety of library calls and reduction operations

Selective payload: eliminate payload if it slows look-ahead
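As one concrete illustration of the first technique listed above, a profile pass might convert a branch only when its bias clears a threshold; this is a hypothetical sketch, not the thesis tool chain:

/* Hypothetical skeleton-construction helper: decide, from edge profiles,
 * whether a conditional branch can become unconditional in the skeleton. */
typedef enum { KEEP, FORCE_TAKEN, FORCE_NOT_TAKEN } BrAction;

BrAction classify_branch(unsigned long taken, unsigned long not_taken,
                         double bias_threshold /* e.g. 0.99, assumed */)
{
    unsigned long total = taken + not_taken;
    if (total == 0)
        return KEEP;                    /* never executed in the profile   */
    {
        double bias = (double)taken / (double)total;
        if (bias >= bias_threshold)
            return FORCE_TAKEN;         /* rewrite as an unconditional jump */
        if (1.0 - bias >= bias_threshold)
            return FORCE_NOT_TAKEN;     /* drop the branch entirely         */
    }
    return KEEP;                        /* still needed for look-ahead quality */
}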


Practical Advantages of Decoupled Look-ahead

Micro helper thread based approach:
Targets top cache misses and branch mispredictions (low coverage)
Requires support for quick spawning and register communication (not trivial)

Decoupled look-ahead approach:
Easy to disable, low management overhead on the main thread
Natural throttling to prevent run-away prefetching and cache pollution

[Figure: speedup over single-thread for SPEC 2000 benchmarks (gzp ... apsi, Gmean); y-axis 1.0-2.4; series: Speculative slice limit (ideal) and Decoupled look-ahead, with one bar clipped at 4.0]


Practical Advantages of Decoupled Look-ahead

Look-ahead thread is a self-reliant agent, completely independent of the main thread

No need for quick spawning and register communication support
Low management overhead on the main thread
Easier for run-time control to disable

Natural throttling mechanism to prevent run-away prefetching and cache pollution

Look-ahead thread size is comparable to the aggregation of short helper threads

Cache misses:
            90%             95%
          DI      SI      DI      SI
bzip2    1.86     17     3.15     27
crafty   0.73     23     1.04     38
eon      2.28     50     3.34    159
gap      1.35     15     1.44     23
gcc      8.49    153     8.84    320
gzip     0.1       6     0.1       6
mcf     13.1      13    14.7      16
parser   1.31     41     1.59     57
pbmk     1.87     35     2.11     52
twolf    2.69     23     3.28     28
vortex   1.96     42     2        67
vpr      7.47     16    11.6      22
Avg      3.60%    36     4.44%    68

Branch mispredictions:
            90%             95%
          DI      SI      DI      SI
bzip2    3.9      52     4.49     64
crafty   5.33    235     6.14    309
eon      2.02     19     2.31     23
gap      2.02     77     2.64    130
gcc      8.08   1103     8.41   1700
gzip     8.41     40     8.66     52
mcf      9.99     14    10.2      18
parser   6.81    130     7.3     183
pbmk     2.88     92     3.21    127
twolf    5.75     41     6.48     56
vortex   1.24    114     1.97    167
vpr      4.8       6     4.88      7
Avg      5.10%   160     5.56%   236


Correlation with RTL/FPGA Accurate Simulator

Reported performance improvement results are very pessimistic

Optimistic branch misprediction latency: 7 vs. 15 cycles
Fixed memory latency, no queuing delays in the L1/L2 interfaces

RTL accurate simulator: shows 2x more performance potential

[Figure: speedup over single-thread (1.0-2.2) under SimpleScalar vs. IMG-psim for Perfect BP, Perfect L2, Perfect L2+BP, Perfect L1, Perfect L1+BP, and DLA; DLA projected at 1.56x]


Simplified Look-ahead Core

Baseline skeleton: 71%; after distillation: 57%
2-wide look-ahead core (front end is still 8-wide):
2x power savings for the RAT and other traditional hotspots
Reduces the overall power overhead of the look-ahead system by 10%

[Figure: power of look-ahead core components (Rename, ROB, Int IQ, Fp IQ, Decode, RAT-dec, RAT-wl, RAT-bl, DCL-cmp, LSQ, Total), normalized to a 4-wide DLA core (0-100%), for 3-wide and 2-wide look-ahead cores]


Baseline vs. Tuned Skeleton

Distilled skeleton enables simplification of the look-ahead core
Better power and energy efficiency w/o compromising speed

Energy efficiency: 17% better compared to single-thread
Power overhead: 1.38x over single-thread, down from 1.53x for the baseline decoupled look-ahead

[Figure: speedup over single-thread (1.0-1.5) for 4-wide, 3-wide, and 2-wide look-ahead cores, shown separately for INT and FP; bars for Baseline Decoupled Look-ahead and DIY+Skeleton Payload Look-ahead]


Hybridization: Heuristically Designed Initial Solutions

Genetic evolution could be a slow and lengthy process

Heuristic-based solutions are helpful to jump-start the evolution

Heuristically designed solutions in our system:

Superposition chromosome; Orthogonal subroutine chromosome

[Figure: heuristically designed initial chromosomes — (a) single-gene chromosomes with single-instruction genes, (b) superposition chromosomes built from multi-instruction genes, (c) orthogonal chromosomes partitioned by subroutines A, B, C]
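A rough sketch of how the seed chromosomes in panels (b) and (c) could be built, assuming a chromosome is simply a bit vector over candidate skeleton instructions (1 = gene on, i.e. instruction removed); all names are illustrative:

#include <stdlib.h>

typedef struct { unsigned char *gene; int len; } Chrom;

static Chrom chrom_new(int len)
{
    Chrom c;
    c.gene = calloc((size_t)len, 1);   /* all genes start off */
    c.len  = len;
    return c;
}

/* (b) Superposition: OR together single-gene chromosomes that each passed
 *     an individual fitness test, producing one aggressive seed.          */
static Chrom superpose(const Chrom *singles, int n, int len)
{
    Chrom c = chrom_new(len);
    int i, g;
    for (i = 0; i < n; i++)
        for (g = 0; g < len; g++)
            c.gene[g] |= singles[i].gene[g];
    return c;
}

/* (c) Orthogonal: one seed per subroutine; flip only the genes whose
 *     instruction belongs to that subroutine (subr[g] = subroutine id). */
static Chrom orthogonal(const int *subr, int len, int target)
{
    Chrom c = chrom_new(len);
    int g;
    for (g = 0; g < len; g++)
        if (subr[g] == target)
            c.gene[g] = 1;
    return c;
}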


Online Genetic Evolution: equake

Primary overhead comes from testing bad skeleton configs

Break-even point: 1.8 billion insts (1-2 sec of native execution)
By 4.6 billion insts: overall cumulative speed is already 10% faster

[Figure: accumulated IPC (top) and distributed IPC (bottom) over the instruction count for equake (1 epoch = 1 million instructions, up to ~4.7 billion instructions); series: single-thread baseline, baseline decoupled look-ahead, online self-tuned look-ahead]


Comparison with Other Proposals

Speculative slices [Zilles and Sohi: ISCA’00, ISCA’01]

Speculative slices achieve only 57% of their ideal speedup of 13%

Dual-core execution (DCE) [Zhou: PACT'05]

DCE achieves about 16% speedup over single-thread
For integer codes the speedup is substantially lower (<10%)

[Figure: speedup over single-thread for SPEC 2000 benchmarks (gzp ... apsi, Gmean); y-axis 1.0-2.4; series: Speculative slice limit, Dual-core execution (DCE_64), Self-tuned decoupled look-ahead; one bar clipped at 5.94]


Potential DIY Modules

Loop iterations accessing same cache line, reduction operations

Library function calls: printf, OtsMove, OtsFill etc.

A case in point: mark_modified_reg() from 176.gcc

Dynamic contribution: 3%; performance speedup: 10%

static void mark_modified_reg (dest, x)
     rtx dest;
     rtx x;
{
  int regno, i;

  if (GET_CODE (dest) == SUBREG)
    dest = SUBREG_REG (dest);
  if (GET_CODE (dest) == MEM)
    modified_mem = 1;
  if (GET_CODE (dest) != REG)
    return;

  regno = REGNO (dest);
  if (regno >= FIRST_PSEUDO_REGISTER)
    modified_regs[regno] = 1;
  else
    for (i = 0; i < HARD_REGNO_NREGS (regno, GET_MODE (dest)); i++)
      modified_regs[regno + i] = 1;
}
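For the first class of candidates above (consecutive loop iterations that walk the same cache line), a minimal sketch of what a skeleton-side DIY stand-in could look like; the 64 B line size and the helper name are assumptions for illustration, not the thesis implementation:

/* Hypothetical DIY module: instead of executing the original element-by-
 * element loop, the skeleton touches each cache line of the array once,
 * preserving the prefetching effect at a fraction of the work.          */
#include <stddef.h>

static void diy_touch_lines(const char *base, size_t nbytes)
{
    volatile char sink;
    size_t off;
    for (off = 0; off < nbytes; off += 64)   /* one access per 64 B line */
        sink = base[off];
    (void)sink;
}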


Skeleton Payload Distribution

Baseline skeleton payload: biased branches turned unconditional + L2 prefetches + L1 prefetches + software prefetches

Optimal only 30% of the time

For the remaining 70% of the time, other payloads are optimal

Performance potential of customized payloads: 1.21x

[Figure: fraction of best epochs (%) per skeleton payload (DLA, bB, bB+L2, bB+Sf, bB+L1, bB+L2+Sf, bB+L1+L2, bB+L1+Sf, B, B+L2, B+Sf, B+L1, B+L2+Sf, B+L1+L2, B+L1+Sf, B+L1+L2+Sf, All-inst, ST; epoch = 10k insts) for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrd, art, eqk, face, amp, luc, fma]


Skeleton Payload Tuning Framework

Collects performance of various payloads in regular epochs

Associates each static code region with the best payload

[Figure: payload tuning flow — (A) per-epoch payload performance tables (StartPC, InstCount, Cycles) for candidate payloads such as B+L1+L2, bB+L1, bB+L2; (B) best payload per epoch; (C) per-PC <payload: count> tuples; (D) best payload per PC; (E) final skeleton with per-region payloads, e.g. 0xAA: bB+L2, 0xBB: B+L1+L2, 0xZZ: bB+L1]
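A compact sketch of the bookkeeping behind steps (C) and (D); the table sizes and the payload encoding are assumptions made for illustration:

/* Hypothetical per-PC payload tuning bookkeeping. */
#include <stdint.h>

#define NPC      1024   /* tracked static regions (size assumed)      */
#define NPAYLOAD 16     /* bB/B x {L1, L2, Sf} payload combinations   */

static uint32_t wins[NPC][NPAYLOAD];   /* (C) per-PC <payload: count> tuples */

/* Called once per epoch: 'winner' is the payload with the fewest cycles
 * among the candidates measured for the epoch starting at pc_idx.      */
void credit_epoch_winner(int pc_idx, int winner)
{
    wins[pc_idx][winner]++;
}

/* (D) Best payload per PC, to be written into the final skeleton (E). */
int final_payload(int pc_idx)
{
    int best = 0, p;
    for (p = 1; p < NPAYLOAD; p++)
        if (wins[pc_idx][p] > wins[pc_idx][best])
            best = p;
    return best;
}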


Performance Impact of Duty Cycle

One DIY call example from 179.art

[Figure: performance gain over DLA (%) vs. duty cycle (5-100%, not to scale) for the WeightAdj() DIY call in 179.art; y-axis 0-8%]
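One way to read "duty cycle" here is the fraction of dynamic calls for which the skeleton runs the DIY stand-in rather than the full routine; a hypothetical gate along those lines (names and percentages assumed):

static unsigned call_no;

/* Returns nonzero for roughly duty_pct percent of the calls. */
int diy_this_call(int duty_pct)    /* duty_pct in [0, 100] */
{
    return (int)(++call_no % 100u) < duty_pct;
}

/* Illustrative skeleton call site:
 *     if (diy_this_call(25)) weight_adj_diy(net);   // cheap stand-in
 *     else                   weight_adj(net);       // original routine
 */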


Weak Dependence: Insights and Findings

The evolution process is remarkably robust

Different inputs and configurations do not invalidate the results
Sampling can be used to accelerate the fitness test w/o appreciable impact on the quality of the solution found (sketched below)

Energy reduction: due to less activity and stalling

About 10% of dynamic instructions are removed from the skeleton
11% energy saving over the baseline decoupled look-ahead

Impact of weak insts removal on look-ahead quality is very small

Similar prefetch and branch hint accuracy
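A minimal sketch of the sampled fitness test mentioned above; simulate_epoch() stands in for the detailed timing model and is not an actual interface from the thesis:

/* Score a chromosome on every k-th epoch instead of the whole program. */
extern double simulate_epoch(const unsigned char *chrom, long epoch);

double sampled_fitness(const unsigned char *chrom, long n_epochs, int k)
{
    double ipc_sum = 0.0;
    long   n = 0, e;
    for (e = 0; e < n_epochs; e += k) {   /* 1-in-k sampling */
        ipc_sum += simulate_epoch(chrom, e);
        n++;
    }
    return n ? ipc_sum / n : 0.0;         /* mean IPC as fitness */
}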


Comparison with Speculative Parallel Look-ahead

Self-tuned skeleton is used in the speculative parallel look-ahead

In some cases, self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)


Unique Opportunities for Speculative Parallelization

Skeleton code offers more parallelism

Certain dependences are removed during slicing for the skeleton
Short-distance dependence chains become long-distance chains, suitable for TLP exploitation

Look-ahead is inherently error-tolerant

Can ignore dependence violations
Little to no support needed, unlike in conventional TLS

[Figure: example dependence graph over a skeleton slice (Alpha code); numbered nodes include 1. 0x120011490 ldt $f0,0(a0), 2. 0x12000dac0 ldl t7,0(a5), 3. 0x12000daec lda a5,4(a5), 4. 0x120011984 ldq a0,80(sp), 7. 0x1200119f8 bis 0,t8,t11, 8. 0x120011b04 lda a0,8(a0), along with 0x1200119a0 ldt $f12,32(sp), 0x1200119ac lda t8,168(sp), 0x1200114bc stt $f0,0(a2), 0x12000da84 lda a5,744(sp); edges labeled with dependence distances (1, 2, 13, 18)]

A. Garg, R. Parihar, M. Huang, “Speculative Parallelization in Decoupled Look-ahead”, PACT’11


Software Support

Dependence analysis

Profile-guided, coarse-grain, at basic-block level

Spawn and Target points

Basic blocks with a consistent dependence distance of more than the threshold DMIN
Spawned thread executes from the target point

Loop-level parallelism is also exploited (a selection sketch follows below)
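A small sketch of the spawn/target selection rule described above, with the consistency threshold an assumption for illustration:

/* Hypothetical profile pass: a (spawn, target) basic-block pair qualifies
 * when the observed dependence distance between the blocks is consistently
 * above DMIN (measured in basic blocks).                                  */
#define DMIN            15      /* minimum dependence distance, in BBs */
#define MIN_CONSISTENCY 0.95    /* fraction of profiled instances (assumed) */

struct bb_pair_profile {
    int  spawn_bb, target_bb;   /* candidate spawn and target blocks   */
    long instances;             /* dynamic occurrences in the profile  */
    long far_enough;            /* instances with distance > DMIN      */
};

int qualifies(const struct bb_pair_profile *p)
{
    double consistent;
    if (p->instances == 0)
        return 0;
    consistent = (double)p->far_enough / (double)p->instances;
    return consistent >= MIN_CONSISTENCY;   /* mark spawn at spawn_bb;
                                               the new thread starts at target_bb */
}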


Parallelism Potential in Look-ahead Binary

Available parallelism for a 2-core/2-context system; DMIN = 15 BBs

Skeleton exhibits significantly more BB-level parallelism (17%)
Loop-based FP applications exhibit more BB-level parallelism

[Figure: approximate parallelism (1.0-2.0) of the original binary vs. the skeleton for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean]


Hardware and Runtime Support

Thread spawning and merging are very similar to regular thread spawning, except:

Spawned thread shares the same register and memory state
Spawning thread terminates at the target PC

Value communication

Register-based: naturally through shared registers in SMT
Memory-based: can be supported at different levels
Partial versioning in the cache at line level

[Figure: spawn/merge timeline for look-ahead threads 0 and 1 — at the spawn point the rename table is duplicated and a context is set up; the two threads run in parallel until the merge point, where the duplicated state is cleaned up]
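A simplified model of the line-level partial versioning mentioned above; the structure and field names are illustrative, not the actual design:

/* A store by the spawned (speculative) look-ahead thread marks its copy of
 * the line; its loads prefer that copy and otherwise read the shared one.
 * No rollback state is kept, matching look-ahead's error tolerance.      */
struct line {
    unsigned long tag;
    int           valid;
    int           spec_version;   /* 1 = written by the spawned thread */
    unsigned char data[64];
};

/* Returns the line a load should read, or 0 on a miss. */
struct line *lookup(struct line *shared, struct line *spec,
                    unsigned long tag, int is_spec_thread)
{
    if (is_spec_thread && spec->valid && spec->tag == tag && spec->spec_version)
        return spec;          /* the thread's own speculative version */
    if (shared->valid && shared->tag == tag)
        return shared;        /* fall back to the committed copy      */
    return 0;                 /* miss: fetch from the next level      */
}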


Speedup of Speculative Parallelization

Applications in which the look-ahead thread is a bottleneck

Speculative look-ahead over decoupled look-ahead: 1.13x

[Figure: speedup over single-thread (1-5) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean; bars for baseline look-ahead and speculatively parallel look-ahead]


Speculative Look-ahead vs Conventional TLS

Skeleton provides more opportunities for parallelization

Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
Speculative main thread over the single-thread baseline: 1.07x

[Figure: speedup over the respective baseline (0.8-1.6) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, gmean; bars for speculatively parallel main thread and speculatively parallel look-ahead; one bar clipped at 1.65]


Baseline Cache Partitioning

Baseline (naive) cache partitioning/sharing policies:

Hard partition: every program gets an equal cache share
No partition: programs can use any portion of the shared caches

Two extremes: Resource protection vs. Capacity utilization

Unrelated program co-runs: individual slowdowns may not be justifiable if the programs come from different users

Unlike slowing down a thread occasionally to improve throughput

Cache rationing: achieves good cache protection and cache utilization without slowing down individual programs


Microthreads vs Decoupled Look-ahead

[Figure: side-by-side comparison of lightweight microthreads and decoupled look-ahead]


Look-ahead Skeleton Construction

Raj Parihar Advanced Computer Architecture Lab University of Rochester 65

Under-clocked Dual-core Speedup

Typically, a dual-core can be clocked only up to 90% of the clock frequency of a single-core system

After adjusting the single-core frequency accordingly:

Single-core IPC: 1.80 (INT), 2.28 (FP), 2.05 (Combined)

Baseline look-ahead over the 10% over-clocked single-thread:

Speedup: 1.13x (INT), 1.34x (FP), 1.24x (Combined)

Self-tuned look-ahead over single-thread (14 applications):

Speedup: 1.20x (INT), 1.96x (FP), 1.43x (Combined)
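As a hedged illustration of the arithmetic behind these frequency-adjusted numbers, the sketch below compares performance as IPC multiplied by clock frequency, using the slide's 90% dual-core clock ratio and the quoted single-core IPC; the look-ahead IPC value in the example is hypothetical.

```python
# Worked example of frequency-adjusted speedup. Only the 0.9 clock ratio
# and the 2.05 single-core IPC come from the slide; the look-ahead IPC
# used below is hypothetical.

DUAL_CORE_CLOCK_RATIO = 0.9   # dual-core runs at 90% of the single-core clock

def adjusted_speedup(ipc_lookahead_pair, ipc_single_core):
    """Speedup of the look-ahead core pair (at the lower clock) over a
    single core running at the full clock frequency."""
    perf_lookahead = ipc_lookahead_pair * DUAL_CORE_CLOCK_RATIO
    perf_single = ipc_single_core * 1.0
    return perf_lookahead / perf_single

# Example: if the main thread of the look-ahead pair sustained an IPC of
# 2.26 on the 90%-clocked dual-core, its speedup over the 2.05-IPC single
# core would be about 0.99x; i.e., look-ahead must raise IPC by more than
# roughly 11% (1/0.9) before it beats the faster-clocked single core.
print(round(adjusted_speedup(2.26, 2.05), 2))
```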

Raj Parihar Advanced Computer Architecture Lab University of Rochester 66

Self-tuned Look-ahead: SPEC 2006

Self-tuned look-ahead achieves a 1.10x speedup over baseline look-ahead for SPEC CPU 2006 applications

[Figure: speedup over single-thread (y-axis, 1x to 8x) for perl, bzp, gcc, mcf, go, hmer, sjen, libq, h264, omn, astr, xaln, milc, deal, splx, and Gmean; two series: baseline look-ahead and GA-based look-ahead]

Raj Parihar Advanced Computer Architecture Lab University of Rochester 67

Self-tuned Look-ahead: Speedup Analysis

A larger code base (with more genes) takes slightly longer to evolve

[Figure: relative performance gain (y-axis, log scale from 1 to 100) vs. number of static instructions (x-axis, 0 to 30,000) for SPEC 2000 and SPEC 2006 applications, with a linear regression line (r = -0.46)]

Relative Performance Gain = (Ideal - DLA) / (GA - DLA)

Raj Parihar Advanced Computer Architecture Lab University of Rochester 68

Self-tuned Look-ahead: Speedup Analysis

Performance gain has strong correlation with # of generations

[Figure: saturation generation (y-axis, 2 to 10) vs. number of static instructions (x-axis, 0 to 30,000) for SPEC 2000 and SPEC 2006 applications, with a linear regression line (r = 0.56)]

Saturation generation: the generation at which the solution reaches >= 90% of the best GA solution
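Given this definition, the saturation generation can be read directly from a per-generation fitness trace; below is a minimal sketch with a made-up trace (the helper name and the numbers are illustrative only).

```python
# Minimal sketch: computing the "saturation generation" of a GA run, i.e.,
# the first generation whose best fitness reaches >= 90% of the best
# fitness ever achieved. The fitness trace below is made up.

def saturation_generation(best_fitness_per_gen, threshold=0.9):
    target = threshold * max(best_fitness_per_gen)
    for gen, fit in enumerate(best_fitness_per_gen, start=1):
        if fit >= target:
            return gen
    return None

# Hypothetical per-generation best speedups of the evolved skeleton:
trace = [1.02, 1.05, 1.11, 1.18, 1.19, 1.20, 1.20]
print(saturation_generation(trace))   # -> 3 (1.11 >= 0.9 * 1.20 = 1.08)
```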

Raj Parihar Advanced Computer Architecture Lab University of Rochester 69

Partial Recovery in Speculative Parallelization

Raj Parihar Advanced Computer Architecture Lab University of Rochester 70

Flexibility in Look-ahead Hardware Design

Comparison of the regular (partial versioning) cache support with two other alternatives:

No cache versioning support

Dependence violation detection and squash
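Neither alternative is specified in detail on this slide; purely as an illustration of the second one, the toy model below tracks the cache lines a speculative look-ahead thread has read and squashes it when a logically earlier store hits one of those lines. The class, the line size, and the squash policy are assumptions, not the hardware described in the thesis.

```python
# Toy model (assumptions only) of dependence-violation detection for a
# speculatively parallel look-ahead thread: remember the cache lines the
# speculative thread read; if the logically earlier thread later writes
# one of them, the speculative work is squashed.

LINE_BITS = 6  # assume 64-byte cache lines

def line_of(addr):
    return addr >> LINE_BITS

class SpeculativeThread:
    def __init__(self):
        self.read_set = set()   # lines read speculatively
        self.squashed = False

    def spec_load(self, addr):
        self.read_set.add(line_of(addr))

    def earlier_store(self, addr):
        """Called for every store of the logically earlier thread."""
        if line_of(addr) in self.read_set:
            self.squashed = True    # violation: discard speculative state
            self.read_set.clear()

spec = SpeculativeThread()
spec.spec_load(0x1000)      # speculative thread reads line 0x40
spec.earlier_store(0x1008)  # earlier thread writes the same line
print(spec.squashed)        # -> True
```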

Raj Parihar Advanced Computer Architecture Lab University of Rochester 71

Genetic Algorithm Evolution

Raj Parihar Advanced Computer Architecture Lab University of Rochester 72

Multi-instruction Gene Examples

Raj Parihar Advanced Computer Architecture Lab University of Rochester 73

Superposition based Chromosomes

Raj Parihar Advanced Computer Architecture Lab University of Rochester 74

Recovery based Early Termination of Fitness Test

Raj Parihar Advanced Computer Architecture Lab University of Rochester 75

Optimizations to Implementation

Fitness test optimizations

Sampling based fitness

Multi-instruction genes

Early termination of tests

GA framework optimizations

Hybridization of solutions

Adaptive mutation rate

Unique chromosomes

Fusion crossover operator

Elitism policy (a schematic loop showing where these hooks apply is sketched below)
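To make the role of these knobs concrete, here is a schematic genetic-algorithm loop with placeholder hooks named after the bullets above (sampling-based fitness, elitism, unique chromosomes, fusion-style crossover, adaptive mutation). It is a generic sketch, not the thesis framework; the fitness function in particular is a stand-in for the real sampled simulation.

```python
# Schematic GA loop for evolving a look-ahead skeleton; every hook below
# is a placeholder that only marks where the listed optimizations apply.
import random

def fitness(chromosome, sample_only=True):
    # Sampling-based fitness hook: in the real framework this would
    # simulate only sampled windows of the program. Placeholder score here.
    return random.random()

def evolve(pop_size=32, genes=64, generations=20):
    population = [[random.randint(0, 1) for _ in range(genes)]
                  for _ in range(pop_size)]
    mutation_rate = 0.05                        # adaptive mutation rate hook
    seen = set()                                # unique-chromosomes hook
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[:2]                      # elitism policy hook
        nxt = [list(c) for c in elite]
        while len(nxt) < pop_size:
            a, b = random.sample(scored[:pop_size // 2], 2)
            child = [ga if random.random() < 0.5 else gb   # fusion-style crossover
                     for ga, gb in zip(a, b)]
            child = [g ^ (random.random() < mutation_rate) for g in child]
            key = tuple(child)
            if key not in seen:                 # keep only unique chromosomes
                seen.add(key)
                nxt.append(child)
        population = nxt
        mutation_rate = max(0.01, mutation_rate * 0.95)  # adapt over time
    return max(population, key=fitness)

best = evolve()
print(sum(best), "genes kept in the best skeleton")
```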

Raj Parihar Advanced Computer Architecture Lab University of Rochester 76

Sampling based Fitness Test

Raj Parihar Advanced Computer Architecture Lab University of Rochester 77

L2 Cache Sensitivity Study

Speedup for various L2 cache sizes is quite stable:

1.139x (1 MB), 1.133x (2 MB), and 1.131x (4 MB) L2 caches

Average speedups (shown in the figure) are relative to single-threaded execution with a 1 MB L2 cache

Raj Parihar Advanced Computer Architecture Lab University of Rochester 78

Approximable Program Paradigm

Weak dependence removal and speculative parallelization techniques can be applied to any approximate program

A few real-life examples of approximate computing:

Google search: does not work with a coherent, up-to-date database

Map-Reduce paradigm: ignores consistently failing records

Media applications: photo, audio, and video have some tolerance

Algorithm- and application-level approximations:

Modern benchmarks, e.g., PARSEC, are fundamentally approximate

Application space: clustering, prediction, optimization, etc.

BenchNN: a neural network based alternative to PARSEC

Raj Parihar Advanced Computer Architecture Lab University of Rochester 79

References (Partial)

Decoupled Access/Execute Computer Architectures. J. Smith, ACM TC'84

A Study of Slipstream Processors. Z. Purser, K. Sundaramoorthy, E. Rotenberg, MICRO'00

Master/Slave Speculative Parallelization. C. Zilles, G. Sohi, MICRO'02

A Performance-Correctness Explicitly Decoupled Architecture. A. Garg, M. Huang, MICRO'08

Speculative Parallelization in Decoupled Look-ahead. A. Garg, R. Parihar, M. Huang, PACT'11

Accelerating Decoupled Look-ahead via Weak Dependence Removal: A Metaheuristic Approach. R. Parihar, M. Huang, HPCA'14

Raj Parihar Advanced Computer Architecture Lab University of Rochester 80