runahead execution - computer action teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• runahead...

33
Runahead Execution An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu Yale N. Patt Jared Stark Chris Wilkerson

Upload: others

Post on 04-Mar-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

Runahead Execution

An Alternative to Very Large Instruction Windows for Out-of-order Processors

Onur MutluYale N. Patt

Jared StarkChris Wilkerson

Page 2: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

2

Outline

• Motivation

• Overview

• Mechanism

• Experimental Evaluation

• Conclusions

Page 3: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

3

Motivation

• Out-of-order processors require very large instruction windows to tolerate today’s main memory latencies.– Even in the presence of caches and prefetchers

• As main memory latency (in terms of processor cycles) increases, instruction window size should also increase to fully tolerate the memory latency.

• Building a very large instruction window is not an easy task.

Page 4: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

4

Small Windows: Full-window Stalls

• Instructions are retired in-order from the instruction window to support precise exceptions.

• When a very long-latency instruction is not complete, it blocks retirement and incoming instructions fill the instruction window if the window is not large enough.

• Processor cannot place new instructions into the window if the window is already full. This is called a full-window stall.

• L2 misses are responsible for most full-window stalls.

Page 5: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

5

Impact of Full-window Stalls

IPC: 0.77

IPC: 1.69 IPC: 1.15

0%

10%

20%

30%

40%

50%

60%

70%

80%

Machine Model (L2 size, instruction window size)

% c

ycle

s w

ith

fu

ll w

ind

ow s

tall

s 512KB L2, 128-entry window

perfect L2, 128-entry window

512KB L2, 2048-entry window

Page 6: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

6

Overview of Runahead Execution

• During a significant percentage of full-window stall cycles, no work is performed in the processor.

• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction.

• Enter runahead mode when the oldest instruction is an L2-miss load and remove that load from the processor.

• While in runahead mode, keep processing instructions without updating architectural state and without blocking the instruction window due to L2 misses.

• When the original load miss returns back, resume normal-mode execution starting with the runahead-causing load.

Page 7: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

7

Benefits of Runahead Execution

• Loads and stores independent of L2-miss instructions generate useful prefetch requests:– From main memory to L2

– From L2 to L1

• Instructions on the predicted program path are prefetchedinto the trace cache and L2.

• Hardware prefetcher tables are trained using future memory access information. The prefetcher also runs ahead along with the processor.

Page 8: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

8

Mechanism

Page 9: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

9

Entry into Runahead Mode

When an L2-miss load instruction reaches the head of the instruction window:

• Processor checkpoints architectural register state, branch history register, return address stack.

• Processor records the address of the L2-miss load.

• Processor enters runahead mode.

• L2-miss load marks its destination register as invalid and is removed from the instruction window.

Page 10: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

10

Processing in Runahead Mode

• Two types of results are produced: INV (invalid), VALID

• First INV result is produced by the L2-miss load that caused entry into runahead mode.

• An instruction produces an INV result– If it sources an INV result– If it misses in the L2 cache (A prefetch request is generated)

• INV results are marked using INV bits in the register file, store buffer, and runahead cache.– INV bits prevent introduction of bogus data into the pipeline.– Bogus values are not used for prefetching/branch resolution.

Page 11: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

11

Pseudo-retirement in Runahead Mode

• An instruction is examined for pseudo-retirement when it reaches the head of the instruction window.

• An INV instruction is removed from window immediately.

• A VALID instruction is removed when it completes execution and updates only the microarchitectural state.

• Pseudo-retired instructions free their allocated resources.

• Pseudo-retired runahead stores communicate their data and INV status to dependent runahead loads.

Page 12: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

12

Runahead Cache

• An auxiliary structure that holds the data values and INV bits for memory locations modified by pseudo-retired runaheadstores.

• Its purpose is memory communication during runahead mode.

• Runahead loads access store buffer, runahead cache, and L1 data cache in parallel.

• Size of runahead cache is very small (512 bytes).

Page 13: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

13

Runahead Branches

• Runahead branches use the same predictor as normal branches.

• VALID branches are resolved and trigger recovery if mispredicted.

• INV branches cannot be resolved.

Page 14: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

14

Exit from Runahead Mode

When the data for the instruction that caused entry into runahead returns from main memory:

• All instructions in the machine are flushed.

• INV bits are reset. Runahead cache is flushed.

• Processor restores the state as it was before the runahead-inducing instruction was fetched.

• Processor starts fetch beginning with the runahead-inducing L2-miss instruction.

Page 15: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

15

Experimental Evaluation

Page 16: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

16

Baseline Processor

• 3-wide fetch, 29-stage pipeline

• 128-entry instruction window

• 32 KB, 8-way, 3-cycle L1 data cache, write-back

• 512 KB, 8-way, 16-cycle L2 unified cache, write-back

• Approximately 500-cycle penalty for L2 misses

• Streaming prefetcher (16 streams)

Page 17: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

17

Benchmarks

• Selected traces out of a pool of 280 traces

• Evaluated performance on those that gain at least 10% IPC improvement with perfect L2 cache

• 147 traces simulated for 30 million x86 instructions

• Trace Suites– SPEC CPU95 (S95): 10 benchmarks, mostly FP

– SPEC FP2k (FP00): 11 benchmarks

– SPECint2k (INT00): 6 benchmarks

– Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001

– Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake

– Productivity (PROD): 17 benchmarks: Sysmark2k, winstone

– Server (SERV): 2 benchmarks: tpcc, timesten

– Workstation (WS): 7 benchmarks: CAD, nastran, verilog

Page 18: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

18

Performance of Runahead Execution

22%

52%16%

12%22%

15%

13%

35%

12%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ion

s P

er C

ycle

No prefetcher, no runaheadPrefetcher, no runaheadRunahead, no prefetcherRunahead and prefetcher

Page 19: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

19

Effect of Frontend on Runahead

• Average number of instructions during runahead: 711– Before mispredicted INV branch: 431

• Average number of L2 misses during runahead: 2.6– Before mispredicted INV branch: 2.38

• Runahead becomes more effective with a better frontend:– Real trace cache: 22% IPC improvement– Perfect trace cache: 27% IPC improvement– Perfect branch predictor and trace cache: 31% IPC improvement

Page 20: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

20

Runahead vs. Large Windows

-6%

4%

6%

0%

3%-1%

2%12%

3%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ions

Per

Cyc

le

128-entry window with Runahead256-entry window384-entry window512-entry window

Page 21: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

21

Importance of Store-Load Data Communication

10%

23%12%

7%13%

9%

5%

3%

8%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ions

Per

Cyc

le

Baseline

Runahead, no runahead cache

Runahead with runahead cache

Page 22: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

22

Conclusions

• Runahead execution results in 22% IPC increase over the baseline processor with a 128-entry window and a streaming prefetcher.

• This is within 1% of the IPC of a 384-entry window machine.

• Runahead and the streaming prefetcher interact positively.

• Store-load data communication through memory in runaheadmode is vital for high performance.

Page 23: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

23

Backup Slides

Page 24: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

24

Added Bits & Structures

Page 25: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

25

Runahead-Prefetcher Interaction

55%48%

36%24%

3%

82%

53%

17%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

tpcc verilog nastran gcc vortex mgrid apsi ammp

Inst

ruct

ions

Per

Cyc

le

No prefetcher, no runaheadPrefetcher, no runaheadRunahead, no prefetcherRunahead and prefetcher

Page 26: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

26Effect of a Better Frontend

27%

75%

27%

15%

26%21%

13%

35%

13%

31%

89%

37%

20%

34%29%

8%

36%

13%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ions

Per

Cyc

le

Baseline w/ perf TC

Runahead w/ perf TC

Baseline w/ perf TC and BP

Runahead w/ perf TC and BP

Page 27: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

27

When do we see the L2 misses in Runahead?L2 Data Miss Distances

0

50

100

150

200

250

300

350

400

450

500

550

600

650

700

750

800

850

miss 1 miss 2 miss 3 miss 4 miss 5 miss 6 miss 7 miss 8 miss 9 miss 10 miss 11 miss 12 miss 13 miss 14 miss 15 miss 16

nth out-of-window miss

Cyc

les

or in

stru

ctio

ns a

fter t

he ru

nahe

ad-c

ausi

ng L

2 m

iss

cycles

instrs

Page 28: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

28

Distance of the first L2 miss in runahead modeWhere does the first L2 miss occur?

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

45.00%

128-256 256-384 384-512 512-640 640-768 768-896 896-1024 1024-1536 1536+

Distance in instructions from the runahead-causing L2 miss

Per

cent

age

of fi

rst o

ut-o

f-w

indo

w L

2 m

isse

s

Page 29: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

29

Instruction vs. Data Prefetching Benefit

88%

96%55%

87%91%

74%

94%

97%

87%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ions

Per

Cyc

le

Baseline

Runahead with no inst benefit

Runahead with all benefits

Page 30: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

30

Runahead with a Larger L2 Cache

22%

52%16%

12%22%

15%

13%

35%

12%

17%

40%13%

10%19%

12%

14%

30%

7%

16%

32%

8%

8%13%

11%

30%

27%

6%

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ion

s P

er C

ycle

No Runahead - 0.5 MB L2Runahead - 0.5 MB L2No Runahead - 1 MB L2Runahead - 1 MB L2No Runahead - 4 MB L2Runahead - 4 MB L2

Page 31: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

31

Runahead on in-order?

40%

55%

28%

14%

21%17%

74%

73%

15%

22%

52%16%

12%22%

15%

13%

35%

12%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ion

s P

er C

ycle

in-order baselinein-order baseline with runaheadout-of-order baselineout-of-order baseline with runahead

Page 32: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

32

Future Model Results

Page 33: Runahead Execution - Computer Action Teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction

33

Future Model with Better Frontend