runahead execution - computer action teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• runahead...

Runahead Execution

An Alternative to Very Large Instruction Windows for Out-of-order Processors

Onur MutluYale N. Patt

Jared StarkChris Wilkerson

2

Outline

• Motivation

• Overview

• Mechanism

• Experimental Evaluation

• Conclusions

3

Motivation

• Out-of-order processors require very large instruction windows to tolerate today’s main memory latencies.– Even in the presence of caches and prefetchers

• As main memory latency (in terms of processor cycles) increases, instruction window size should also increase to fully tolerate the memory latency.

• Building a very large instruction window is not an easy task.

4

Small Windows: Full-window Stalls

• Instructions are retired in-order from the instruction window to support precise exceptions.

• When a very long-latency instruction is not complete, it blocks retirement and incoming instructions fill the instruction window if the window is not large enough.

• Processor cannot place new instructions into the window if the window is already full. This is called a full-window stall.

• L2 misses are responsible for most full-window stalls.

5

Impact of Full-window Stalls

IPC: 0.77

IPC: 1.69 IPC: 1.15

0%

10%

20%

30%

40%

50%

60%

70%

80%

Machine Model (L2 size, instruction window size)

% c

ycle

s w

ith

fu

ll w

ind

ow s

tall

s 512KB L2, 128-entry window

perfect L2, 128-entry window

512KB L2, 2048-entry window

6

Overview of Runahead Execution

• During a significant percentage of full-window stall cycles, no work is performed in the processor.

• Runahead execution unblocks the full window stall caused by a long-latency L2-miss instruction.

• Enter runahead mode when the oldest instruction is an L2-miss load and remove that load from the processor.

• While in runahead mode, keep processing instructions without updating architectural state and without blocking the instruction window due to L2 misses.

• When the original load miss returns back, resume normal-mode execution starting with the runahead-causing load.

7

Benefits of Runahead Execution

• Loads and stores independent of L2-miss instructions generate useful prefetch requests:– From main memory to L2

– From L2 to L1

• Instructions on the predicted program path are prefetchedinto the trace cache and L2.

• Hardware prefetcher tables are trained using future memory access information. The prefetcher also runs ahead along with the processor.

8

Mechanism

9

Entry into Runahead Mode

When an L2-miss load instruction reaches the head of the instruction window:

• Processor checkpoints architectural register state, branch history register, return address stack.

• Processor records the address of the L2-miss load.

• Processor enters runahead mode.

• L2-miss load marks its destination register as invalid and is removed from the instruction window.

10

Processing in Runahead Mode

• Two types of results are produced: INV (invalid), VALID

• First INV result is produced by the L2-miss load that caused entry into runahead mode.

• An instruction produces an INV result– If it sources an INV result– If it misses in the L2 cache (A prefetch request is generated)

• INV results are marked using INV bits in the register file, store buffer, and runahead cache.– INV bits prevent introduction of bogus data into the pipeline.– Bogus values are not used for prefetching/branch resolution.

11

Pseudo-retirement in Runahead Mode

• An instruction is examined for pseudo-retirement when it reaches the head of the instruction window.

• An INV instruction is removed from window immediately.

• A VALID instruction is removed when it completes execution and updates only the microarchitectural state.

• Pseudo-retired instructions free their allocated resources.

• Pseudo-retired runahead stores communicate their data and INV status to dependent runahead loads.

12

Runahead Cache

• An auxiliary structure that holds the data values and INV bits for memory locations modified by pseudo-retired runaheadstores.

• Its purpose is memory communication during runahead mode.

• Runahead loads access store buffer, runahead cache, and L1 data cache in parallel.

• Size of runahead cache is very small (512 bytes).

13

Runahead Branches

• Runahead branches use the same predictor as normal branches.

• VALID branches are resolved and trigger recovery if mispredicted.

• INV branches cannot be resolved.

14

Exit from Runahead Mode

When the data for the instruction that caused entry into runahead returns from main memory:

• All instructions in the machine are flushed.

• INV bits are reset. Runahead cache is flushed.

• Processor restores the state as it was before the runahead-inducing instruction was fetched.

• Processor starts fetch beginning with the runahead-inducing L2-miss instruction.

15

Experimental Evaluation

16

Baseline Processor

• 3-wide fetch, 29-stage pipeline

• 128-entry instruction window

• 32 KB, 8-way, 3-cycle L1 data cache, write-back

• 512 KB, 8-way, 16-cycle L2 unified cache, write-back

• Approximately 500-cycle penalty for L2 misses

• Streaming prefetcher (16 streams)

17

Benchmarks

• Selected traces out of a pool of 280 traces

• Evaluated performance on those that gain at least 10% IPC improvement with perfect L2 cache

• 147 traces simulated for 30 million x86 instructions

• Trace Suites– SPEC CPU95 (S95): 10 benchmarks, mostly FP

– SPEC FP2k (FP00): 11 benchmarks

– SPECint2k (INT00): 6 benchmarks

– Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001

– Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake

– Productivity (PROD): 17 benchmarks: Sysmark2k, winstone

– Server (SERV): 2 benchmarks: tpcc, timesten

– Workstation (WS): 7 benchmarks: CAD, nastran, verilog

18

Performance of Runahead Execution

22%

52%16%

12%22%

15%

13%

35%

12%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

S95 FP00 INT00 WEB MM PROD SERV WS AVG

Inst

ruct

ion

s P

er C

ycle

No prefetcher, no runaheadPrefetcher, no runaheadRunahead, no prefetcherRunahead and prefetcher

19

Effect of Frontend on Runahead

• Average number of instructions during runahead: 711– Before mispredicted INV branch: 431

• Average number of L2 misses during runahead: 2.6– Before mispredicted INV branch: 2.38

• Runahead becomes more effective with a better frontend:– Real trace cache: 22% IPC improvement– Perfect trace cache: 27% IPC improvement– Perfect branch predictor and trace cache: 31% IPC improvement

20

Runahead vs. Large Windows

-6%

4%

6%

0%

3%-1%

2%12%

3%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5


Inst

ruct

ions

Per

Cyc

le

128-entry window with Runahead256-entry window384-entry window512-entry window

21

Importance of Store-Load Data Communication

10%

23%12%

7%13%

9%

5%

3%

8%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3


Inst

ruct

ions

Per

Cyc

le

Baseline

Runahead, no runahead cache

Runahead with runahead cache

22

Conclusions

• Runahead execution results in 22% IPC increase over the baseline processor with a 128-entry window and a streaming prefetcher.

• This is within 1% of the IPC of a 384-entry window machine.

• Runahead and the streaming prefetcher interact positively.

• Store-load data communication through memory in runaheadmode is vital for high performance.

23

Backup Slides

24

Added Bits & Structures

25

Runahead-Prefetcher Interaction

55%48%

36%24%

3%

82%

53%

17%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

tpcc verilog nastran gcc vortex mgrid apsi ammp

Inst

ruct

ions

Per

Cyc

le

No prefetcher, no runaheadPrefetcher, no runaheadRunahead, no prefetcherRunahead and prefetcher

26Effect of a Better Frontend

27%

75%

27%

15%

26%21%

13%

35%

13%

31%

89%

37%

20%

34%29%

8%

36%

13%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4


Inst

ruct

ions

Per

Cyc

le

Baseline w/ perf TC

Runahead w/ perf TC

Baseline w/ perf TC and BP

Runahead w/ perf TC and BP

27

When do we see the L2 misses in Runahead?L2 Data Miss Distances

0

50

100

150

200

250

300

350

400

450

500

550

600

650

700

750

800

850

miss 1 miss 2 miss 3 miss 4 miss 5 miss 6 miss 7 miss 8 miss 9 miss 10 miss 11 miss 12 miss 13 miss 14 miss 15 miss 16

nth out-of-window miss

Cyc

les

or in

stru

ctio

ns a

fter t

he ru

nahe

ad-c

ausi

ng L

2 m

iss

cycles

instrs

28

Distance of the first L2 miss in runahead modeWhere does the first L2 miss occur?

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

45.00%

128-256 256-384 384-512 512-640 640-768 768-896 896-1024 1024-1536 1536+

Distance in instructions from the runahead-causing L2 miss

Per

cent

age

of fi

rst o

ut-o

f-w

indo

w L

2 m

isse

s

29

Instruction vs. Data Prefetching Benefit

88%

96%55%

87%91%

74%

94%

97%

87%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3


Inst

ruct

ions

Per

Cyc

le

Baseline

Runahead with no inst benefit

Runahead with all benefits

30

Runahead with a Larger L2 Cache

22%

52%16%

12%22%

15%

13%

35%

12%

17%

40%13%

10%19%

12%

14%

30%

7%

16%

32%

8%

8%13%

11%

30%

27%

6%

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4


Inst

ruct

ion

s P

er C

ycle

No Runahead - 0.5 MB L2Runahead - 0.5 MB L2No Runahead - 1 MB L2Runahead - 1 MB L2No Runahead - 4 MB L2Runahead - 4 MB L2

31

Runahead on in-order?

40%

55%

28%

14%

21%17%

74%

73%

15%

22%

52%16%

12%22%

15%

13%

35%

12%

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3


Inst

ruct

ion

s P

er C

ycle

in-order baselinein-order baseline with runaheadout-of-order baselineout-of-order baseline with runahead

32

Future Model Results

33

Future Model with Better Frontend

runahead execution - computer action teamweb.cecs.pdx.edu/~zeshan/mutlu_hpca03_talk.pdf• runahead...

Documents