Continual Flow Pipelines - Computer Action Team (web.cecs.pdx.edu/~zeshan/cfp_asplos2004-6up.pdf)



TRANSCRIPT

Page 1: Continual Flow Pipelines - Computer Action Team (web.cecs.pdx.edu/~zeshan/cfp_asplos2004-6up.pdf)

Tuesday, October 12, 2004

Continual Flow Pipelines

Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton

Intel Corporation

ASPLOS-XI October 2004

2

Problem: Memory latency

• Long latency memory operations

– disrupt pipeline and stall back-end

• Very large caches

– inefficient performance

– negatively impact die size

• Very large instruction window (>4K)

– memory parallelism and latency tolerance

– conventional methods intractable

Need high performance using small caches and buffers

3

TPC-C profile

[Figure: retired uop ID (x-axis) vs. cycle (y-axis), plotting Cycle Allocated and Cycle Executed. A UL2-miss uop and its dependents stall allocation for the full miss latency; allocation resumes when the miss returns. The gap between the curves is bounded by ROB size; slope = f(core pipeline).]

4

What happened?

• L2 miss and dependents (blocked)

– comprise small fraction of window

– block ROB and other resources

• Miss-independents

– comprise large fraction of window

– can proceed but don’t have resources

• ROB cannot overlap L2 miss

– memory latency exposed

– pipeline disrupted—backend stalls

5

Solution: Continual Flow Pipelines

Treat Blocked & Independents differently

• Blocked (Miss-dependents, Slice)

– create self-contained slice and drain out

– free critical resources

– execute when miss returns

• Independents

– execute and “pseudo-retire”

– retain ability to undo using checkpoint

• Automatic result integration

– no re-execution!

6

Outline

• Introduction

• Motivation

• Continual Flow Pipelines

– Concepts

– Performance

– Analysis

• Summary

Page 2

7

Motivation (1/3)

• No misses → no problem

– instructions execute quickly, free resources

– critical resources sized for L2 hit

• Long latency miss → stalls backend

– blocked instructions cannot execute

• occupy and block resources

– large window needed (>4K instructions)

8

Motivation (2/3)

• Blocked instructions cannot execute

– may continue to occupy scheduler entries

– put pressure on register file

[Figure: three-instruction example. I1 loads PR1; I2 computes PR6 = PR1 + PR4; I3 reads PR2. Completed sources (e.g., PR4, PR2) have not yet been read by the slice and cannot be released; blocked destinations (e.g., PR6) have names allocated that cannot be released.]

9

Motivation (3/3)

• Scheduler

– non-blocking proposals exist

• Pentium 4-style replay

• WIB (as large as instruction window)

• Registers

– no solution for completed source registers

– late allocation of blocked destinations requires significant pipeline changes

No unified non-blocking solution exists

10

Outline

• Introduction

• Motivation

• Continual Flow Pipelines

– Concepts

– Performance

– Analysis

• Summary

11

Continual Flow Pipelines

Treat Blocked & Independents differently

1. Blocked

– create self-contained slice and drain out

– free critical resources

– execute when miss returns

2. Independents

– execute and “pseudo-retire”

– retain ability to undo using checkpoint

3. Automatic result integration

– no re-execution

12

CFP key actions

Detect L2 miss and save state

1. Drain slice (incl. completed sources)

– identify slice using poisoning

– mark completed sources as read

– release resources

2. Process, execute, “retire” independents

3. When L2 miss serviced

– re-allocate resources, execute L2 miss slice

4. Merge outputs of slice and independents
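As a rough illustration of these four actions, here is a minimal, self-contained Python sketch. It is a toy model, not the paper's microarchitecture: the instruction format, `cfp_step`, and the default value for completed sources are invented for illustration.

```python
# Toy model of the four CFP key actions on a tiny instruction window.
# Instruction tuples and the register model are invented for illustration.
def cfp_step(window, miss_dest, miss_value):
    # 1. Drain the slice: poison the miss destination and every
    #    transitive consumer; slice entries go to the Slice Data Buffer.
    poisoned, sdb, independents = {miss_dest}, [], []
    for name, srcs, dest, fn in window:
        if any(s in poisoned for s in srcs):
            poisoned.add(dest)
            sdb.append((name, srcs, dest, fn))
        else:
            independents.append(name)   # 2. execute and "pseudo-retire"
    # 3. Miss data returns: re-insert and execute the slice.
    regs = {miss_dest: miss_value}
    replayed = []
    for name, srcs, dest, fn in sdb:
        # completed sources default to 0 in this toy model
        regs[dest] = fn(*(regs.get(s, 0) for s in srcs))
        replayed.append(name)
    # 4. Merge: slice results now sit alongside the independents' state.
    return independents, replayed, regs

window = [("I1", ("r1",), "r2", lambda a: a + 1),   # depends on the miss
          ("I2", ("r3",), "r4", lambda a: a * 2),   # miss-independent
          ("I3", ("r2",), "r5", lambda a: a - 1)]   # depends via I1
ind, rep, regs = cfp_step(window, "r1", 100)
print(ind, rep, regs["r5"])  # ['I2'] ['I1', 'I3'] 100
```

Note how I2 pseudo-retires immediately while I1 and I3 wait in the SDB and execute only after the miss value arrives.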

Page 3

13

1. Identifying and draining slice

• Propagate poison (NAV, "not a value") bits to consumers

– registers

– store queue (for memory poisoning)

• Blocked instructions (Slice)

– treat NAV source reg. as “ready” and “read”

– read completed source reg. and mark “read”

– flow and leave pipeline normally

• allow poisoned destination reg. to be released

– enter a buffer with completed source value

• record renamed names for registers
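The poisoning step above can be sketched in Python. This is an illustrative software model only: `propagate_nav` and the instruction encoding are assumptions, and real hardware poisons store-queue entries rather than tracking addresses in a set.

```python
# Sketch of NAV poison propagation, including memory poisoning through
# the store queue: a store whose data source is NAV poisons any later
# load that forwards from its address.
def propagate_nav(prog, miss_dest):
    """prog: (name, kind, srcs, dest). For a 'store', dest is the memory
    address written; for a 'load', srcs[0] is the address read."""
    nav_regs = {miss_dest}   # registers carrying NAV bits
    nav_mem = set()          # store-queue entries holding NAV data
    slice_ = []              # blocked instructions draining to the SDB
    for name, kind, srcs, dest in prog:
        if kind == "store":
            if any(s in nav_regs for s in srcs):
                nav_mem.add(dest)        # poison the store-queue entry
                slice_.append(name)
        elif kind == "load":
            if srcs[0] in nav_mem:       # forwards NAV data from a store
                nav_regs.add(dest)
                slice_.append(name)
        else:                            # ALU op
            if any(s in nav_regs for s in srcs):
                nav_regs.add(dest)
                slice_.append(name)
    return slice_

prog = [("S1", "store", ["r1"], "0x40"),    # stores miss-dependent r1
        ("L1", "load",  ["0x40"], "r7"),    # forwards from S1: poisoned
        ("A1", "alu",   ["r7"], "r8"),      # consumes the poisoned load
        ("A2", "alu",   ["r2"], "r9")]      # miss-independent
print(propagate_nav(prog, "r1"))  # ['S1', 'L1', 'A1']
```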

14

2. Processing slice & independents

• Independents execute normally

– perform memory operations

– “pseudo retire”

– release critical resources

• Rename map filter

– tracks live outs

• the last writer

– “all” allocated instructions update filter
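A minimal sketch of the filter's last-writer behavior follows. The names here are illustrative; the real filter is a fixed-size hardware structure, not a dictionary.

```python
# Sketch: a rename-map filter recording the last (youngest) writer of
# each architectural register, so live-outs survive resource release.
def build_filter(writes):
    """writes: (instr, arch_reg, phys_reg) tuples in allocation order.
    Returns arch_reg -> (instr, phys_reg) of the last writer."""
    live_out = {}
    for instr, areg, preg in writes:    # all allocated instructions update
        live_out[areg] = (instr, preg)  # younger writers overwrite older
    return live_out

writes = [("I1", "eax", "PR5"), ("I2", "ebx", "PR6"), ("I3", "eax", "PR9")]
print(build_filter(writes))  # {'eax': ('I3', 'PR9'), 'ebx': ('I2', 'PR6')}
```

Because every allocated instruction updates the filter, a later reader of `eax` links to I3's result (PR9), not I1's, regardless of whether I3 was in the slice or among the independents.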

15

3. Executing slice

• When L2 miss data returns

• Blocked instructions in the SDB (Slice Data Buffer)

– have completed source “value”

• don’t re-read register file for completed sources

• treated as immediates

– re-acquire register names

• physical-to-physical register renaming

• reading the filter ensures correct live-out linking

– re-acquire scheduler entry

– execute normally through pipeline
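The slice-replay step might be sketched as follows. This is a software analogy, not the hardware: completed source values travel with SDB entries like immediates, and poisoned sources are re-linked through physical-to-physical renaming. `replay_slice` and its operand encoding are assumptions made for illustration.

```python
import operator

# Sketch: replaying SDB entries once the miss data returns.
def replay_slice(sdb, miss_value, free_regs):
    remap = {}      # old physical name -> newly acquired physical name
    values = {}     # new physical name -> computed value
    results = []
    for op, srcs, dest in sdb:
        vals = []
        for kind, s in srcs:
            if kind == "imm":
                vals.append(s)                # completed source, from SDB
            elif s == "MISS":
                vals.append(miss_value)       # load data that just returned
            else:
                vals.append(values[remap[s]]) # producer re-linked via remap
        new_dest = free_regs.pop(0)           # re-acquire a register name
        remap[dest] = new_dest                # physical-to-physical rename
        values[new_dest] = op(*vals)
        results.append(values[new_dest])
    return results

# I1: PR6 = MISS + 4 (4 was a completed source, carried as an immediate)
# I3: PR8 = PR6 * 2 (re-linked to I1's new name)
sdb = [(operator.add, [("reg", "MISS"), ("imm", 4)], "PR6"),
       (operator.mul, [("reg", "PR6"), ("imm", 2)], "PR8")]
print(replay_slice(sdb, 10, ["PR20", "PR21"]))  # [14, 28]
```

The key point mirrored here is that the register file is never re-read for completed sources; only slice-internal dependences need re-renaming.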

16

4. Merging outputs

• Automatic integration

– no explicit operations required

– rename map filter tracks live outs continuously

17

Block diagram for CFP

[Block diagram: Decode → Allocate & Rename → uOP Queue → Scheduler → Register File/Bypass → Data Cache/FU → L2 Cache → Memory i/f. The miss load's slice leaves the pipeline and is re-renamed on re-insertion.]

18

Key components

• Slice Data Buffer

– dense FIFO SRAM structure

– significantly smaller than target window

• Slice filter and remapper

– fixed size structure

• Checkpoints

– for recovery in the event of mis-speculation

• L2 Load buffer and L2 Store queue

Simple dense structures off the critical path

Page 4

19

Outline

• Introduction

• Motivation

• Continual Flow Pipelines

– Concepts

– Performance

– Analysis

• Summary

20

CFP performance

[Chart: % speedup over base (0–60%) for SFP2K, SINT2K, WEB, MM, PROD, SERVER, WS. Series: Base+CFP vs. Base (ideal).]

21

CFP performance intuition

• Memory latency tolerance

– significant useful independent work

– no re-execution, automatic result integration

– can tolerate first and isolated misses

• Memory level parallelism

– when clustered misses occur

– overlap multiple misses

CFP achieves BOTH

22

Analysis: Base description

[Figure: the TPC-C base profile from slide 3, repeated: retired uop ID vs. cycle (Cycle Allocated, Cycle Executed). The UL2-miss uop and dependents stall allocation for the miss latency; the gap between the curves is bounded by ROB size; slope = f(core pipeline).]

23

Analysis: SFP2K overlay

[Figure: BASE vs. CFP cycle timelines over roughly 4,800 retired uops, with the 1st and 2nd UL2 misses and the slice-reinsertion points marked. Takeaway: CFP shows no stalls; all misses are hidden.]

24

Analysis: TPC-C overlay (events)

[Figure: BASE vs. CFP timelines across the 1st, 2nd, and 3rd UL2 misses; the slopes of the two curves are identical between misses. Slice branch mispredicts and independent branch mispredicts are marked.]

Page 5

25

Analysis: CFP vs. Runahead

[Figure: BASE vs. Runahead (RA) vs. CFP timelines across three UL2 misses, with slice-reinsertion points marked. BASE: 3 stalls. RA: 2 "stalls", hides the 2nd miss. CFP: 0 stalls, hides ALL misses.]

26

CFP+CPR vs. CFP+ROB

[Figure: retired uop ID vs. cycle for CFP+CPR and CFP+ROB. CFP+CPR's slope is lower than CFP+ROB's, i.e., lower core CPI.]

27

Summary

• Treat miss-dependents differently

– Miss-dependent blocked instructions

• release register, scheduler; buffer outside pipeline

– Independent instructions

• execute, pseudo-retire, don’t re-execute

• Unified non-blocking proposal

– High memory-latency tolerance

– Enables high cache efficiency

– High single-thread performance

– Enables more cores instead of more cache

28

CFP cache efficiency

[Chart: % speedup over baseline ROB with 2MB L2 (−40% to 60%) for SFP2K, SINT2K, WEB, MM, PROD, SERVER, WS. Series: ROB+L2_512KB, ROB+L2_1MB, ROB+L2_4MB, ROB+L2_8MB, CFP+L2_512KB, CFP+L2_1MB, CFP+L2_2MB.]

29

Analysis: TPC-C and CFP

[Figure: TPC-C under CFP: retired uop ID vs. cycle (Cycle Allocated, Cycle Executed). Despite the UL2-miss uop and its dependents, and a branch mispredicted among the independents, allocation and execution continue.]