Continual Flow Pipelines (web.cecs.pdx.edu/~zeshan/cfp_asplos2004-6up.pdf)
TRANSCRIPT
1
Tuesday, October 12, 2004
Continual Flow Pipelines
Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton
Intel Corporation
ASPLOS-XI October 2004
2
Problem: Memory latency
• Long latency memory operations
– disrupt pipeline and stall back-end
• Very large caches
– inefficient performance
– negatively impact die size
• Very large instruction window (>4K)
– memory parallelism and latency tolerance
– conventional methods intractable
Need high performance using small caches and buffers
3
TPC-C profile
[Plot: cycle vs. retired uop ID, showing Cycle Allocated and Cycle Executed curves. Annotations: UL2 miss uop and dependents; alloc stalls; miss latency; alloc resumes; ROB size; slope = f (core pipeline).]
4
What happened?
• L2 miss and dependents (blocked)
– comprise small fraction of window
– block ROB and other resources
• Miss-independents
– comprise large fraction of window
– can proceed but don’t have resources
• ROB cannot overlap L2 miss
– memory latency exposed
– pipeline disrupted—backend stalls
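The blocking behavior above can be illustrated with a toy model (illustrative only, not from the paper): once the L2 miss reaches the head of an in-order-retirement ROB, allocation stalls for roughly the miss latency minus the time spent filling the ROB, even though the younger miss-independent instructions finished long ago.

```python
# Toy model (illustrative, not from the paper): with in-order retirement,
# an L2 miss at the ROB head blocks allocation once the ROB fills, even
# though younger miss-independent instructions completed long ago.
from collections import deque

def alloc_stall_cycles(rob_size, miss_latency, n_insts):
    """Each cycle: retire the head if its result is ready, then allocate
    one instruction if the ROB has room. Instruction 0 is the L2 miss
    (ready at cycle `miss_latency`); all others are ready the cycle
    after allocation. Returns cycles the front-end spends stalled."""
    rob = deque()                       # entries: [inst_id, ready_cycle]
    next_inst, cycle, stalls = 0, 0, 0
    while next_inst < n_insts or rob:
        if rob and rob[0][1] <= cycle:  # in-order retirement
            rob.popleft()
        if next_inst < n_insts:
            if len(rob) < rob_size:
                ready = miss_latency if next_inst == 0 else cycle + 1
                rob.append([next_inst, ready])
                next_inst += 1
            else:
                stalls += 1             # ROB full behind the miss
        cycle += 1
    return stalls
```

With an 8-entry ROB and a 100-cycle miss, allocation stalls for 92 cycles, matching the flat "alloc stalls" region in the TPC-C profile.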
5
Solution: Continual Flow Pipelines
Treat Blocked & Independents differently
• Blocked (Miss-dependents, Slice)
– create self-contained slice and drain out
– free critical resources
– execute when miss returns
• Independents
– execute and “pseudo-retire”
– retain ability to undo using checkpoint
• Automatic result integration
– no re-execution!
6
Outline
• Introduction
• Motivation
• Continual Flow Pipelines
– Concepts
– Performance
– Analysis
• Summary
7
Motivation (1/3)
• No misses → no problem
– instructions execute quickly, free resources
– critical resources sized for L2 hit
• Long latency miss → stalls backend
– blocked instructions cannot execute
• occupy and block resources
– large window needed (>4K instructions)
8
Motivation (2/3)
• Blocked instructions cannot execute
– may continue to occupy scheduler entries
– put pressure on register file
[Figure: register example. I1: PR1 ← LD (L2 miss); I2: PR6 ← PR1 + PR4; I3 reads PR2. Completed sources (PR4, PR2): not yet read by the slice, cannot be released. Blocked destinations (PR6): name allocated, cannot be released.]
9
Motivation (3/3)
• Scheduler
– non-blocking proposals exist
• Pentium 4-style replay
• WIB (as large as instruction window)
• Registers
– no solution for completed source registers
– late allocation of blocked destinations requires significant pipeline changes
No unified non-blocking solution exists
10
Outline
• Introduction
• Motivation
• Continual Flow Pipelines
– Concepts
– Performance
– Analysis
• Summary
11
Continual Flow Pipelines
Treat Blocked & Independents differently
1. Blocked
– create self-contained slice and drain out
– free critical resources
– execute when miss returns
2. Independents
– execute and “pseudo-retire”
– retain ability to undo using checkpoint
3. Automatic result integration
– no re-execution
12
CFP key actions
Detect L2 miss and save state
1. Drain slice (incl. completed sources)
– identify slice using poisoning
– mark completed sources as read
– release resources
2. Process, execute, “retire” independents
3. When L2 miss serviced
– re-allocate resources, execute L2 miss slice
4. Merge outputs of slice and independents
13
1. Identifying and draining slice
• Propagate poison (NAV) bits to consumers
– registers
– store queue (for memory poisoning)
• Blocked instructions (Slice)
– treat NAV source reg. as “ready” and “read”
– read completed source reg. and mark “read”
– flow and leave pipeline normally
• allow poisoned destination reg. to be released
– enter a buffer with completed source value
• record renamed names for registers
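The poisoning step above can be sketched as follows (a minimal model; the instruction encoding is mine, and memory poisoning via the store queue is omitted):

```python
# Sketch of slice identification via poison (NAV) bit propagation.
# Encoding is illustrative: each instruction is (dest_reg, src_regs).
def split_slice(insts, miss_dest):
    """Partition in-order instructions into the miss-dependent slice
    and the miss-independents. The register written by the miss load
    starts poisoned; reading a poisoned register poisons the reader's
    destination, while a valid rewrite clears the poison."""
    poisoned = {miss_dest}
    slice_, indep = [], []
    for dest, srcs in insts:
        if any(s in poisoned for s in srcs):
            poisoned.add(dest)        # NAV propagates to the destination
            slice_.append((dest, srcs))
        else:
            poisoned.discard(dest)    # redefinition produces a valid value
            indep.append((dest, srcs))
    return slice_, indep
```

Instructions placed in the slice drain to the buffer carrying their already-completed source values, so their registers can be released immediately.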
14
2. Processing slice & independents
• Independents execute normally
– perform memory operations
– “pseudo retire”
– release critical resources
• Rename map filter
– tracks live outs
• the last writer
– “all” allocated instructions update filter
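A minimal sketch of the rename map filter (class and method names are mine): because every allocated instruction updates it, the filter always names the last writer, i.e. the live out, of each architectural register.

```python
# Sketch of the rename map filter tracking live outs. Every allocated
# instruction updates the filter, so for each architectural register it
# records the last writer. Names here are illustrative.
class RenameFilter:
    def __init__(self):
        self.last_writer = {}              # arch register -> inst id

    def allocate(self, inst_id, dest):
        self.last_writer[dest] = inst_id   # later writers overwrite earlier

    def is_live_out(self, inst_id, dest):
        # True only for the most recent writer of `dest`
        return self.last_writer.get(dest) == inst_id
```

When the slice later re-acquires names, consulting this filter identifies which slice outputs are live outs that younger instructions must see, which is what makes result integration automatic.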
15
3. Executing slice
• When L2 miss data returns
• Blocked instructions in SDB
– have completed source “value”
• don’t re-read register file for completed sources
• treated as immediates
– re-acquire register names
• physical-to-physical register renaming
• filter read ensures appropriate live-out linking
– re-acquire scheduler entry
– execute normally through pipeline
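Slice re-execution from the Slice Data Buffer can be sketched like this (the encoding and the tiny ALU are mine, for illustration only): completed sources captured at drain time act as immediates, so the register file is never re-read for them, while poisoned sources name slice-internal producers.

```python
# Sketch of slice re-execution from the Slice Data Buffer (SDB) once
# the miss data returns. Completed source values stored in the SDB act
# as immediates; poisoned sources link to slice-internal producers.
def execute_slice(sdb, miss_value):
    """sdb: list of (dest, op, srcs), each src either
    ('imm', value) - completed source value captured in the SDB, or
    ('reg', name)  - poisoned source produced inside the slice."""
    regs = {}
    ops = {"ld": lambda: miss_value,     # the miss load's returned data
           "add": lambda a, b: a + b}
    for dest, op, srcs in sdb:
        vals = [v if kind == "imm" else regs[v] for kind, v in srcs]
        regs[dest] = ops[op](*vals)      # no register-file read for imms
    return regs
```

In the real pipeline the `dest` names here would be freshly re-acquired physical registers (physical-to-physical renaming); this sketch just shows the dataflow.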
16
4. Merging outputs
• Automatic integration
– no explicit operations required
– rename map filter tracks live outs continuously
17
Block diagram for CFP
[Block diagram: Decode → Allocate & Rename → Queue uOP → Scheduler → Register File/Bypass → Data Cache/FU → L2 Cache → Memory i/f. The miss load (LD) slice is re-renamed on re-insertion.]
18
Key components
• Slice Data Buffer
– dense FIFO SRAM structure
– significantly smaller than target window
• Slice filter and remapper
– fixed size structure
• Checkpoints
– for recovering state (e.g., on a branch misprediction)
• L2 Load buffer and L2 Store queue
Simple dense structures off the critical path
19
Outline
• Introduction
• Motivation
• Continual Flow Pipelines
– Concepts
– Performance
– Analysis
• Summary
20
CFP performance
[Chart: % speedup over base (y-axis 0% to 60%) for SFP2K, SINT2K, WEB, MM, PROD, SERVER, WS. Series: Base+CFP, Base (ideal).]
21
CFP performance intuition
• Memory latency tolerance
– significant useful independent work
– no re-execution, automatic result integration
– can tolerate first and isolated misses
• Memory level parallelism
– when clustered misses occur
– overlap multiple misses
CFP achieves BOTH
22
Analysis: Base description
[Plot: cycle vs. retired uop ID, showing Cycle Allocated and Cycle Executed curves. Annotations: UL2 miss uop and dependents; alloc stalls; miss latency; alloc resumes; ROB size; slope = f (core pipeline).]
23
Analysis: SFP2K overlay
[Overlay plot: cycle vs. retired uop ID for BASE and CFP on SFP2K. Annotations: 1st UL2 miss, 2nd UL2 miss, slice reinsertion.]
CFP: No stalls, all misses hidden
24
Analysis: TPC-C overlay (events)
[Overlay plot: cycle vs. retired uop ID for BASE and CFP on TPC-C. Annotations: 1st, 2nd, 3rd UL2 miss; slice branch mispredicts; independent branch mispredicts; slopes identical.]
25
Analysis: CFP vs. Runahead
[Plot: cycle vs. retired uop ID for BASE, RA (runahead), and CFP. Annotations: 1st, 2nd, 3rd UL2 miss; slice reinsertion.]
RA: 2 “stalls”, hides 2nd miss
BASE: 3 stalls
CFP: 0 stalls, hides ALL misses
26
CFP+CPR vs. CFP+ROB
[Plot: cycle vs. retired uop ID for CFP+CPR and CFP+ROB.]
CFP+CPR slope less than CFP+ROB
lower core CPI
27
Summary
• Treat miss-dependents differently
– Miss-dependent blocked instructions
• release register, scheduler; buffer outside pipeline
– Independent instructions
• execute, pseudo-retire, don’t re-execute
• Unified non-blocking proposal
– High memory-latency tolerance
– Enables high cache efficiency
– High single-thread performance
– Enables more cores instead of more cache
28
CFP cache efficiency
[Chart: % speedup over baseline ROB (2MB L2), y-axis -40% to 60%, for SFP2K, SINT2K, WEB, MM, PROD, SERVER, WS. Series: ROB+L2_512KB, ROB+L2_1MB, ROB+L2_4MB, ROB+L2_8MB, CFP+L2_512KB, CFP+L2_1MB, CFP+L2_2MB.]
29
Analysis: TPC-C and CFP
[Plot: cycle vs. retired uop ID (Cycle Allocated, Cycle Executed) for TPC-C with CFP. Annotations: UL2 miss uop and dependents; branch mispredicted in independents; alloc/exec continues.]