Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Duke :: March 18, 2010
Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Andrew Hilton, University of Pennsylvania ([email protected])
Multi-Core Architecture
Single-thread performance growth has diminished
• Clock frequency has hit an energy wall
• Instruction level parallelism (ILP) has hit energy, memory, and idea walls
Future chips will be heterogeneous multi-cores
• Few high-performance out-of-order cores (Core i7) for serial code
• Many low-power in-order cores (Atom) for parallel code
Multi-Core Performance
Obvious performance key: write more parallel software
Less obvious performance key: speed up existing cores
• Core i7? Keep the serial portion from becoming a bottleneck (Amdahl)
• Atoms? Parallelism is typically not elastic
Key constraint: energy
• Thermal limitations of the chip, cost of energy, cooling costs, …
“TurboBoost”
Existing technique: Dynamic Voltage Frequency Scaling
• Increase clock frequency (requires increasing voltage)
+ Simple
+ Applicable to both types of cores
- Not very energy-efficient (energy ≈ frequency²)
- Doesn’t help “memory bound” programs (performance < frequency)
Effectiveness of “TurboBoost”
Example: TurboBoost 3.2 GHz -> 4.0 GHz (25%)
• Ideal conditions: 25% speedup, constant Energy × Delay²
• Memory bound programs: far from ideal
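The arithmetic behind this slide can be checked with a toy model (my sketch, not from the talk): assume energy over a fixed amount of work scales roughly with frequency squared, and that the memory-bound fraction of execution time does not speed up with the clock.

```python
# Toy DVFS model (an illustrative assumption, not from the talk):
# energy over fixed work ~ boost^2; only the compute fraction of
# execution time speeds up with the clock.

def dvfs_ed2(boost, mem_fraction=0.0):
    """Relative Energy * Delay^2 after a clock boost.

    boost: frequency ratio, e.g. 1.25 for 3.2 GHz -> 4.0 GHz.
    mem_fraction: fraction of baseline time stalled on memory.
    """
    delay = mem_fraction + (1.0 - mem_fraction) / boost
    energy = boost ** 2
    return energy * delay ** 2

print(dvfs_ed2(1.25))       # ideal: 25% speedup at constant ED^2 -> 1.0
print(dvfs_ed2(1.25, 0.5))  # memory bound: ED^2 gets worse (> 1.0)
```

Under this model a 25% boost is ED²-neutral only with no memory stalls; any memory-bound fraction turns the boost into a net ED² loss, which is the slide's point.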
“Memory Bound”
Main memory is slow relative to core (~250 cycles)
Cache hierarchy makes most accesses fast
• “Memory bound” = many L3 misses
• … or in some cases many L2 misses
• … or, for in-order cores, many L1 misses
• Clock frequency (“TurboBoost”) accelerates only the core/L1/L2
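The effect of the hierarchy can be illustrated with a standard average-memory-access-time calculation; the L2/L3/memory latencies follow the figure on this slide, while the L1 latency of 4 cycles and all hit rates are invented for illustration.

```python
# Average memory access time (AMAT) for the hierarchy on this slide.
# Latencies: L2 ~10, L3 ~40, memory ~250 cycles (the L1 latency of 4
# is my assumption). Hit rates are made-up examples.

def amat(l1_hit, l2_hit, l3_hit, l1=4, l2=10, l3=40, mem=250):
    """Expected cycles per access, with misses cascading down levels."""
    return l1 + (1 - l1_hit) * (l2 + (1 - l2_hit) * (l3 + (1 - l3_hit) * mem))

print(amat(0.95, 0.90, 0.90))  # compute bound: ~4.8 cycles/access
print(amat(0.90, 0.70, 0.50))  # memory bound: ~10 cycles/access
```

A modest drop in L3 hit rate roughly doubles the average access cost, while the core clock touches none of the 250-cycle term.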
[Figure: Core i7 with per-core L1$ and L2$ (10 cycles) and a shared L3$ (40 cycles); Atom cores with L1$ and L2$ (10 cycles); main memory 250 cycles away]
Goal: Help Memory Bound Programs
Wanted: complementary technique to TurboBoost
Successful applicants should
• Help “memory bound” programs
• Be at least as energy efficient as TurboBoost (at least constant ED²)
• Work well with both out-of-order and in-order cores
Promising previous idea: latency tolerance
• Helps “memory bound” programs
My work: energy efficient latency tolerance for all cores
• Today: primarily out-of-order (BOLT) [HPCA’10]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
LLC (Last-Level Cache) Misses
What is this picture? Loads A & H miss in the caches
This is an in-order processor
• Misses serialize: latencies add and dominate performance
We want Miss-Level Parallelism (MLP): overlap A & H
[Timeline: the two 250-cycle miss latencies back to back (not to scale)]
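The difference between the serialized picture above and the overlapped one on the next slide is only whether the two 250-cycle latencies add or overlap; as a trivial model:

```python
# Toy timing model: misses either serialize (latencies add, the
# in-order case) or overlap (MLP).

def total_miss_latency(miss_cycles, overlapped):
    """Total stall cycles for a set of outstanding misses."""
    return max(miss_cycles) if overlapped else sum(miss_cycles)

print(total_miss_latency([250, 250], overlapped=False))  # in-order: 500
print(total_miss_latency([250, 250], overlapped=True))   # with MLP: 250
```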
Miss-Level Parallelism (MLP)
One option: prefetching
• Requires predicting the address of H at A
Another option: out-of-order execution (Core i7)
• Requires a sufficiently large “window” to do this
[Timeline: the two 250-cycle misses overlapped]
Out-of-Order Execution & “Window”
Important “window” structures
• Register file (number of in-flight instructions): 128 insns on Core i7
• Issue queue (number of un-executed instructions): 36 on Core i7
• Sized to “tolerate” (keep the core busy for) ~30-cycle latencies
• To tolerate ~250 cycles, need order-of-magnitude bigger structures
Latency tolerance big idea: scale window virtually
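The sizing argument above is essentially Little's law: in-flight instructions ≈ issue width × latency to cover. A sketch (the width of 4 is my assumption for a Core-i7-class core, not a number from the talk):

```python
# Little's-law-style estimate of the slide's window-sizing argument.
# The issue width of 4 is an assumption for a Core-i7-class core.

def window_needed(latency_cycles, width=4):
    """In-flight instructions needed to keep a width-wide core busy
    across a latency of latency_cycles."""
    return width * latency_cycles

print(window_needed(30))   # 120: about the size of real windows (~128)
print(window_needed(250))  # 1000: an order of magnitude larger
```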
[Pipeline diagram: Fetch -> Rename -> Issue Queue -> Register File/FUs, with I$/D$ and a Reorder Buffer; load A is an LLC miss, B & C are completed, D is unexecuted]
Latency Tolerance
Prelude: add a slice buffer
• New structure (not in conventional processors)
• Can be relatively large: low bandwidth, not in the critical execution core
[Pipeline diagram: a slice buffer added alongside the reorder buffer]
Phase #1: long-latency cache miss -> slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot
• Propagate “poison” to identify dependents
• Pseudo-execute them too
• Proceed under the miss
[Pipeline diagram: A and D deferred to the slice buffer; younger instructions E–I occupy the window]
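Phase #1's poison propagation can be sketched in a few lines; this is a toy model of the idea, not BOLT's actual hardware (the register names are invented):

```python
# Toy Phase #1: a miss poisons its destination register; anything
# reading a poisoned register is pseudo-executed into the slice
# buffer and poisons its own destination in turn.

def slice_out(instructions, miss_dest):
    """instructions: list of (name, source_regs, dest_reg) in program
    order. Returns the names copied to the slice buffer."""
    poisoned = {miss_dest}
    slice_buffer = []
    for name, srcs, dest in instructions:
        if any(s in poisoned for s in srcs):
            poisoned.add(dest)          # propagate poison
            slice_buffer.append(name)   # pseudo-execute: defer
    return slice_buffer

# Load A (dest r1) misses; B and C are miss-independent; D consumes r1.
insns = [("B", ["r2"], "r3"), ("C", ["r3"], "r4"), ("D", ["r1"], "r5")]
print(slice_out(insns, "r1"))  # ['D']
```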
Phase #2: cache miss return -> slice in
• Allocate new registers
• Put in issue queue
• Re-execute the instruction
• Problems with sliced-in instructions (exceptions, mis-predictions)?
• Recover to a checkpoint (taken before A)
[Pipeline diagram: the slice re-inserted; an exception triggers recovery to the checkpoint]
Slice Self Containment
Important for latency tolerance: self-contained slices
• A, D, & E have miss-independent inputs
• Capture these values during slice out
• This decouples the slice from the rest of the program
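Slice self-containment can be added to the earlier poison sketch: at slice out, each deferred instruction captures the values of its miss-independent sources (a toy model; the register names and values are invented):

```python
# Toy model of slice self-containment: when an instruction is sliced
# out, its miss-independent source values are captured into the slice
# buffer entry, decoupling the slice from the rest of the window.

def slice_out_with_capture(instructions, regfile, poisoned):
    """Returns (name, captured-values) pairs for the deferred slice."""
    slice_buffer = []
    for name, srcs, dest in instructions:
        if any(s in poisoned for s in srcs):
            ready = {s: regfile[s] for s in srcs if s not in poisoned}
            poisoned.add(dest)
            slice_buffer.append((name, ready))
    return slice_buffer

regs = {"r2": 7, "r3": 40}
out = slice_out_with_capture(
    [("D", ["r1", "r2"], "r4"), ("E", ["r4", "r3"], "r5")],
    regs, {"r1"})
print(out)  # [('D', {'r2': 7}), ('E', {'r3': 40})]
```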
Latency tolerance example
• Slice out: the miss and dependent instructions “grow” the window
• Slice in: after the miss returns
[Timeline: execution proceeds under the miss instead of stalling]
Energy: 1.5x, Delay: 0.5x (Energy ≈ #Boxes)
Combine into ED²; ED² < 1.0 = good
Here: ED² = 1.5 × 0.5² ≈ 0.38x
Previous Design: CFP
Prior design: Continual Flow Pipelines [Srinivasan’04]
• Obtains speedups, but also slowdowns
• Typically not energy efficient
Energy-Efficient Latency Tolerance?
Efficient implementation
• Re-use existing structures when possible
• New structures must be simple and low-overhead
Runtime efficiency
• Minimize superfluous re-executions
Previous designs have not achieved (or considered) these
• Waiting Instruction Buffer [Lebeck’02]
• Continual Flow Pipeline [Srinivasan’04]
• Decoupled Kilo-Instruction Processor [Pericas ’06,’07]
Sneak Preview: Final Results
This talk: my work on efficient latency tolerance
+ Improved performance
+ Performance robustness (do no harm)
+ Performance is energy efficient
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Examination of the Problem
Problem with existing design: register management
• Miss-dependent instructions free registers when they execute
• Actually, all instructions free registers when they execute
What’s wrong with this?
• No instruction-level precise state hurts on branch mispredictions
• Execution-order slice buffer: hard to re-rename & re-acquire registers
[Pipeline diagram: execution-ordered slice buffer (A D H K L E) with checkpoints]
BOLT Register Management
Youngest instructions: keep in the reorder buffer
• Conventional, in-order register freeing
Miss-dependent instructions: in the slice buffer
• Execution-based register freeing
In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers
• Poisoned instructions enter the slice buffer
• Completed instructions are done and simply removed
[Pipeline diagram: at speculative retirement, poisoned A moves from the ROB head into the slice buffer]
Benefits of BOLT’s register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT (“HyperThreading”)
• Scales a single, conventionally sized register file
Register File Contribution #1: Hybrid register management, the best of both worlds
![Page 44: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/44.jpg)
[ 44 ][ 44 ]
BOLT Register Management
Benefits of BOLT’s register management• Youngest instructions (ROB) get conventional recovery (do no harm)• Program order slice buffer allows re-use of SMT (“HyperThreading”)• Scale single, conventionally sized register file
Challenging part: two algorithms, one register file
• Note: two register files is not a good solution
Two Algorithms, One Register File
Conventional algorithm (ROB)
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving a queue pointer
Aggressive algorithm (slice instructions)
• Execution-driven reference counting scheme
How to combine these two algorithms?
• Execution-based algorithm uses reference counting
• Efficiently encode the conventional algorithm as reference counting
• Combine both into one reference count matrix
Register File Contribution #2: Efficient implementation of the new hybrid algorithm
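One way to picture the hybrid, as a guess at the flavor of the scheme rather than BOLT's actual matrix encoding: every physical register carries a reference count, both the ROB's in-order freeing and the slice buffer's execution-driven freeing become decrements, and a register recycles at count zero.

```python
# Toy reference-counted free list: both management algorithms reduce
# to increments/decrements; a register is recycled at count zero.
# (Illustrative only; not BOLT's reference count matrix.)

class RefCountedRegs:
    def __init__(self, n):
        self.count = [0] * n
        self.free = list(range(n))

    def allocate(self, refs):
        reg = self.free.pop()
        self.count[reg] = refs
        return reg

    def release(self, reg):
        self.count[reg] -= 1
        if self.count[reg] == 0:
            self.free.append(reg)      # recycle at zero

regs = RefCountedRegs(4)
r = regs.allocate(refs=2)  # e.g. one reference from the ROB, one from a slice
regs.release(r)            # in-order retirement releases its reference
print(r in regs.free)      # False: the slice still holds the register
regs.release(r)            # slice re-execution completes
print(r in regs.free)      # True: register recycled
```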
Management of Loads and Stores
Large window requires support for many loads and stores
• The window is effectively A-V now; what about the loads & stores?
• This could be an hour+ talk by itself… so just a small piece
Store to Load Dependences
Different from register state: cannot capture inputs
• Store -> load dependences are determined by addresses
• Cannot “capture” them like registers
• Must be able to find the proper (older, matching) stores
• Must avoid younger matching stores (“write-after-read” hazards)
Conventional store buffer (tail/younger on the left, head/older on the right):

index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:  0    0    1    0    1    0    0
Conventional store queue/store buffer
• Holds stores in program order
• Loads search “associatively” (all entries in parallel)
• Doesn’t scale to the sizes we need
For latency tolerance, we need…
• Poison (easy)
• A scalable way to search, accounting for age (hard)
Chained Store Buffer (tail/younger on the left, head/older on the right):

index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:  0    0    1    0    1    0    0
link:    44   81   0    15   0    77   0

[Root table: 64 entries indexed by low address bits, e.g. AC -> 85, B0 -> 86, B4 -> 83, B8 -> 21]

Replace associative search with iterative indexed search
• Overlay the store buffer with an address-based hash table
• Exploits the in-order nature of speculative retirement to build the chains
Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC: root[AC] -> store 85 (address 2AC, no match) -> link to store 81 (address 1AC): match, forward
Deferred loads ignore younger stores to avoid WAR hazards
• For example, a deferred load to address 1B4 …
• … whose immediately older store is 81 (noted when entering the pipeline)
• Walk: root[B4] -> store 83 (address 1B4, but younger than 81: ignore) -> link 15 -> … -> go to the D$
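The chained search on the last two slides can be sketched as follows. One simplification to flag: the talk says deferred loads note their immediately older store when entering the pipeline, whereas this sketch compares age by store-buffer index; the addresses and indices mirror the slide's example.

```python
# Sketch of the chained store buffer: a root hash table maps an
# address hash to the youngest matching store; each entry links to
# the next-older store with the same hash. A load walks the chain,
# skipping stores younger than itself, until it finds an older
# matching store or falls off the chain (go to the D$).

class ChainedStoreBuffer:
    def __init__(self, root_entries=64):
        self.root = {}       # hash -> youngest store index
        self.entries = {}    # store index -> (addr, value, link)
        self.root_entries = root_entries

    def _hash(self, addr):
        return addr % self.root_entries   # low address bits

    def insert(self, index, addr, value):
        h = self._hash(addr)
        self.entries[index] = (addr, value, self.root.get(h))
        self.root[h] = index

    def search(self, load_index, addr):
        """Youngest store older than the load matching addr, else None."""
        idx = self.root.get(self._hash(addr))
        while idx is not None:
            s_addr, value, link = self.entries[idx]
            if idx < load_index and s_addr == addr:
                return value
            idx = link   # younger or hash-conflicting store: keep walking
        return None

sb = ChainedStoreBuffer()
sb.insert(81, 0x1AC, 0x34)   # 1AC and 2AC share low bits: they chain
sb.insert(83, 0x1B4, 0x56)
sb.insert(85, 0x2AC, 0x78)
print(hex(sb.search(90, 0x1AC)))  # 0x34: match via 85 -> 81, forward
print(sb.search(82, 0x1B4))       # None: only a younger store, go to D$
```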
Chained Store Buffer
+ Non-speculative search, scalable
+ Fast
• Most non-forwarding loads access the root table only
• Most forwarding loads find their store on the first hop
• Average number of excess hops < 0.05 with a 64-entry root table
BOLT: Implementation Recap
Three key implementation efficiencies in BOLT
1. Re-use of existing renaming hardware
2. Hybrid register management algorithm in a single register file
3. Efficient management of loads and stores
Experimental Evaluation
SPEC 2006 benchmarks
• Focus on memory bound programs (TurboBoost gets < 15%)
Performance: detailed cycle-level timing simulation in x86
• Baseline “Core i7” (includes prefetching)
Energy: re-execution overhead + new structures
• Estimate energy of new structures using CACTI-4.1 [Tarjan06]
CFP vs. BOLT
• Speedups (CFP -> BOLT): overall 5% -> 11%, MEM 14% -> 18%
• Re-execution: increases due to more latency tolerance
• ED²: overall improvement
• Fewer and simpler new structures (lower energy)
• Increased re-executions typically correspond to higher performance
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Non-Blocking Latency Tolerance
Latency tolerance = non-blocking execution
• Re-execution should not block the pipeline either
• Suppose B & C miss (C depends on B)
• C should also not block the pipeline: reapply latency tolerance
Execution Inefficiency
Dynamic inefficiency: excessive multiple re-execution
• Observe: multiple re-execution comes from dependence on multiple loads
• Two possibilities: loads in parallel or loads in series
• Different approaches to each
Loads in Parallel
Example: accumulating sum

    for (i = 0; i < n; i++)
        total += array[i];

Assembly:

    loop: load [r1] -> r2
          add  r2 + r3 -> r3
          add  r1 + 4 -> r1
          bne  r1, r5 loop
(execution diagram: successive iterations' loads issue and miss in parallel)
![Page 66: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/66.jpg)
Loads in Parallel
[ 66 ]
(execution timeline: the independent loads' misses overlap (MLP!))
Energy: 3.8x | Delay: 0.4x | ED2: 0.6x
Goal: keep the performance, reduce re-executions
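The ED2 figures on these slides follow directly from the metric's definition, ED2 = energy x delay^2, with both factors normalized to the baseline core. A quick sketch using the slide's own numbers:

```c
/* ED^2 = (relative energy) x (relative delay)^2; lower is better.
   Both inputs are normalized to the baseline (non-deferring) core. */
double ed2(double energy, double delay) {
    return energy * delay * delay;
}
/* With the slide's numbers: ed2(3.8, 0.4) = 3.8 * 0.16, roughly 0.61,
   i.e. the "ED2: 0.6x" shown despite the 3.8x energy cost. */
```

The same arithmetic explains the later join-pruning result: cutting energy to 2.8x at the same 0.4x delay gives 2.8 x 0.16, roughly the 0.45x ED2 on slide 72.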
![Page 67: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/67.jpg)
Join Pruning
A’s miss poisoned B… so A’s return provides its antidote
[ 67 ]
![Page 68: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/68.jpg)
Join Pruning
A’s miss poisoned B… so A’s return provides its antidote
B now executes correctly, provides antidote to D
• D must capture this input
[ 68 ]
![Page 69: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/69.jpg)
Join Pruning
A’s miss poisoned B… so A’s return provides its antidote
B now executes correctly, provides antidote to D
• D must capture this input
D is still poisoned by C, cannot provide antidote
[ 69 ]
![Page 70: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/70.jpg)
Join Pruning
A’s miss poisoned B… so A’s return provides its antidote
B now executes correctly, provides antidote to D
• D must capture this input
D is still poisoned by C, cannot provide antidote
F is not receiving any antidote, no need to re-execute
[ 70 ]
![Page 71: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/71.jpg)
[ 71 ]
Antidote Vector
BOLT filters re-execution using an antidote bit-vector
• Track (per logical register) whether an antidote is available
• Also works through store-to-load dependences (the poisoning store is known)
(pipeline diagram: Fetch/Rename, Issue Queue, Register File, FUs, I$/D$, Reorder Buffer, Chained Store Buffer, and the Slice Buffer with antidote check)
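The poison/antidote filtering can be sketched in a few lines. This is an illustrative model only, with invented names and register count, not BOLT's actual hardware:

```c
#include <stdbool.h>

#define NREGS 32  /* illustrative logical register count */

bool poison[NREGS];   /* reg holds a value derived from an outstanding miss */
bool antidote[NREGS]; /* the poisoning miss has returned; value is now good */

/* A load misses: its destination register becomes poisoned. */
void miss(int dst) { poison[dst] = true; }

/* A dependent instruction executes: poison propagates to its output. */
void depend(int src, int dst) {
    if (poison[src] && !antidote[src]) poison[dst] = true;
}

/* The miss that wrote reg returns: its antidote becomes available. */
void miss_returns(int reg) { antidote[reg] = true; }

/* Join pruning: re-dispatch a deferred instruction only if a poisoned
   input now has its antidote; otherwise re-execution is wasted work. */
bool worth_reexecuting(int src) {
    return poison[src] && antidote[src];
}
```

In the slides' terms: B, poisoned directly by A, re-executes once A's antidote arrives, while F, whose inputs see no antidote, is pruned.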
![Page 72: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/72.jpg)
Join Pruning
[ 72 ]
(execution timelines before and after join pruning)
Without pruning: Energy: 3.8x | Delay: 0.4x | ED2: 0.6x
With join pruning: Energy: 2.8x | Delay: 0.4x | ED2: 0.45x
![Page 73: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/73.jpg)
Join Pruning Performance
[ 73 ]
• Performance: strictly better (especially lbm)
![Page 74: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/74.jpg)
Join Pruning Performance
[ 74 ]
• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)
![Page 75: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/75.jpg)
Join Pruning Performance
[ 75 ]
• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)
• ED2: overall improvements (again, especially lbm)
![Page 76: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/76.jpg)
Loads in Series
Example: count the elements of a linked list

    while (node != NULL) {
        count++;
        node = node->next;
    }

Assembly:

    loop: load [r1] -> r1
          add  r2 + 1 -> r2
          bnz  r1, loop
[ 76 ]
(execution diagram: each iteration's load depends on the previous load; no overlap)
![Page 77: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/77.jpg)
Pointer Chasing
[ 77 ]
(execution timeline: deferral and re-execution cannot overlap the serial chain)
Energy: 2.2x | Delay: 1x | ED2: 2.2x
"Dr! Dr! It hurts when I apply latency tolerance to pointer chasing…"
![Page 78: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/78.jpg)
So Don’t Do It…
[ 78 ]
(timeline: the chain simply blocks instead of deferring)
Energy: 1x | Delay: 1x | ED2: 1x (vs. Energy: 2.2x | Delay: 1x | ED2: 2.2x when deferring)
![Page 79: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/79.jpg)
Loads in Series
Not all dependent loads are bad:

    for (int i = 0; i < n; i++)
        x += objects[i]->val;

Assembly:

    loop: load [r1] -> r2
          load [r2] -> r3
          add  r4, r3 -> r4
          add  r1, 4 -> r1
          bne  r1, r5 loop

Important: prune pointer chasing only
• Preserve general indirection
[ 79 ]
(execution diagram: each static load's stream is parallel across iterations)
![Page 80: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/80.jpg)
Pointer Chasing
How to distinguish the two?
[ 80 ]

    loop1: load [r1] -> r1
           add  r2 + 1 -> r2
           bnz  r1, loop1

    loop2: load [r1] -> r2
           load [r2] -> r3
           add  r4, r3 -> r4
           add  r1, 4 -> r1
           bne  r1, r5 loop2
![Page 81: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/81.jpg)
Pointer Chasing
How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
[ 81 ]
![Page 82: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/82.jpg)
Pointer Chasing
How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load
[ 82 ]
![Page 83: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/83.jpg)
Pointer Chasing
How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load
• Loop induction is not a chain of poisoned loads
[ 83 ]
![Page 84: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/84.jpg)
[ 84 ]
Extended Antidote Vector
Idea: extend poison information with the low bits of the PC
• Poison from the same PC means pointer chasing

One implementation: detect at execution
• Shuts pointer chasing down immediately
• Complicates latency-critical execution structures

A better one: detect at re-dispatch (extend antidotes)
• Learn the identity of the pointer-chasing PC and shut down future instances
(pipeline diagram, as before: Fetch/Rename, Issue Queue, Register File, FUs, I$/D$, Reorder Buffer, Chained Store Buffer, and the Slice Buffer with antidote check)
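The PC-extended check can be sketched as follows; the tag width, names, and miss model are invented for illustration and are not BOLT's actual design:

```c
#include <stdbool.h>
#include <stdint.h>

#define NREGS 32  /* illustrative logical register count */

bool    poison[NREGS];
uint8_t poison_pc[NREGS]; /* low 8 bits of the poisoning load's PC */

/* Model a load at 'pc' from [addr_reg] into dst that misses.
   Returns true if the load's own (static) PC poisoned its address
   register: that is pointer chasing, so stop deferring this PC. */
bool miss_load(uint32_t pc, int addr_reg, int dst) {
    bool chasing = poison[addr_reg] && poison_pc[addr_reg] == (uint8_t)pc;
    poison[dst]    = true;
    poison_pc[dst] = (uint8_t)pc;
    return chasing;
}
```

loop1's load [r1] -> r1 trips the check on its second poisoned instance, while loop2's load [r2] -> r3 sees poison from a different static load and keeps being deferred (benign indirection).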
![Page 85: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/85.jpg)
Pointer Chasing Performance
[ 85 ]
• Speedups: same (good: no harm)
![Page 86: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/86.jpg)
Pointer Chasing Performance
[ 86 ]
• Speedups: same (good: no harm)
• Execution overhead: reduced (mcf: 290% → 44%)
![Page 87: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/87.jpg)
Pointer Chasing Performance
[ 87 ]
• Speedups: same (good: no harm)
• Execution overhead: reduced (mcf: 290% → 44%)
• ED2: overall improvement (mcf basically breaks even now)
![Page 88: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/88.jpg)
BOLT vs. TurboBoost
[ 88 ]
BOLT is able to help performance where TurboBoost cannot
…and more energy efficiently
![Page 89: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/89.jpg)
BOLT vs. TurboBoost
[ 89 ]
BOLT + TurboBoost?
• Synergistic: BOLT "un-memory-bounds" programs
• BOLT + TurboBoost is still an ED2 win!
![Page 90: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/90.jpg)
Partial Summary
Latency tolerance
• Scale the window virtually under long cache misses
• No good implementations + excessive overhead
• Potentially a good complement to TurboBoost

Energy-efficient latency tolerance
• Low-cost implementation: re-use SMT, registers & load/stores
• Low runtime overhead: prune pointer chasing and "joins"
• Actually a good complement to TurboBoost
• Applicable to both in-order and out-of-order cores
[ 90 ]
![Page 91: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/91.jpg)
[ 91 ]
iCFP: In-order Latency Tolerance
BOLT – (out-of-order core) + (in-order core) = ?
(out-of-order pipeline diagram: Fetch/Rename, Issue Queue, Register File, FUs, I$/D$, Reorder Buffer, Chained Store Buffer, and the Slice Buffer with antidote check)
![Page 92: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/92.jpg)
[ 92 ]
iCFP: In-order Latency Tolerance
BOLT – (out-of-order core) + (in-order core) = ?
(diagram: only the Chained Store Buffer, the Slice Buffer with check logic, and the antidote vector remain)
![Page 93: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/93.jpg)
[ 93 ]
iCFP: In-order Latency Tolerance
BOLT – (out-of-order core) + (in-order core) = iCFP [HPCA’09]
• Some details obviously differ due to the in-order pipeline
• Useful for a miss at any cache level (L1, L2, L3)
• Joint work with Santosh Nagarakatte

Other in-order latency-tolerant designs
• Sun's Rock "processor" [Chaudhry'09]
• Simple Latency Tolerant Processor [Nekkalapu'09]
(in-order pipeline diagram: Fetch/I$, two register files, FUs, D$, Chained Store Buffer, and the Slice Buffer with antidote check)
![Page 94: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/94.jpg)
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
Other work and future plans
[ 94 ]
![Page 95: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/95.jpg)
Other Work / Future Directions
Micro-architecture
• Control independence [ISCA'07]
• Plans for more work related to latency tolerance
  • Store latency tolerance
  • Possibly changes the out-of-order "sweet spot"
• In submission / in progress: energy-efficient load/store data path
  • Trident: reduce D$ accesses → improve energy + performance
  • SMT-directory: reduce load-queue coherence searches in SMT
• Future work: register reference counting → register file gating
• Generally interested in performance and energy
[ 95 ]
![Page 96: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/96.jpg)
Other Work / Future Directions
Simulation and workload methodologies
• Multi-programming workload methodology [MobS'09]
• Future plans include adapting these ideas to multi-threaded applications
• Generally interested in research on better simulation

Operating systems and security
• Operating-system-based security project for layered sandboxing
• Provides system calls to restrict the behavior of less-trusted code
• Many future plans on this project, most involving hardware support
• Generally interested in how hardware can improve software
[ 96 ]
![Page 97: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/97.jpg)
[ 97 ]
![Page 98: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/98.jpg)
[ 98 ]
Sun’s Rock
Rock [Chaudhry'09] does in-order latency tolerance
• Slice buffer ("Deferral Queues") divided by multiple checkpoints
• Re-execution limited to the oldest region
• Values from slices are reintegrated into the main register file when the DQs empty
(Rock pipeline diagram: Fetch/I$, two register files, FUs, D$, and a checkpointed Slice Buffer)
![Page 99: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/99.jpg)
Unrolled Loops
What if the compiler unrolled a loop with pointer chasing?
• Still detectable; it just takes one detection per unrolled copy

[ 99 ]

    loop1: load [r1] -> r1
           bz   r1, endMidLoop
           load [r1] -> r1
           add  r2 + 2 -> r2
           bnz  r1, loop1
![Page 100: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/100.jpg)
[ 100 ]
![Page 101: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/101.jpg)
FIESTA
Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?
[ 101 ]
![Page 102: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/102.jpg)
FIESTA
Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
[ 102 ]
![Page 103: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/103.jpg)
Other Work: FIESTA
Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until sum = X million insns
[ 103 ]
![Page 104: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/104.jpg)
Other Work: FIESTA
Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until sum = X million insns
[ 104 ]
Two very different answers for one question…
Why is this? Which is the right answer?
![Page 105: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/105.jpg)
[ 105 ]
Traditional Fixed-Workload
Single-program workload x N
• X insns (e.g., 5M per sample) from each program
• Workload composition is fixed across experiments
+ Direct comparisons between experiments
– Load imbalance: time spent executing only the slowest programs
A: 5M
B: 5M
time
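The load-imbalance cost is simple arithmetic; a sketch with invented IPC values:

```c
/* Illustrative only: IPC values are made up. With fixed 5M-insn samples,
   the faster program finishes early and the tail of the experiment
   measures only the slowest program running alone. */
double finish_cycles(double insns, double ipc) { return insns / ipc; }
/* e.g. A at IPC 2.0 finishes its 5M insns in 2.5M cycles, while B at
   IPC 1.0 needs 5M cycles: the second half of the run is B alone. */
```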
![Page 106: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/106.jpg)
[ 106 ]
Variable-Workload
Multi-program execution defines the workload
• Execute all programs until some condition (e.g., total insns = 10M)
• Normalize to the single-program region defined by this execution
+ Eliminates load imbalance (by construction)
– Naturally oversamples programs which perform better
A: 3M
B: 7M
time
![Page 107: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/107.jpg)
[ 107 ]
Deconstructing Load Imbalance
Fixed-workload runs experience two forms of imbalance
Sample imbalance: different standalone runtimes
• Artifact of finite experiments
• Should be eliminated
• Easy: choose samples with the same standalone runtimes

Schedule imbalance: asymmetric ("unfair") contention
• Characteristic of concurrent execution
• Should be preserved and measured
![Page 108: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/108.jpg)
[ 108 ]
FIESTA
FIESTA: Fixed-Instruction with Equal STAndalone runtimes
• Run single programs for C cycles, record their instruction counts
• Build fixed workloads from these time-balanced samples
+ Eliminates sample imbalance
+ Remaining imbalance is schedule imbalance
A: 5M
B: 7M
time

A: 5M
B: 7M
time
schedule imbalance
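FIESTA's sample construction is equally easy to sketch. IPC values here are invented; the point is that running each program alone for the same cycle budget C yields a sample of roughly IPC x C instructions:

```c
/* Illustrative sketch: equal standalone runtimes by construction.
   Run each program alone for budget_cycles, record the retired
   instruction count; that count becomes the program's sample size. */
long sample_insns(double standalone_ipc, long budget_cycles) {
    return (long)(standalone_ipc * (double)budget_cycles + 0.5);
}
/* With C = 5M cycles, a 1.0-IPC program yields a 5M-insn sample and a
   1.4-IPC program a 7M-insn sample (the A: 5M / B: 7M pairing above);
   both take exactly C cycles standalone, so any leftover co-run
   imbalance is schedule imbalance. */
```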
![Page 109: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/109.jpg)
Other Work: FIESTA
Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until sum = X million insns
FIESTA [MobS’09]: create a-priori balanced samples
Joint work with Neeraj Eswaran
[ 109 ]
![Page 110: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/110.jpg)
Other Work: Paladin
Large software systems: many components, different levels of trust
• No way to restrict the behavior of called modules
[ 110 ]
(diagram: Trusted Code calling a Junior Developer's Module, a Plugin, and a Third-Party Library)
![Page 111: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812f0b550346895d94a84d/html5/thumbnails/111.jpg)
Other Work: Paladin
Large software systems: many components, different levels of trust
• No way to restrict the behavior of called modules
Paladin [in submission]: OS support for layered sandboxing
• New system calls to restrict system-call behavior
• Also ensures restrictions are only removed when the module returns
• Joint work with Jeff Vaughan
[ 111 ]
(diagram: Trusted Code calling a Junior Developer's Module, a Plugin, and a Third-Party Library)