A Position-Insensitive Finished Store Buffer
Erika Gunadi and Mikko H. Lipasti
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm

Motivation
- As microprocessors get wider and deeper
  - More in-flight stores
  - Need a larger store queue
  - Increased access time and power consumption
- Needs SQ access time <= D$ access time
  - Avoids replay in case of store-to-load forwarding

A Brief Store Queue Overview
- Serves 2 main purposes:
  - To maintain the order of in-flight stores
  - To forward store data to later loads
- Commonly designed as a circular buffer
  - Allocate entry on dispatch
  - Deallocate entry on retirement
- Equipped with forwarding logic
  - CAM structure for address match
  - Select logic to pick the youngest older matching store

Store to Load Forwarding
- Each load needs to search the store queue for any matching older stores
- Forwarding logic consists of 3 components:
  - Store Address CAM
  - Select Logic
  - Store Data RAM
[Figure: block diagram of the forwarding path: Store Address CAM, Select Logic, Store Data RAM]
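
The two slides above can be sketched in software as follows. This is only a behavioral toy model, not the authors' hardware: the class and method names (`CircularStoreQueue`, `dispatch`, `execute`, `retire`, `forward`) and the `older_stores` argument are assumptions made for illustration. It shows why a conventional store queue can use a positional search: position in the circular buffer encodes age.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    addr: Optional[int] = None   # None until the store's address is computed
    data: Optional[int] = None


class CircularStoreQueue:
    """Toy circular-buffer store queue; all names here are illustrative."""

    def __init__(self, size: int) -> None:
        self.entries: List[StoreEntry] = [StoreEntry() for _ in range(size)]
        self.size = size
        self.head = 0    # index of the oldest in-flight store
        self.count = 0   # number of allocated entries

    def dispatch(self) -> int:
        """Allocate the tail entry at dispatch and return its index."""
        if self.count == self.size:
            raise RuntimeError("store queue full")
        idx = (self.head + self.count) % self.size
        self.entries[idx] = StoreEntry()
        self.count += 1
        return idx

    def execute(self, idx: int, addr: int, data: int) -> None:
        """Fill in address and data once the store has executed."""
        self.entries[idx] = StoreEntry(addr, data)

    def retire(self) -> None:
        """Deallocate the oldest store (in-order retirement)."""
        if self.count == 0:
            raise RuntimeError("store queue empty")
        self.head = (self.head + 1) % self.size
        self.count -= 1

    def forward(self, load_addr: int, older_stores: int) -> Optional[int]:
        """Return data from the youngest older matching store, else None.

        older_stores is the number of entries, counted from the head,
        that precede the load in program order.
        """
        # Walk from the youngest older store back towards the head;
        # the first address match is the one to forward from.
        for offset in reversed(range(older_stores)):
            entry = self.entries[(self.head + offset) % self.size]
            if entry.addr == load_addr:   # address CAM match
                return entry.data         # data RAM read
        return None
```

In hardware the loop is replaced by a parallel CAM match followed by priority select logic, which is the part whose latency the following slides examine.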

SQ Access Latency
- Major components of latency: CAM and Select
- CAM is scalable, Select is not
[Figure: access latency (ns, axis 0 to 5) versus number of store queue entries (24, 48, 96, 192), broken down into CAM, Select, and RAM latency]

SQ Energy per Access
- Major component of energy: CAM
[Figure: energy per access (pJ, axis 0 to 250) versus number of store queue entries (24, 48, 96, 192), broken down into CAM, Select, and RAM energy]

Outline
- Motivation and Background
- Finished Store Buffer (FSB)
  - Initial Study
  - Details of Design
- Methodology
- Results
- Conclusion

SQ Occupancy Study
- Most of the time, at most 50% of stores are finished and waiting to retire
- The number of waiting-to-retire stores does not scale linearly with the size of the OoO window
- FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024
[Figure: two panels, "24-Entry Store Queue" and "192-Entry Store Queue": percentage of time (0 to 100) versus number of finished store queue entries (0 to 70 in the 192-entry panel); series gcc, twolf, vortex, avg]

Finished Store Buffer
- The forwarding logic only cares about waiting-to-retire stores
  - As shown, these are less than 50% of in-flight stores
- The ROB can be used to track store order
- Finished Store Buffer
  - Much smaller than a conventional store queue
  - Does not maintain positional store ordering

FSB Diagram
- Allocate FSB entry at schedule
- Deallocate FSB entry at retirement
- FSB is maintained using a free-list
- A store is issued only if there is an available entry
[Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, with bars marking the stages covered by an FSB entry and by a conventional SQ entry]
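
The allocation policy above can be sketched as a small free-list model. This is a minimal sketch under assumptions: the names `FinishedStoreBuffer`, `allocate`, `write`, and `release` are hypothetical, and the split between allocating at schedule and writing address and data at execute is one plausible reading of the slide rather than a stated detail.

```python
from typing import Dict, List, Optional, Tuple


class FinishedStoreBuffer:
    """Toy free-list FSB; slot position says nothing about store age."""

    def __init__(self, size: int) -> None:
        self.free_list: List[int] = list(range(size))
        self.entries: Dict[int, Optional[Tuple[int, int]]] = {}

    def allocate(self) -> Optional[int]:
        """Called at schedule: return a slot index, or None if the store must wait."""
        if not self.free_list:
            return None            # no available entry, so the store is not issued
        slot = self.free_list.pop()
        self.entries[slot] = None  # address and data are filled in at execute
        return slot

    def write(self, slot: int, addr: int, data: int) -> None:
        """Called at execute: record the store's address and data."""
        self.entries[slot] = (addr, data)

    def release(self, slot: int) -> None:
        """Called at retirement: return the slot to the free list."""
        del self.entries[slot]
        self.free_list.append(slot)
```

Because any free slot may be handed out, two stores in adjacent slots can be arbitrarily far apart in program order, which is why the select logic on the next slides cannot rely on position.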

Forwarding Logic
- Load checks the FSB for a matching store
- FSB position does not reflect relative age
- Non-positional select logic
  - Same problem as in a non-compacting scheduler
  - Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007]
  - A solution similar to that of Buyuktosunoglu is used since it requires the fewest bits

Youngest Select Logic
- 4-entry FSB, 3-bit color (111: youngest, 000: oldest)
- Modification
  - Add one more bit and a simple reverse logic to handle wrap-around
  - Restructure the algorithm hierarchically, so that checking happens in parallel
[Figure: select-logic example for a 4-entry FSB holding st A 000, st A 001, st A 100, and st A 101, searched by ld A 101; signals A2[3:0], A1[3:0], A0[3:0], and S[3:0] narrow the candidates down to a one-hot select signal]
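
A color-based youngest select can be sketched as a bit-serial search over the color bits. The sketch below is an illustration, not the circuit on the slide. It assumes every store has a unique color, that a store counts as older only when its color is strictly smaller than the load's, and it leaves out the extra bit and reverse logic for wrap-around. The function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def youngest_older_match(
    entries: List[Tuple[bool, str, int]],
    load_addr: str,
    load_color: int,
    color_bits: int = 3,
) -> List[int]:
    """Return a one-hot select vector over FSB slots (all zeros if no store forwards).

    entries holds one (valid, addr, color) tuple per slot, in arbitrary slot order.
    """
    # CAM step: valid stores to the same address that are older than the load.
    candidates = [
        bool(valid and addr == load_addr and color < load_color)
        for valid, addr, color in entries
    ]
    # Select step, most significant color bit first: if any surviving
    # candidate has a 1 in this bit, drop the candidates that have a 0.
    for bit in reversed(range(color_bits)):
        has_one = [
            c and ((color >> bit) & 1) == 1
            for c, (_, _, color) in zip(candidates, entries)
        ]
        if any(has_one):
            candidates = has_one
    return [1 if c else 0 for c in candidates]


# Stores and load taken from the figure; slot order is arbitrary.
fsb = [(True, "A", 0b000), (True, "A", 0b001), (True, "A", 0b100), (True, "A", 0b101)]
select = youngest_older_match(fsb, "A", 0b101)
```

Under the strictly-older assumption, `select` picks the slot holding color 100. The loop visits one color bit per step, so the work grows with the number of color bits rather than with the number of slots compared pairwise.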

FSB Corner Cases
- Deadlock avoidance
  - Happens when the store to issue is the oldest in the window and the FSB is full
  - Reserves an entry in the FSB for the oldest store
- In-order retirement
  - Keeps the FSB index in the ROB entry, uses it to index the FSB at retire
- Branch misprediction
  - Assigns a store color to each branch
  - Uses it to determine which FSB entries to invalidate
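
The three corner cases above can be sketched together in one toy model. Everything here is assumed for illustration: the class name `FsbCornerCases`, the `is_oldest` flag, and the convention that a store is younger than a branch when its color is greater than the branch's color. Color wrap-around is ignored.

```python
from typing import Dict, List, Optional


class FsbCornerCases:
    """Toy FSB model that holds slots back for the oldest store in the window."""

    def __init__(self, size: int, reserved: int = 1) -> None:
        self.free: List[int] = list(range(size))
        self.reserved = reserved
        self.color_of: Dict[int, int] = {}   # slot index -> store color

    def allocate(self, color: int, is_oldest: bool) -> Optional[int]:
        """Return a slot index to record in the store's ROB entry, or None to stall."""
        # Ordinary stores may not take the last `reserved` slots, so the
        # oldest store can always issue and eventually retire.
        needed = 1 if is_oldest else self.reserved + 1
        if len(self.free) < needed:
            return None
        slot = self.free.pop()
        self.color_of[slot] = color
        return slot

    def release(self, slot: int) -> None:
        """Free a slot; at retirement the index comes from the ROB entry."""
        del self.color_of[slot]
        self.free.append(slot)

    def squash(self, branch_color: int) -> None:
        """Invalidate every store younger than a mispredicted branch."""
        younger = [s for s, c in self.color_of.items() if c > branch_color]
        for slot in younger:
            self.release(slot)
```

Storing the slot index in the ROB entry means retirement needs no search of the buffer: the retiring store names its own slot.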

Methodology
- SimpleScalar / Alpha 3.0 tool set
- Machine configuration
  - 12-stage pipeline, 4-wide machine
  - 128 ROB, 96 PRF
  - 32 LQ, 24 SQ, 32 scheduler
  - 2 integer ALUs, 1 mult/div, 1 memory port
  - I-Cache: 64KB, DM, 64B, 2-cycle
  - D-Cache: 64KB, 4-way, 64B, 3-cycle
  - L2: 2MB, 8-way, 128B, 8-cycle
  - Memory: 150-cycle
15
Modeling To estimate timing and power for
the select logic Implemented in Verilog Synthesized using Synopsys Design
Compiler and LSI Logic’s gflxp 0.11 micron CMOS standard cell library
To estimate timing and power for RAM and CAM structures -> CACTI

Access Latency Comparison
- Due to fewer entries, select logic for the FSB is faster
- CAM latency is similar
[Figure: access latency (ns, axis 0 to 5), broken down into CAM, Select, and RAM, for FSB-12 and SQ-24 (128-ROB), FSB-20 and SQ-48 (256-ROB), FSB-32 and SQ-96 (512-ROB), FSB-52 and SQ-192 (1024-ROB)]

Energy per Access Comparison
- Fewer entries -> less CAM power
- Subarrays do not reduce energy, only latency
[Figure: energy per access (pJ, axis 0 to 250), broken down into CAM, Select, and RAM, for FSB-12 and SQ-24 (128-ROB), FSB-20 and SQ-48 (256-ROB), FSB-32 and SQ-96 (512-ROB), FSB-52 and SQ-192 (1024-ROB)]

IPC Comparison (SPEC INT)
- FSB: 12, 20, 32, 52 entries for the different window sizes
- FSB-min: the most aggressive limit
  - To avoid stalls, only needs 20% * machine-width * issue-retire stages
  - 5, 10, 20, and 40 entries for the different window sizes
- Both FSB and FSB-min show less than 1% average slowdown
[Figure: normalized IPC (axis 0.8 to 1.05) for SQIP, FSB, and FSBmin at 128-ROB, 256-ROB, 512-ROB, and 1024-ROB]
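
The FSB-min sizing rule above is simple arithmetic, sketched here as a helper. The function name, the rounding up to a whole entry, and the 6-stage issue-to-retire depth in the example are assumptions; the slides give the 20% factor and the 4-wide machine but do not state the depth used for each window size.

```python
def fsb_min_entries(store_percent: int, machine_width: int, issue_to_retire_stages: int) -> int:
    """Whole number of entries covering the stores in flight between issue and retire."""
    in_flight_x100 = store_percent * machine_width * issue_to_retire_stages
    return -(-in_flight_x100 // 100)   # ceiling division in integer arithmetic


# Hypothetical example: 20% stores, 4-wide, 6 stages from issue to retire.
entries = fsb_min_entries(20, 4, 6)    # 4.8 rounded up
```

With these example inputs the result is 5, which coincides with the smallest FSB-min size listed above, although that does not confirm the depth the authors used.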

IPC Comparison (SPEC FP)
- Sixtrack with 1024 ROB experiences 5% slowdown
  - Retirement stall of unfinished stores
  - Slowdown less than 1% with 2 reservation slots
- In some cases, FSB slightly outperforms the baseline IPC
  - Happens when the store queue size limits instruction dispatch in the baseline
[Figure: normalized IPC (axis 0.8 to 1.05) for SQIP, FSB, and FSBmin at 128-ROB, 256-ROB, 512-ROB, and 1024-ROB]

Prior Work
- SQIP [Sha, 2005]
  - Removes the associative search of the SQ
  - Loads use store-set to predict the index of a forwarding SQ entry
  - Misprediction is detected by precommit re-execution, results in pipeline flush
- ULB-LSQ [Sethumadhavan, 2007]
  - Unordered SQ, allocated at issue time
  - Similar to our approach
  - Differs in forwarding policy and overflow handling

Prior Work
- [Franklin, 1996]: ARB in Multiscalar
- [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (bloom filter and store set) to reduce store queue access
- [Baugh, 2004]: Decomposed store queue functionality, only stores in forwarding group need to be put into the forwarding buffer
- [Torres, 2005]: 2-level SQ, predicted forwarding stores in L1, validation is done in L2
- [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ, validation is done using load re-execution
- [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from SQ
- [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminate the whole SQ, load re-execution for validation
- [Sethumadhavan, 2007]: ULB-LSQ, unordered store queue that is allocated at issue time

Conclusion
- FSB, an alternative way to build the SQ
  - Only contains finished stores
    - Much smaller
    - More scalable
  - Minimal IPC impact, < 1%
  - Lower power
  - Possibly higher frequency
- FSB-min, a more aggressive approach
  - Also has minimal IPC impact
- Future work
  - Load Queue
  - Better deadlock handling

Thank you
Questions?