A Position-Insensitive Finished Store Buffer

Erika Gunadi and Mikko H. Lipasti

Department of Electrical and Computer Engineering

University of Wisconsin-Madison

http://www.ece.wisc.edu/~pharm

Motivation

- As microprocessors get wider and deeper:
  - More stores are in flight
  - A larger store queue (SQ) is needed
  - SQ access time and power consumption increase
- SQ access time must be <= D$ access time
  - Avoids replay in case of store-to-load forwarding

A Brief Store Queue Overview

- Serves 2 main purposes:
  - To maintain the order of in-flight stores
  - To forward store data to later loads
- Commonly designed as a circular buffer
  - Allocate entry on dispatch
  - Deallocate entry on retirement
- Equipped with forwarding logic
  - CAM structure for address match
  - Select logic to pick the youngest older matching store

Store to Load Forwarding

- Each load needs to search the store queue for any matching older stores
- Forwarding logic consists of 3 components:
  - Store Address CAM
  - Select Logic
  - Store Data RAM

[Figure: block diagram of the Store Address CAM, Select Logic, and Store Data RAM]
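
The conventional design described on the last two slides can be sketched in software as follows. This is a minimal behavioral model, not the authors' hardware: the class and method names are hypothetical, and the sequential loop in `forward` stands in for the parallel CAM match plus select logic. Unbounded sequence numbers are an assumption used here to keep age comparison simple.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    addr: Optional[int] = None  # filled in when the store executes
    data: Optional[int] = None


class CircularStoreQueue:
    """Toy model of a conventional store queue (hypothetical names)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: List[StoreEntry] = [StoreEntry() for _ in range(size)]
        # Unbounded sequence numbers; the physical index is seq % size.
        self.head_seq = 0  # oldest in-flight store
        self.tail_seq = 0  # sequence number the next store will get

    def dispatch(self) -> Optional[int]:
        """Allocate the tail entry at dispatch; None means the queue is full."""
        if self.tail_seq - self.head_seq == self.size:
            return None
        seq = self.tail_seq
        self.entries[seq % self.size] = StoreEntry()
        self.tail_seq += 1
        return seq

    def execute(self, seq: int, addr: int, data: int) -> None:
        """Record address and data once the store has executed."""
        self.entries[seq % self.size] = StoreEntry(addr, data)

    def retire(self) -> None:
        """Deallocate the head entry at retirement."""
        if self.head_seq < self.tail_seq:
            self.entries[self.head_seq % self.size] = StoreEntry()
            self.head_seq += 1

    def forward(self, load_addr: int, load_seq: int) -> Optional[int]:
        """Return data from the youngest older matching store, if any.

        load_seq is tail_seq as sampled when the load was dispatched, so
        stores with seq < load_seq are older than the load.
        """
        # Walk from the youngest older store toward the head. Hardware does
        # the address match in parallel (CAM) and picks with select logic.
        for seq in range(load_seq - 1, self.head_seq - 1, -1):
            entry = self.entries[seq % self.size]
            if entry.addr == load_addr:
                return entry.data
        return None
```

Because entries sit in age order, the position of a match is enough to decide which store is the youngest older one; this is the positional property that the select logic relies on.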

SQ Access Latency

- Major components of latency: CAM and Select
- CAM is scalable, Select is not

[Figure: access latency (ns, 0 to 5) vs. number of store queue entries (24, 48, 96, 192), split into CAM, Select, and RAM latency]

SQ Energy per Access

- Major component of energy: CAM

[Figure: energy per access (pJ, 0 to 250) vs. number of store queue entries (24, 48, 96, 192), split into CAM, Select, and RAM energy]

Outline

- Motivation and Background
- Finished Store Buffer (FSB)
  - Initial Study
  - Details of Design
- Methodology
- Results
- Conclusion

SQ Occupancy Study

- Most of the time, at most 50% of the stores are finished and waiting to retire
- The number of waiting-to-retire stores does not scale linearly with the size of the OoO window
- FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024, respectively

[Figure: two panels, "24-Entry Store Queue" and "192-Entry Store Queue"; percentage of time (0 to 100) vs. number of finished store queue entries, for gcc, twolf, vortex, and avg]

Finished Store Buffer

- The forwarding logic only cares about waiting-to-retire stores
  - As shown, these are fewer than 50% of in-flight stores
- The ROB can be used to track store order
- Finished Store Buffer
  - Much smaller than a conventional store queue
  - Does not maintain positional store ordering

FSB Diagram

- Allocate FSB entry at schedule
- Deallocate FSB entry at retirement
- FSB is maintained using a free list
- A store is issued only if there is an available entry

[Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, with the spans covered by the FSB and by the conventional SQ]
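
The allocation policy in the bullets above can be sketched as a free-list-managed buffer. This is a minimal sketch with hypothetical names; it only models bookkeeping (allocate at schedule, free at retirement, stall the store when no entry is available) and says nothing about timing.

```python
from typing import List, Optional


class FinishedStoreBuffer:
    """Toy free-list-managed buffer (hypothetical names)."""

    def __init__(self, size: int) -> None:
        self.addr: List[Optional[int]] = [None] * size
        self.data: List[Optional[int]] = [None] * size
        self.free_list: List[int] = list(range(size))

    def can_issue_store(self) -> bool:
        """A store may be scheduled only if some entry is free."""
        return bool(self.free_list)

    def allocate(self, addr: int, data: int) -> Optional[int]:
        """Take any free entry at schedule time; None means the store waits."""
        if not self.free_list:
            return None
        idx = self.free_list.pop()
        self.addr[idx] = addr
        self.data[idx] = data
        return idx  # the caller keeps this index to free the entry later

    def retire(self, idx: int) -> None:
        """Return the entry to the free list at retirement."""
        self.addr[idx] = None
        self.data[idx] = None
        self.free_list.append(idx)
```

Since any free entry may be handed out, an entry's index carries no age information, unlike a slot in a circular queue.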

Forwarding Logic

- Load checks the FSB for a matching store
- FSB position does not reflect relative age
- Non-positional select logic
  - Same problem as in a non-compacting scheduler
  - Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007]
  - A solution similar to that of Buyuktosunoglu is used, since it requires the fewest bits

Youngest Select Logic

- 4-entry FSB, 3-bit color (111: youngest, 000: oldest)
- Modification
  - Add one more bit and a simple reverse logic to handle wrap-around
  - Restructure the algorithm hierarchically, so that checking happens in parallel

[Figure: hierarchical youngest-select circuit built from 4-input blocks over the color bit vectors A2[3:0], A1[3:0], and A0[3:0], producing a one-hot select signal S[3:0]; the worked example has stores to address A with colors 000, 001, 100, and 101 and a load to address A with color 101]
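
One plausible software reading of color-based youngest selection is sketched below. The slide does not spell out the circuit, so several things here are assumptions: colors come from a wrapping counter that advances per store, a load carries the color the next store would receive, fewer than 2**3 colors are in flight at once, and the "reverse logic" is modeled by flipping the comparison when the wrap bits differ. The function names are hypothetical. The final loop narrows the candidates one bit at a time from the most significant bit, which mimics a bit-serial maximum search.

```python
COLOR_BITS = 3  # color width from the slide; one wrap bit is stored on top
VALUE_MASK = (1 << COLOR_BITS) - 1


def split(color):
    """Split a stored color into (wrap bit, 3-bit value)."""
    return (color >> COLOR_BITS) & 1, color & VALUE_MASK


def is_older(store_color, load_color):
    """True if the store was colored before the load."""
    s_wrap, s_val = split(store_color)
    l_wrap, l_val = split(load_color)
    if s_wrap == l_wrap:
        return s_val < l_val
    return s_val > l_val  # reversed comparison across a wrap


def select_youngest_older(entries, load_addr, load_color):
    """Return a one-hot list marking the youngest older matching store.

    entries holds (addr, color) tuples, or None for a free slot, in
    arbitrary (non-positional) order. All False means no forwarding.
    """
    l_wrap, _ = split(load_color)
    cand, keys = [], []
    for e in entries:
        ok = e is not None and e[0] == load_addr and is_older(e[1], load_color)
        cand.append(ok)
        if ok:
            wrap, val = split(e[1])
            # Stores sharing the load's wrap bit are the younger group,
            # so they get a 1 in the top key bit.
            keys.append((int(wrap == l_wrap) << COLOR_BITS) | val)
        else:
            keys.append(0)
    # Bit-serial maximum: keep only candidates with a 1 in the current
    # bit whenever at least one candidate has it.
    for bit in range(COLOR_BITS, -1, -1):
        has_one = [c and bool((k >> bit) & 1) for c, k in zip(cand, keys)]
        if any(has_one):
            cand = has_one
    return cand
```

With stores to one address colored 000, 001, 100, and 101 and a load colored 101, this sketch selects the store colored 100, since under the stated convention a store with the load's own color is not older than the load.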

FSB Corner Cases

- Deadlock avoidance
  - Deadlock happens when the store to issue is the oldest in the window and the FSB is full
  - Reserves an entry in the FSB for the oldest store
- In-order retirement
  - Keeps the FSB index in the ROB entry and uses it to index into the FSB at retire
- Branch misprediction
  - Assigns a store color to each branch
  - Uses it to determine which FSB entries to invalidate
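
The three corner cases can be illustrated with a small bookkeeping model. This is a sketch under stated assumptions, not the authors' implementation: names are hypothetical, colors are plain non-wrapping integers for simplicity, and a branch is assumed to carry the color that the next store after it would receive, so entries with a color greater than or equal to the branch color are on the wrong path.

```python
from typing import Dict, List, Optional


class ReservingFSB:
    """Toy FSB that holds back entries for the oldest store (hypothetical)."""

    def __init__(self, size: int, reserved: int = 1) -> None:
        self.reserved = reserved
        self.free_list: List[int] = list(range(size))
        self.color: Dict[int, int] = {}  # valid entries: index -> store color

    def allocate(self, color: int, is_oldest_store: bool) -> Optional[int]:
        """Ordinary stores must leave `reserved` entries free."""
        needed = 1 if is_oldest_store else self.reserved + 1
        if len(self.free_list) < needed:
            return None  # the store has to wait
        idx = self.free_list.pop()
        self.color[idx] = color
        return idx  # assumed to be kept with the store's ROB entry

    def retire(self, idx: int) -> None:
        """In-order retirement frees the entry by its saved index."""
        del self.color[idx]
        self.free_list.append(idx)

    def squash(self, branch_color: int) -> List[int]:
        """Invalidate every entry younger than the mispredicted branch."""
        squashed = [i for i, c in self.color.items() if c >= branch_color]
        for i in squashed:
            self.retire(i)
        return squashed
```

Holding back an entry means that younger finished stores can never fill the buffer completely, so the oldest store can always issue and eventually retire.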

Methodology

- SimpleScalar / Alpha 3.0 tool set
- Machine configuration
  - 12-stage pipeline, 4-wide machine
  - 128 ROB, 96 PRF
  - 32 LQ, 24 SQ, 32 scheduler
  - 2 integer ALUs, 1 mult/div, 1 memory port
  - I-Cache: 64KB, DM, 64B, 2-cycle
  - D-Cache: 64KB, 4-way, 64B, 3-cycle
  - L2: 2MB, 8-way, 128B, 8-cycle
  - Memory: 150-cycle

Modeling

- To estimate timing and power for the select logic
  - Implemented in Verilog
  - Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11 micron CMOS standard cell library
- To estimate timing and power for RAM and CAM structures -> CACTI

Access Latency Comparison

- Due to fewer entries, select logic for the FSB is faster
- CAM latency is similar

[Figure: access latency (ns, 0 to 5), split into CAM, Select, and RAM, for FSB-12 vs. SQ-24 (128-ROB), FSB-20 vs. SQ-48 (256-ROB), FSB-32 vs. SQ-96 (512-ROB), and FSB-52 vs. SQ-192 (1024-ROB)]

Energy per Access Comparison

- Fewer entries -> less CAM power
- Subarrays do not reduce energy, only latency

[Figure: energy per access (pJ, 0 to 250), split into CAM, Select, and RAM, for FSB-12 vs. SQ-24 (128-ROB), FSB-20 vs. SQ-48 (256-ROB), FSB-32 vs. SQ-96 (512-ROB), and FSB-52 vs. SQ-192 (1024-ROB)]

IPC Comparison (SPEC INT)

- FSB: 12, 20, 32, and 52 entries for the different window sizes
- FSB-min: the most aggressive limit
  - To avoid stalls, only needs 20% * machine width * issue-to-retire stages
  - 5, 10, 20, and 40 entries for the different window sizes
- Both FSB and FSB-min show less than 1% average slowdown
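
The FSB-min sizing rule is simple arithmetic and can be written as a one-line helper. The function name is hypothetical, and the inputs in the usage note are illustrative only: the slides give the 20% factor and the 4-wide baseline, but not the issue-to-retire stage count or the machine width used for each window size. Integer arithmetic is used so that the ceiling is exact.

```python
def fsb_min_entries(machine_width: int, issue_to_retire_stages: int,
                    store_percent: int = 20) -> int:
    """Entries needed: store_percent% * width * stages, rounded up."""
    product = store_percent * machine_width * issue_to_retire_stages
    return -(-product // 100)  # integer ceiling division
```

For example, a 4-wide machine with a hypothetical 6 issue-to-retire stages gives `fsb_min_entries(4, 6)`, which evaluates 20% of 24 and rounds 4.8 up to 5 entries.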

[Figure: normalized IPC (0.8 to 1.05) of SQIP, FSB, and FSB-min for the 128-ROB, 256-ROB, 512-ROB, and 1024-ROB configurations]

IPC Comparison (SPEC FP)

- Sixtrack with a 1024-entry ROB experiences a 5% slowdown
  - Caused by retirement stalls of unfinished stores
  - Slowdown is less than 1% with 2 reservation slots
- In some cases, FSB slightly outperforms the baseline IPC
  - Happens when the store queue size limits instruction dispatch in the baseline

[Figure: normalized IPC (0.8 to 1.05) of SQIP, FSB, and FSB-min for the 128-ROB, 256-ROB, 512-ROB, and 1024-ROB configurations]

Prior Work

- SQIP [Sha, 2005]
  - Removes the associative search of the SQ
  - Loads use store sets to predict the index of a forwarding SQ entry
  - Misprediction is detected by pre-commit re-execution and results in a pipeline flush
- ULB-LSQ [Sethumadhavan, 2007]
  - Unordered SQ, allocated at issue time
  - Similar to our approach
  - Differs in forwarding policy and overflow handling

Prior Work

- [Franklin, 1996]: ARB in Multiscalar
- [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (Bloom filter and store set) to reduce store queue access
- [Baugh, 2004]: Decomposed store queue functionality; only stores in a forwarding group need to be put into the forwarding buffer
- [Torres, 2005]: 2-level SQ; predicted forwarding stores in L1, validation is done in L2
- [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ; validation is done using load re-execution
- [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from the SQ
- [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminating the whole SQ; load re-execution for validation
- [Sethumadhavan, 2007]: ULB-LSQ, an unordered store queue that is allocated at issue time

Conclusion

- FSB, an alternative way to build the SQ
- Only contains finished stores
  - Much smaller
  - More scalable
- Minimal IPC impact, < 1%
- Lower power
- Possible higher frequency
- FSB-min, a more aggressive approach
  - Also has minimal IPC impact
- Future work
  - Load Queue
  - Better deadlock handling

Thank you

Questions?
