Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures

Vladimir Uzelac and Aleksandar Milenković

LaCASA Laboratory, Electrical and Computer Engineering Department

The University of Alabama in Huntsville

{uzelacv | milenka}@ece.uah.edu

Outline
- Motivation and Goals
- Reverse Engineering Flow
- Predictors Details Deconstruction
  - Target Predictors: Branch Target Buffer, Indirect Branch Target Buffer
  - Outcome Predictors: Loop Predictor, Global/Bimodal Predictors
- Conclusion

Motivation
- If we know the branch predictor organization, we could:
  - Implement predictor-aware compiler optimizations
    - Code alignment to avoid BTB conflicts in critical code sections
    - Code splitting to replace long correlations with shorter ones
    - Camino environment [PLDI '05]
  - Have a "golden standard" for academic research
  - Design tools for rapid BP design space exploration and verification
- But details are rarely publicly disclosed
  - In spite of hints in software optimization manuals
- Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units

Goals
- Microbenchmarks and mechanisms developed to reverse engineer the Pentium M’s branch predictor, including:
  - Target predictor: BTB and IBTB
  - Outcome predictor: Loop predictor, Global outcome predictor, Bimodal predictor
- Branch predictor parameters
  - Organization and size of all branch predictor structures
  - Indexing, allocation, update, and replacement policies
  - Interdependencies between these structures
- Validation of our effort through a functional PIN model


Reverse Engineering Flow
- Goal: determine a specific branch predictor parameter (e.g., BTB size)
- Design benchmark(s) to stress the parameter
  - Influenced by the type of observable events
- Build expectations for the relevant event(s) based on back-of-the-envelope analysis
- Execute benchmarks and collect events (VTune)
- Compare expectations with actual results
- Retire findings or modify the benchmark
- Verify findings using a functional PIN model

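[Figure: flowchart of the reverse engineering flow. Goal (branch parameter) → microbenchmark to stress the parameter → observable events → build expectations → collect events (run in VTune) → compare the number of expected events with the number of collected events. If they are equal: parameter extracted, proceed to PIN verification with the BP functional model. If not: revisit the microbenchmark.]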
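The compare-and-revise step of this flow can be sketched in code. The Python fragment below is a minimal illustration, not the authors' tooling: the function names, the tolerance, the toy expectation model, and the measured counts are all hypothetical. It only shows how candidate parameter values are kept or rejected by comparing expected with collected event counts.

```python
def matching_candidates(candidates, expected_events, measured, tolerance=0.05):
    """Keep candidate parameter values whose expected event counts agree
    with every collected measurement (within a relative tolerance)."""
    kept = []
    for value in candidates:
        agrees = all(
            abs(expected_events(value, config) - count) <= tolerance * max(count, 1)
            for config, count in measured.items()
        )
        if agrees:
            kept.append(value)
    return kept


def toy_expected_mispredictions(btb_entries, num_branches):
    """Toy back-of-the-envelope model (assumption): every branch mispredicts
    once the benchmark uses more branches than the BTB has entries."""
    return 0 if num_branches <= btb_entries else num_branches


# Hypothetical measurements: {branches in benchmark: mispredictions per pass}.
hypothetical_measured = {1024: 0, 2048: 0, 4096: 4096}
result = matching_candidates([512, 1024, 2048, 4096],
                             toy_expected_mispredictions,
                             hypothetical_measured)
print(result)
```

If no candidate survives, the expectation model or the microbenchmark is revisited, which corresponds to the "No" branch of the flowchart.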


Branch Target Buffer (BTB)
Background:
- The BTB is a cache structure
- Instructions are fetched in 16-byte blocks (Intel)
  - Can have multiple branches per line
- The BTB can have multiple hits (same tags)
  - => Offset field in each entry
  - => Offset algorithm selects the target among the several offered
Try to find:
- Number of BTB entries (NBTB)
- Number of sets (NSETS)
- Number of ways (NWAYS)
- Index and Tag bits
- Offset bits and presence of an offset algorithm
- Bogus branches handling
- Replacement policy
[Figure: BTB organization with NSETS sets and NWAY ways, accessed with the IP; entries hold Tag, Target, and Offset fields, plus replacement bits.]

Core BTB Test
- Use B taken branches at distance D from each other
- Code is executed many times to amplify the effects on performance counters
- Control how these branches are presented to the BTB
  - To cope with different allocation policies
  - Here, we execute each branch twice consecutively
- Misprediction rate (MPR) as a function of B and D is sufficient to determine the BTB parameters

[Figure: code layout with B taken branches (Branch 1, Branch 2, …, Branch B) at distance D from each other, and a plot of MPR (0% to 100%) as a function of B and D.]

BTB Capacity Tests
- Try to fill the whole BTB using very small distances between branches
- Example: 4-way BTB with 512 entries, BTB index = IP[10:4]
- NBTB branches can fit for three distances
  - Branches fill sets consecutively
- For larger D, MPR = f(B, D)
  - Branches jump over sets
- For very small D, there are more branches in the line than ways in a set
- Mispredictions exist for any D if B > NBTB
- MPR = f(B, D, BTB parameters) can be mathematically formalized
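The dependence of MPR on B and D can be explored with a small functional model. The sketch below uses the example configuration from this slide (4 ways, 512 entries, index = IP[10:4]) and makes several simplifying assumptions that are not from the slides: true LRU replacement, one access per branch per pass, and every BTB miss counted as a misprediction. The function name `btb_mpr` is hypothetical.

```python
from collections import OrderedDict


def btb_mpr(num_branches, distance, num_sets=128, num_ways=4,
            line_bits=4, passes=4):
    """Misprediction rate of B taken branches placed `distance` bytes apart
    in a set-associative BTB (LRU replacement assumed)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    accesses = 0
    for p in range(passes):
        for i in range(num_branches):
            ip = i * distance
            entries = sets[(ip >> line_bits) % num_sets]
            hit = ip in entries
            if hit:
                entries.move_to_end(ip)          # mark as most recently used
            else:
                if len(entries) == num_ways:
                    entries.popitem(last=False)  # evict least recently used
                entries[ip] = True
            if p > 0:                            # first pass only warms up the BTB
                accesses += 1
                misses += 0 if hit else 1
    return misses / accesses


for d in (2, 4, 8, 16, 32):
    print(d, btb_mpr(512, d))
```

Under these assumptions, 512 branches fit without mispredictions only for D = 4, 8, and 16, which matches the "three distances" statement above; for D = 2 a single line holds more branches than a set has ways, and for D = 32 branches skip every other set.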

[Figure: Branches 1 to 4 occupy the ways of one BTB set; Branch 5 maps to the same set and evicts one of them.]

BTB Set Tests
- Try to fill one BTB set while varying the distance D
- When D > NSET, all branches collide in one set
  - MPR is a function of B only (only 4 branches can fit)
  - Helps find NWAYS and the Index MSB
- When D > NSET, change the distance D’ between the last two branches to find the Index LSB
  - The D’ for which mispredictions disappear determines the Index LSB
- When D exceeds the distance covered by the Tag MSB, false hits occur
  - Only two branches produce mispredictions

[Figure: Branch 1 and Branch 2 placed at distance D = 2^(TAG.MSB + 1) map to the same set (ways WAY 1 to WAY N) with the same tag, causing a false hit. IP fields: Not Used, Tag, Index, Offset.]

BTB Findings
- Number of BTB entries: 2048
- Number of sets: 512
- Number of ways: 4
- Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
- Offset algorithm: when there are multiple hits, selects the target with the lowest offset that is no smaller than the current IP
- Bogus branches handling: evict the whole set
- Replacement policy: tree-based pseudo-LRU
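The field layout and the offset algorithm can be written down directly. The following Python sketch is illustrative only: the function names are hypothetical, and `select_target` assumes that "no smaller than the current IP" is decided by comparing the 4-bit offsets within the fetched 16-byte line.

```python
def btb_fields(ip):
    """Split a branch IP into (index, tag, offset) per the findings above."""
    offset = ip & 0xF            # IP[3:0]
    index = (ip >> 4) & 0x1FF    # IP[12:4], 512 sets
    tag = (ip >> 13) & 0x1FF     # IP[21:13], 9 bits
    return index, tag, offset


def select_target(hits, fetch_ip):
    """hits: (offset, target) pairs of the entries that hit in the set.
    Return the target with the lowest offset not below the fetch IP's
    offset, or None if no such entry exists."""
    current = fetch_ip & 0xF
    eligible = [(off, tgt) for off, tgt in hits if off >= current]
    if not eligible:
        return None
    return min(eligible, key=lambda pair: pair[0])[1]


ip = 0x0040_1234
alias = ip + (1 << 22)           # differs only above the tag bits
print(btb_fields(ip) == btb_fields(alias))
print(select_target([(2, "A"), (9, "B"), (12, "C")], 0x1006))
```

Because bits above IP[21] are not part of the tag, two branches 2^22 bytes apart produce identical fields, which is the false-hit condition used by the set tests.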

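[Figure: branch target buffer (BTB) with 4 ways (Way 0 to Way 3) and 512 sets (0 to 511), accessed with IP[31:4]; Index = IP[12:4], Tag = IP[21:13]. Fields: Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits), PLRU (3 bits). Outputs: BTB hit, BTB target, BTB type.]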

Indirect Branch Target Buffer (IBTB)
Background:
- Target predictor indexed by program-path information
Try to find:
1. Which branch parts affect the PIR during update?
2. How is the PIR updated?
3. Which branch IP bits affect the hash access function?
4. What is the hash access function?
5. What are the Index and Tag fields?
6. What is the IBTB organization?

[Figure: IBTB access path. The PIR and the branch IP are combined by a hash into the Index and Tag used to look up the IBTB (Tag, Target). The PIR is updated by a function F of the retired branch's IP, target address (TA), type, and outcome. Markers 1 to 6 correspond to the questions above.]

Path Information Register: Background
- The PIR is a (shift) register updated with program branches
- Different ways to allocate a newly occurring branch:
  - Shift and Add (add to the lowest PIR bits)
  - Shift and Add with interleave (better indexing)
  - Shift and XOR

[Figure: PIR update variants. Shift and Add: the PIR is shifted and the newest branch is added in the lowest bits. Shift and Add with Interleave: shown with shift count = 2. Shift and XOR: branch bits are XORed into the shifted PIR.]

PIR Organization Test
- The PIR is the same prior to both Target1 and Target2
  - Branches are at a large distance from each other (> 2^q)
- P1.SB1 and P2.SB1 differ in one bit, k = log2(D)
  - If bit k affects the PIR, there are no collisions, and vice versa
- H block: H branches that affect the PIR
  - For large H, P1.SB1 and P2.SB1 are shifted out of the PIR
- Analysis of MPR = f(H, D) gives the following answers:
  - PIR history depth
  - Which branch address/target bits affect the PIR
  - PIR update mechanism details (XOR or Add…)
- P1.SB1 and P2.SB1 are replaced with different types of branches
  - Both address and target bits are tested in this way
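The effect of H on collisions can be illustrated with a small model of the test. The sketch below assumes a 15-bit shift-and-XOR PIR with a shift of two bits, in line with the findings reported later in this deck; the function names and the `filler_bits` value standing in for the H-block branches are hypothetical.

```python
PIR_BITS = 15
PIR_MASK = (1 << PIR_BITS) - 1


def pir_update(pir, branch_bits, shift=2):
    """Shift-and-XOR update: shift the PIR left, then XOR in branch bits."""
    return ((pir << shift) ^ branch_bits) & PIR_MASK


def paths_collide(k, h, filler_bits=0x2A5):
    """Run two paths whose first branch differs only in bit k, followed by
    the same H filler branches; report whether the PIRs end up equal."""
    pir1 = pir_update(0, 0)
    pir2 = pir_update(0, 1 << k)
    for _ in range(h):
        pir1 = pir_update(pir1, filler_bits)
        pir2 = pir_update(pir2, filler_bits)
    return pir1 == pir2


def history_depth(k):
    """Smallest H for which the bit-k difference is shifted out of the PIR."""
    h = 0
    while not paths_collide(k, h):
        h += 1
    return h


print([history_depth(k) for k in (0, 6, 14)])
```

In this model a difference in the lowest bit survives seven intervening branches and disappears after the eighth, so sweeping H and k in the real test reveals both the history depth and which bits enter the PIR.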

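[Figure: PIR organization test layout. Two paths, P1 (P1.SB1, P1.SB2, …, P1.SBN) and P2 (P2.SB1, P2.SB2, …, P2.SBN), together with an H block, lead to the indirect branch iSpy with targets Target1 and Target2, producing PIR 1 and PIR 2. P1.SB1 and P2.SB1 are at distance D = 2^k; a variant uses two spies, iSpy1 and iSpy2, at distance D_IP = 2^l. Other labelled distance: D_q = 2^(q+1). The PIR update function F takes the retired branch's IP, target address (TA), type, and outcome.]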

IBTB Access Hash Function Test
- Find which PIR and branch IP bits are XORed in the IBTB access hash function
  - Previously we found XOR
- Reuse the previous test
  - A difference between P1.SB1 and P2.SB1 at bit k keeps the targets from colliding
- Use two spies at distance D_IP = 2^l
  - If bits l and k are XORed in the hash function, the difference in PIR values is cancelled out

[Figure: IBTB access path, with the PIR and IP hashed into the Index and Tag of the IBTB lookup (Tag, Target).]

IBTB Organization Test
- Employ N indirect branch targets to fill the IBTB in different ways
  - By using N different PIR values
- SB1…SBN create N different PIR values, one for each iSpy target
- SB1…SBN are at distance D = 2^k from each other
- MPR = f(D, N) is sufficient to find the IBTB organization
  - Similar to the BTB tests

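[Figure: IBTB organization test layout. Setup branches P0.SB8 … P0.SB1 precede a Dispatch block that selects one of N paths (P1, P2, …, PN) through SB1, SB2, …, SBN, placed at distance D = 2^k from each other; each path reaches iSpy, which jumps to Target1, Target2, …, TargetN.]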

IBTB Predictor Findings
1. Which branch parts affect the PIR during update?
   - 15 IP bits from a conditional branch IP
   - Combined 15 bits from an indirect branch target and IP
2. How is the PIR updated?
   - Shifted left by two bits prior to the update (XOR)
3. Which branch IP bits affect the hash access function?
   - 15 bits, IP[18:4]
4. What is the hash access function?
   - XOR
5. What are the Index and Tag fields?
   - Index = HASH[13:6], Tag = HASH[14,5:0]
6. What is the IBTB organization?
   - A direct-mapped cache with 256 entries
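These findings translate into a compact functional model. The Python sketch below is illustrative rather than the authors' PIN model: the class and function names are hypothetical, the PIR is passed in as an already-computed 15-bit value, and the order in which hash bit 14 and bits 5:0 are packed into the 7-bit tag is an arbitrary choice.

```python
def ibtb_hash(pir, ip):
    """15-bit hash: PIR XORed with IP[18:4]."""
    return (pir ^ (ip >> 4)) & 0x7FFF


def ibtb_fields(pir, ip):
    """Return (index, tag): Index = HASH[13:6], Tag = HASH[14,5:0]."""
    h = ibtb_hash(pir, ip)
    index = (h >> 6) & 0xFF                        # 8 bits -> 256 entries
    tag = (((h >> 14) & 0x1) << 6) | (h & 0x3F)    # bit 14 joined with bits 5:0
    return index, tag


class IndirectTargetBuffer:
    """Direct-mapped, tagged table of indirect branch targets."""

    def __init__(self, entries=256):
        self.table = [None] * entries              # each slot: (tag, target)

    def predict(self, pir, ip):
        index, tag = ibtb_fields(pir, ip)
        slot = self.table[index]
        return slot[1] if slot and slot[0] == tag else None

    def update(self, pir, ip, target):
        index, tag = ibtb_fields(pir, ip)
        self.table[index] = (tag, target)


pir, ip = 0x1234, 0x0804_8ABC
k = 3
# Flipping PIR bit k and IP bit k+4 leaves the hash unchanged.
print(ibtb_hash(pir, ip) == ibtb_hash(pir ^ (1 << k), ip ^ (1 << (k + 4))))
```

The last line reproduces the idea of the hash function test: when a PIR bit and an IP bit are XORed together in the hash, flipping both cancels out and the two accesses collide in the same entry.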

[Figure: IBTB with 256 entries (0 to 255), each holding a 7-bit tag and a 32-bit target. HASH = PIR[14:0] XOR IP[18:4]; Index = HASH[13:6], Tag = HASH[14,5:0]. On an indirect predictor hit, the IBTB target replaces the BTB target as the predicted target.]

Loop Predictor
What do we know?
- Each entry has two counters
  - Counter MAX_VAL stores the loop branch maximum count value
  - Counter CURR_VAL stores the loop branch current iteration
Assumptions:
- The loop BP is an IP-indexed cache
Try to find:
- Counters’ length
- Size and organization of the loop branch predictor buffer (Loop BPB)
- Allocation policy (when a branch becomes a candidate for a loop branch)
- Training policy: how a new loop branch's MAX_VAL is set
[Figure: loop predictor entry with CURR_VAL (incremented by +1, reset to 0) compared (=) against MAX_VAL to produce the prediction.]

Loop Counters Size Test
- Test: a “spy” loop (LSpy) has loop modulo L
  - Mispredictions occur if L exceeds the range of the MAX_VAL counter
- Results: the maximum predictable L is 64 (6-bit counters)
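The counter-size test can be mimicked with a simplified loop predictor entry. The sketch below is an assumption-laden illustration: it supposes that the counters count consecutive taken outcomes of the loop branch, saturate at 63, and are retrained on every loop exit. The class and function names are hypothetical, and the real allocation and training policies are simplified away.

```python
COUNTER_BITS = 6
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 63


class LoopEntry:
    def __init__(self):
        self.max_val = None             # trained count of taken outcomes
        self.curr_val = 0

    def predict(self):
        """True means 'taken' (stay in the loop), False means loop exit."""
        if self.max_val is None:
            return True
        return self.curr_val != self.max_val

    def update(self, taken):
        if taken:
            self.curr_val = min(self.curr_val + 1, COUNTER_MAX)  # saturate
        else:
            self.max_val = self.curr_val                          # (re)train on exit
            self.curr_val = 0


def loop_mispredictions(trip_count, executions=10, warmup=2):
    """Mispredictions of a loop run `executions` times with L = trip_count."""
    entry = LoopEntry()
    misses = 0
    for run in range(executions):
        for i in range(trip_count):
            taken = i < trip_count - 1  # last iteration leaves the loop
            predicted = entry.predict()
            entry.update(taken)
            if run >= warmup and predicted != taken:
                misses += 1
    return misses


print(loop_mispredictions(64), loop_mispredictions(65))
```

With 6-bit counters the model predicts L = 64 perfectly (63 taken outcomes plus the exit) and starts mispredicting at L = 65, which is the signature the test looks for.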

[Figure: LSpy loop executed L times between Enter and Exit.]

Loop BPB Capacity and Set Tests
- Similar to the BTB Capacity/Set tests
- Employ B loops at distance D from each other
- MPR is a function of B, D, and the Loop BPB parameters, as for the BTB

[Figure: B loops (Loop 1, Loop 2, …, Loop B) placed at distance D from each other, replacing the B branches of the BTB test; each loop increases a counter until COUNTER = COUNTER MAX.]

Loop Predictor Findings
- Counters’ length: 6 bits
- Size and organization of the loop branch predictor buffer
  - Two-way cache with 128 entries
  - Index = IP[9:4], Tag = IP[15:10]
- Allocation policy: a branch is allocated on its first opposite outcome
- Training policy: MAX_VAL is set during the 2nd loop iteration

[Figure: loop branch predictor buffer (Loop BPB) with 2 ways (Way 0, Way 1) and 64 sets; Index = IP[9:4], Tag = IP[15:10]. Fields: Tag (6 bits), MAX_VAL (6 bits), CUR_VAL (6 bits), Pred. (1 bit). Outputs: Hit, Prediction.]

Global and Bimodal Predictor
What do we know?
- All branches are predicted dynamically
  - At least one predictor is not tagged
Assumptions:
- Cascade organization
  - The Bimodal predictor is not tagged
  - The Global predictor can correct the Bimodal predictor
- The Global predictor is path indexed (BHR register)
Try to find:
- Organization of the Global predictor
- Indexing to the Global predictor (BHR and hashing function details)
- Bimodal predictor details
  - Size only (not tagged)
  - Indexing bits (IP indexed)

BHR Organization Test
- Similar to the PIR Organization test
  - iSpy with two targets is replaced with a conditional branch (cSpy) with two outcomes
  - MPR = f(D, H) is sufficient to find the BHR organization
- Results: the BHR is affected in the same way as the PIR
  - The BHR and the PIR are the same register
[Figure: BHR organization test layout. Two paths (P1.SB1, P1.SB2, …, P1.SBN and P2.SB1, P2.SB2, …, P2.SBN), with P1.SB1 and P2.SB1 at distance D = 2^k, and an H block lead to cSpy, whose outcomes are Target1 (nT) and Target2 (T).]

Global Predictor Organization Test
- Similar to the IBTB Organization test
  - N different paths to cSpyN (always not taken)
  - PIR values depend on distance D
- cSpyN is allocated to up to N different entries
- Similar to the IBTB, MPR = f(D, N) is sufficient to determine the predictor organization
- Eliminate correct predictions from the Bimodal predictor:
  - cSpyT is at a large distance from cSpyN, so both target the same Bimodal entry
  - Path occurrence pattern: T*PT, PN1, T*PT, PN2, …, T*PT, PNN, …
- Eliminate correct predictions from the Loop predictor if needed
[Figure: test layout with setup branches P0.SB7 … P0.SB1, an H block, and a Dispatch block selecting among paths PT, PN1, PN2, …, PNN through SBT, SB1, SB2, …, SBN at distance D; the paths end in cSpyT and cSpyN.]

Bimodal Predictor Organization Test
- Reuse the previous test
  - Create contention in the Global predictor
- Change the distance between cSpyT and cSpyN so that the branches are predicted by the Bimodal predictor
  - D_G = 2^k
- No contention in the Bimodal predictor if bit k is used for the Bimodal index
[Figure: test layout with setup branches P0.SB7 … P0.SB1, an H block, and a Dispatch block selecting among paths PT, PN1, PN2, …, PNN through SBT, SB1, SB2, …, SBN; cSpyT and cSpyN are at distance D_G.]

Global and Bimodal Predictor Findings
- Global: 4-way cache structure with 2048 entries
  - Accessed with the hash function: PIR XORed with the conditional branch IP
  - 9 bits used as the index, 6 bits as the tag
- Bimodal: a table with 4096 bimodal counters
  - Indexed with IP[11:0]
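A minimal functional sketch of these two structures follows. It is illustrative only: the names are hypothetical, the initial counter value is an assumption, the hash uses IP[18:4] as labelled in the accompanying figure, and `cascade_predict` encodes the cascade organization that the earlier slide lists as an assumption (the global prediction overrides the bimodal one on a tag hit).

```python
class BimodalTable:
    """Untagged table of 2-bit saturating counters indexed by IP[11:0]."""

    def __init__(self, entries=4096, initial=1):
        self.mask = entries - 1
        self.counters = [initial] * entries   # 0..1: not taken, 2..3: taken

    def predict(self, ip):
        return self.counters[ip & self.mask] >= 2

    def update(self, ip, taken):
        i = ip & self.mask
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


def global_fields(pir, ip):
    """Return (index, tag): Index = HASH[14:6], Tag = HASH[5:0]."""
    h = (pir ^ (ip >> 4)) & 0x7FFF            # PIR XOR IP[18:4]
    return (h >> 6) & 0x1FF, h & 0x3F         # 512 sets, 6-bit tag


def cascade_predict(global_prediction, bimodal_prediction):
    """Use the global prediction on a tag hit (not None), else bimodal."""
    if global_prediction is not None:
        return global_prediction
    return bimodal_prediction


table = BimodalTable()
table.update(0x40, True)
print(table.predict(0x40), table.predict(0x40 + 4096))
```

Because the bimodal table is untagged, two branches 4096 bytes apart share a counter, which is why the organization test varies the distance between cSpyT and cSpyN to see which IP bits take part in the index.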

[Figure: Global predictor with 4 ways (Way 0 to Way 3) and 512 sets (0 to 511). HASH = PIR[14:0] XOR IP[18:4]; Index = HASH[14:6], Tag = HASH[5:0]. Fields: Tag (6 bits), 2-bit bimodal counter. Outputs: Hit, Prediction.]

Limitations and Verification
- Generalization of the reverse engineering flow is difficult
  - Different branch prediction organizations
- Implementation of microbenchmarks is a challenging task
  - Balance between observability of certain parameters and isolation of different parameters that share the same event
- Some knowledge of the targeted predictor is needed
  - E.g., prediction in cache lines (AMD K8)
- Tests must cover a large design space
- Verification
  - Using the PIN model: achieved more than 95% accuracy

Conclusion
- Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures
- Demonstrated on the Pentium M
  - BTB, IBTB, Loop, Global/Bimodal

[Figure: overall organization of the reverse-engineered Pentium M branch predictor unit.
- Branch target buffer (BTB): 4 ways (Way 0 to Way 3), sets 0 to 511; Offset = IP[3:0], Index = IP[12:4], Tag = IP[21:13]; fields: Tag (9 bits), Offset (4 bits), Type (2-3 bits), Target (32 bits), PLRU (3 bits); outputs: BTB hit, BTB target, BTB type.
- Loop branch predictor buffer (LPB): 2 ways (Way 0, Way 1), sets 0 to 63; Index = IP[9:4], Tag = IP[15:10]; fields: Tag (6 bits), Limit (6 bits), Count (6 bits), Prediction (1 bit); outputs: LPB hit, loop outcome prediction.
- Indirect target cache (iBTB): entries 0 to 255; Index = HASH[13:6], Tag = HASH[14,5:0]; fields: Tag (7 bits), Target (32 bit); outputs: iBTB hit, iBTB target.
- Global predictor: 4 ways (Way 0 to Way 3), sets 0 to 511; Index = HASH[14:6], Tag = HASH[5:0]; fields: Tag (6 bits), 2-bit counter (2bC); outputs: global predictor hit, global outcome prediction.
- Bimodal table: entries 0 to 4095 of 2-bit counters (2bC); Index = IP[11:0]; output: bimodal outcome prediction.
- The XOR hash access function (HASH, 15 bits) combines the Path Information Register (PIR) with the current instruction IP address; the loop, global, and bimodal outcome predictions are combined into the final outcome prediction.]
