superscalar processor design - ernet

Superscalar Processor Design

Superscalar ArchitectureVirendra Singh

Computer Design and Test Lab.Indian Institute of Science

[email protected]

Advance Computer Architecture

2

Superscalar Pipelines

IF

ID

RD

ALU

MEM

WB


3

Superscalar PipelinesDynamic Pipelines

1. Alleviate the limitations of pipelined implementation

2. Use diversified pipelines

3. Temporal machine parallelism


4

Superscalar Pipelines (Diversified)

IF

ID

RD

WB

ALU Mem1

Mem2

FP1

FP2

FP3

BREX


5


Diversified Pipelines

Each pipeline can be customized for particular instruction type

Each instruction type incurs only necessary latency

Certainly less expensive than identical copies

If all inter-instruction dependencies are resolved then there is no stall after instruction issue

Require special consideration

Number and Mix of functional unitsAdvance Computer Architecture

6

Superscalar Pipelines (Dynamic Pipelines)

Dynamic Pipelines

Buffers are needed

Multi-entry buffers

Every entry is hardwired to one read port and one write port

Complex multi-entry buffers

Minimize stalls


7

Superscalar Pipelines (Interpipeline-Stage Buffers)

Stage i

Buffer (n)

Stage i+1

Stage i

Buffer (1)

Stage i+1

Single entry

Multi-entry BufferAdvance Computer Architecture

8

Superscalar Pipelines (Interpipeline-Stage Buffers)

Stage i

Buffer (>= n)

Stage i+1

In-order

Out-of-order

Trailing instructions are allowed to bypass stalled instruction

Out-of-order execution


9

Superscalar ArchitectureInstruction issue and machine parallelismIn-order issue with in-order completionIn-Order issue with Out-of-Order

completionOut-of-Order issue withOut-of-Order

Completion


10


IF

ID

RD

WB

ALU Mem1

Mem2

FP1

FP2

FP3

BREX


Superscalar Pipeline Stages

Instruction Buffer

Fetch

Dispatch Buffer

Decode

Issuing Buffer

Dispatch

Completion Buffer

Execute

Store Buffer

Complete

Retire

In Program

Order

In Program

Order

Outof

Order

11Advance Computer Architecture

12

Super-scalar Architecture

Instruction buffer

Dispatch buffer

Reservation station

Re-order/Completion buffer

Store buffer

Fetch

Decode

Dispatch

Complete

Retire

Execute

finish

Issue


Limitations of Scalar PipelinesScalar upper bound on throughput

• IPC <= 1 or CPI >= 1• Solution: wide (superscalar) pipeline

Inefficient unified pipeline• Long latency for each instruction• Solution: diversified, specialized pipelines

Rigid pipeline stall policy• One stalled instruction stalls all newer

instructions• Solution: Out-of-order execution, distributed

execution pipelines


14


Instruction buffer

Dispatch buffer

Reservation station


Store buffer

Fetch

Decode

Dispatch

Complete

Retire

Execute

finish

Issue


Impediments to High IPC

I-cache

FETCH

DECODE

COMMIT

D-cache

BranchPredictor Instruction

Buffer

StoreQueue

ReorderBuffer

Integer Floating-point Media Memory

Instruction

RegisterData

MemoryData

Flow

EXECUTE

(ROB)

Flow

Flow


Superscalar Pipeline Design

Instruction Fetching IssuesInstruction Decoding IssuesInstruction Dispatching IssuesInstruction Execution IssuesInstruction Completion & Retiring

Issues


Instruction Flow

Challenges:• Branches: control

dependences• Branch target misalignment• Instruction cache misses

Solutions• Code alignment (static

vs.dynamic)• Prediction/speculation

Instruction Memory

PC

3 instructions fetched

Objective: Fetch multiple instructions per cycle


18

Instruction Fetch

Fetch s instructions from I-cache

I-Cache must be wide enough that each row of the I-Cache array can store s instructions and that an entire row can be accessed

Fetch width = Row width

Assume access latency is 1 cycle


I-Cache Organization

Ro

w D

ec

od

er

•••

CacheLine

•••

TAG

TAG

Address

1 cache line = 1 physical row

•••Cache

Line•••

TAG

TAG

Address

1 cache line = 2 physical rows

TAG

TAGR

ow

De

co

der


20

Instruction Fetch

Impediments:

1. Instruction misalignment

• Solutions

• Static – Compiler

• Needs information about the I-Cache org, indexing, row size

• Dynamic

1. Presence of control flow change


Fetch Alignment


22

Instruction Fetch 2 – way set associative I-Cache with a line size of 16

instructions (64 bytes)

Each row of the I-Cache stores 4 associative sets (two per set) of instructions

Each line of I-cache spans four physical rows

Physical I-cache array is actually composed of 4 independent sub-arrays

One instruction can be accessed form one array


23

Instruction Fetch

A1 B1A5 B5A9 B9

A2 B2A6 B6A10 B10

A3 B3A7 B7A11 B11

A0 B0A4 B4A8 B8

TT T

IFAR

OddDirectory

set

OddDirectory

set

Instruction Buffer Network


POWER Fetch Hardware


Issues in Decoding

Primary Tasks• Identify individual instructions (!)• Determine instruction types• Determine dependences between

instructions

Two important factors• Instruction set architecture• Pipeline width


26

Decode Stage

Decodes instruction types

Detect inter-instruction dependencies

Complexity

ISA

Fetch width

Large number of comparators

Multi-ported RF


Pentium Pro Fetch/Decode


Instruction Dispatch and Issue

Parallel pipeline• Centralized instruction fetch• Centralized instruction decode

Diversified pipeline• Distributed instruction execution


Necessity of Instruction Dispatch


Centralized Reservation Station


Distributed Reservation Station


32

Instruction Dispatching Diversified pipeline

Different type instructions executed by different FU in different pipelines

Distributed control

Operands are fetched from RF

Operands may not be available

Reservation station

Centralized

DistributedAdvance Computer Architecture

Issues in Instruction Execution

Current trends• More parallelism bypassing very

challenging• Deeper pipelines• More diversity

Functional unit types• Integer• Floating point• Load/store most difficult to make parallel• Branch• Specialized units (media)

• Very wide datapaths (256 bits/register or more)Advance Computer Architecture

Bypass Networks

O(n2) interconnect from/to FU inputs and outputs

Associative tag-match to find operandsSolutions (hurt IPC, help cycle time)

• Use RF only (IBM Power4) with no bypass network

• Decompose into clusters (Alpha 21264)


35

Instruction Dispatching


36

Instruction Issue


37

Instruction Issue


Issues in Completion/Retirement

Out-of-order execution• ALU instructions• Load/store instructions

In-order completion/retirement• Precise exceptions• Memory coherence and consistency

Solutions• Reorder buffer• Store buffer• Load queue snooping (later)


39


Instruction buffer

Dispatch buffer

Reservation station


Store buffer

Fetch

Decode

Dispatch

Complete

Retire

Execute

finish

Issue


Program Control Flow

• Implicit Sequential Control Flow– Static Program Representation

• Control Flow Graph (CFG)• Nodes = basic blocks• Edges = Control flow transfers

– Physical Program Layout• Mapping of CFG to linear program memory• Implied sequential control flow

– Dynamic Program Execution• Traversal of the CFG nodes and edges (e.g. loops)• Traversal dictated by branch conditions

– Dynamic Control Flow• Deviates from sequential control flow• Disrupts sequential fetching• Can stall IF stage and reduce I-fetch bandwidth


Program Control Flow

• Dynamic traversal of static CFG

• Mapping CFG to linear memory


Disruption of Sequential Control Flow

Instruction/Decode Buffer

Fetch

Dispatch Buffer

Decode

Reservation

Dispatch

Reorder/

Store Buffer

Complete

Retire

StationsIssue

Execute

FinishCompletion Buffer

Branch


Branch Prediction

• Target address generation → Target Speculation– Access register:

• PC, General purpose register, Link register

– Perform calculation: • +/- offset, autoincrement, autodecrement

• Condition resolution → Condition speculation– Access register:

• Condition code register, General purpose register

– Perform calculation:• Comparison of data register(s)


Target Address Generation

Decode Buffer

Fetch

Dispatch Buffer

Decode

Reservation

Dispatch

Store Buffer

Complete

Retire

StationsIssue

Execute

Finish Completion Buffer

Branch

PC-rel.

Reg.ind.

Reg.ind.withoffset


Condition Resolution

Decode Buffer

Fetch

Dispatch Buffer

Decode

Reservation

Dispatch

Store Buffer

Complete

Retire

StationsIssue

Execute

Finish Completion Buffer

Branch

CCreg.

GPreg.valuecomp.


Branch Instruction Speculation


Branch/Jump Target Prediction

• Branch Target Buffer: small cache in fetch stage– Previously executed branches, address, taken history, target(s)

• Fetch stage compares current FA against BTB– If match, use prediction– If predict taken, use BTB target

• When branch executes, BTB is updated• Optimization:

– Size of BTB: increases hit rate– Prediction algorithm: increase accuracy of prediction

Branch inst. Information Branch targetaddress for predict. address (most recent)

0x0348 0101 (NTNT) 0x0612


Branch Prediction: Condition Speculation

1. Biased Not Taken– Hardware prediction– Does not affect ISA– Not effective for loops

1. Software Prediction– Extra bit in each branch instruction

• Set to 0 for not taken• Set to 1 for taken

– Bit set by compiler or user; can use profiling– Static prediction, same behavior every time

1. Prediction based on branch offset– Positive offset: predict not taken– Negative offset: predict taken

1. Prediction based on dynamic history


UCB Study [Lee and Smith, 1984]• Benchmarks used

– 26 programs (IBM 370, DEC PDP-11, CDC 6400)– 6 workloads (4 IBM, 1 DEC, 1 CDC)– Used trace-driven simulation

• Branch types– Unconditional: always taken or always not taken– Subroutine call: always taken– Loop control: usually taken– Decision: either way, if-then-else– Computed goto: always taken, with changing target– Supervisor call: always taken– Execute: always taken (IBM 370)

IBM1 IBM2 IBM3 IBM4 DEC CDC Avg

T 0.640 0.657 0.704 0.540 0.738 0.778 0.676

NT 0.360 0.343 0.296 0.460 0.262 0.222 0.324

IBM1: compilerIBM2: cobol (business

app)IBM3: scientific

IBM4: supervisor (OS)


Branch Prediction Function

• Prediction function F(X1, X2, … )– X1 – opcode type– X2 – history

• Prediction effectiveness based on opcode only, or history

IBM1 IBM2 IBM3 IBM4 DEC CDC

Opcode only 66 69 71 55 80 78

History 0 64 64 70 54 74 78

History 1 92 95 87 80 97 82

History 2 93 97 91 83 98 91

History 3 94 97 91 84 98 94

History 4 95 97 92 84 98 95

History 5 95 97 92 84 98 96


Example Prediction Algorithm

• Hardware table remembers last 2 branch outcomes– History of past several branches encoded by FSM– Current state used to generate prediction

• Results:

T TT

N

T

N TT

T NT

T NT

N NN

N

T

T

N

T

N

T TT

Branch inst. Information Branch targetaddress for predict. address

Workload IBM1 IBM2 IBM3 IBM4 DEC CDC

Accuracy 93 97 91 83 98 91


Other Prediction Algorithms

• Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

N

TN

N

T

T NT

n ?

T

t

T

N

N

T

T NT

t ?

T

T N

n ?

tt ?

NN

nn


IBM Study [Nair, 1992]• Branch processing on the IBM RS/6000

– Separate branch functional unit– Five different branch types

• b: unconditional branch• bl: branch and link (subroutine calls)• bc: conditional branch• bcr: conditional branch using link register (returns)• bcc: conditional branch using count register

– Overlap of branch instructions with other instructions

• Zero cycle branches

– Two causes for branch stalls• Unresolved conditions• Branches downstream too close to unresolved branches


Number of History Bits Needed

• Branch history table size: Direct-mapped array of 2k entries• Some programs, like gcc, have over 7000 conditional branches• In collisions, multiple branches share the same predictor

– Constructive and destructive interference– Destructive interference

• Marginal gains beyond 1K entries (for these programs)

Prediction Accuracy (Overall CPI Overhead)

Benchmark 3 bit 2 bit 1 bit 0 bit

spice2g6 97.0 (0.009) 97.0 (0.009) 96.2 (0.013) 76.6 (0.031)

doduc 94.2 (0.003) 94.3 (0.003) 90.2 (0.004) 69.2 (0.022)

gcc 89.7 (0.025) 89.1 (0.026) 86.0 (0.033) 50.0 (0.128)

espresso 89.5 (0.045) 89.1 (0.047) 87.2 (0.054) 58.5 (0.176)

li 88.3 (0.042) 86.8 (0.048) 82.5 (0.063) 62.4 (0.142)

eqntott 89.3 (0.028) 87.2 (0.033) 82.9 (0.046) 78.4 (0.049)


Branch Prediction Implementation (PPC 604)

Decode Buffer

Fetch

Dispatch Buffer

Decode

ReservationDispatch

Stations

Issue

Execute

Finish Completion

Branch

to I-cache

FA (fetch address)FABranch

Predictor

Spec. target

Prediction FA-mux

SFX SFX CFX FPU LSBRN

Buffer

BranchPredictorUpdate


BTAC and BHT Design (PPC 604)

Decode Buffer

Dispatch Buffer

Decode

ReservationDispatch

Stations

Issue

Execute

Finish Completion

Branch

FA

Branch TargetAddress Cache

FA-m

ux

Branch HistoryTable (BHT)

BTAC

BHT

SFX SFX CFX FPU LSBRN

Buffer

(BTAC)

I-cache

update

update

FA FA

FAR

+4

BTAC prediction

BHT prediction

BTAC:- 64 entries- fully associative- hit => predict taken

BHT:- 512 entries- direct mapped- 2-bit saturating counter history based prediction- overrides BTAC prediction


BTAC and BHT Design (PPC 604)


Advanced Branch Prediction• Control Flow Speculation

– Branch Speculation– Mis-speculation Recovery

• Branch Direction Prediction– Static Prediction– Dynamic Prediction– Hybrid Prediction

• Branch Target Prediction• High-bandwidth Fetch• High-Frequency Fetch


Dynamic Branch Prediction• Main advantages:

– Learn branch behavior autonomously• No compiler analysis, heuristics, or profiling

– Adapt to changing branch behavior• Program phase changes branch behavior

• First proposed in 1979-1980– US Patent #4,370,711, Branch predictor using

random access memory, James. E. Smith

• Continually refined since then


Smith Predictor Hardware

• Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981

• Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.

Branch Address

Branch Prediction

m

2m k-bit counters

mo st sign ifican t b it

Satur ating Cou nterIn crem ent/Decr emen t

Branch Outcome

Upd ated Cou nter Value


Two-level Branch Prediction

• BHR adds global branch history– Provides more context– Can differentiate multiple instances of the same static branch– Can correlate behavior across multiple static branches

BHR0110

PC = 01011010010101

010110

000000000001000010000011

010100010101010110010111

111110111111

PHT

1 0

1 Branch Prediction


Two-level Prediction: Local History

• Detailed local history can be useful

110

PC = 01011010010101

0101110

0000000000000100000100000011

0101100010110101011100101111

01111100111111

PHT

0 1

0 Branch Prediction

000001010011100101110111

BHT


Local History Predictor Example• Loop closing

branches– Must identify

last instance• Local history

dedicates PHT entry to each instance– ‘0111’ entry

predicts not taken

0000

0001

0010

0011

0100

0101

0110

0111

1000

1001

1010

1011

1100

1101

1110

1111

11

11

11

00

11101110111011101110PHTLoop closing branch’s history


Two-level Taxonomy• Based on indices for branch history and

pattern history– BHR: {G,P,S}: {Global, Per-address, Set}– PHT: {g,p,s}: {Global, Per-address, Set}– 9 combinations: GAg, GAp, GAs, PAg, PAp,

PAs, SAg, SAp and SAs• Tse-Yu Yeh and Yale N. Patt. Two-Level

Adaptive Branch Prediction. International Symposium on Microarchitecture, pages 51-61, November 1991.


Index Sharing in Two-level Predictors

• Use XOR function to achieve better utilization of PHT• Scott McFarling. Combining Branch Predictors. TN-36,

Digital Equipment Corporation Western Research Laboratory, June 1993.

• Used in e.g. IBM Power 4, Alpha 21264

1101

0110

GAp

BHR

PC

1001

1001

1010

BHR

PC

1001

gshare

BHR

PC

1101

0110

1011XOR

BHR

PC

1001

1010

0011XOR

0000000100100011010001010110011110001001101010111100110111101111

0000000100100011010001010110011110001001101010111100110111101111


Branch Speculation

• Leading Speculation– Typically done during the Fetch stage– Based on potential branch instruction(s) in the current fetch

group• Trailing Confirmation

– Typically done during the Branch Execute stage– Based on the next Branch instruction to finish execution

NT T NT T NT TNT T

NT T NT T

NT T (TAG 1)

(TAG 2)

(TAG 3)


Branch Speculation

Leading Speculation1. Tag speculative instructions2. Advance branch and following instructions3. Buffer addresses of speculated branch

instructions Trailing Confirmation

1. When branch resolves, remove/deallocate speculation tag

2. Permit completion of branch and following instructions


Branch Speculation

• Start new correct path– Must remember the alternate (non-predicted) path

• Eliminate incorrect path– Must ensure that the mis-speculated instructions

produce no side effects

NT T NT T NT TNT T

NT T NT T

NT T

(TAG 2)

(TAG 3) (TAG 1)


Mis-speculation Recovery

Start new correct path1. Update PC with computed branch target (if predicted NT)

2. Update PC with sequential instruction address (if predicted T)

3. Can begin speculation again at next branch

Eliminate incorrect path1. Use tag(s) to deallocate ROB entries occupied by

speculative instructions

2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations


Tracking Instructions

• Assign branch tags– Allocated in circular order

– Instruction carries this tag throughout processor

• Track instruction groups– Instructions managed in groups, max. one

branch per group– ROB structured as groups

• Leads to some inefficiency

• Simpler tracking of speculative instructions


Mar 30, 2009 SE-273@SERC 71

Branch Mis-prediction Recovery


72

Thank You


superscalar processor design - ernet

Documents