Lecture 13: Modern Superscalar Pipelines

CprE 581 Computer Systems Architecture, Fall 2012

Readings

Textbook 3.8-3.13

“Complexity effective superscalar processors” (thesis) ch. 1 and 2,

http://jes.ece.wisc.edu/papers/isca97.subba.pdf


Copyright © 2012, Elsevier Inc. All rights reserved.

Multiple Issue and Static Scheduling

To achieve CPI < 1, need to complete multiple instructions per clock

Solutions:

Statically scheduled superscalar processors

VLIW (very long instruction word) processors

Dynamically scheduled superscalar processors


Multiple Issue


Tomasulo Performance

Observed at the EX stage, how many cycles does it take to execute this code?

LW R2,45(R3)

ADD R6,R2,R4

SUB R10,R0,R6

ADD R10,R10,R12

Assume load takes 1 cycle, ALU 1 cycle

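[Figure: Tomasulo datapath: fetch unit and instruction memory (IM), decode, rename, reorder buffer, reservation stations (RS) in front of FU1 and FU2, load and store buffers (L-buf, S-buf), data memory (DM), and register file]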

Tomasulo vs MIPS Pipeline

How many cycles on the 5-stage MIPS pipeline?

Why does the simple pipeline run faster?

[Figure: five-stage MIPS pipeline (IF, ID, EX, MEM, WB) with stall check and data forwarding]

Review Tomasulo Inst Scheduling

Both in RS, no contention on CDB or FU

ADD R2,R2,45 # R2=>tag p, result = A
SUB R6,R2,R4 # R4 is ready, = B

Cycle 1: ADD starts at FU, producing A
Cycle 2: ADD broadcasts p + A; SUB matches on p and accepts A
Cycle 3: SUB starts execution, FU calculates A-B

A is produced at cycle 1, but consumed at cycle 3 -- unavoidable?
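The cycle counts above can be reproduced with a small timing model. This is only an illustrative sketch, not something from the lecture: the names `ex_cycles` and `wake_delay` are hypothetical, every operation is assumed to take one cycle, and CDB and FU contention are ignored. A `wake_delay` of 2 models the sequence execute, broadcast, then consumer executes; a `wake_delay` of 1 is what back-to-back execution of dependent instructions would require.

```python
def ex_cycles(deps, wake_delay):
    """Return the cycle in which each instruction starts executing.

    deps[i] lists the indices of earlier instructions that produce
    operands of instruction i. Instructions without producers start
    in cycle 1. wake_delay is the number of cycles between a producer
    starting execution and its consumer starting execution.
    """
    start = []
    for producers in deps:
        if producers:
            start.append(max(start[p] for p in producers) + wake_delay)
        else:
            start.append(1)
    return start


# ADD R2,R2,45 followed by SUB R6,R2,R4: SUB depends on ADD.
deps = [[], [0]]
baseline = ex_cycles(deps, wake_delay=2)      # tag and data broadcast in cycle 2
back_to_back = ex_cycles(deps, wake_delay=1)  # consumer starts right after producer
print(baseline, back_to_back)                 # [1, 3] [1, 2]
```

For a chain of four dependent single-cycle instructions, the same model gives start cycles 1, 3, 5, 7 with a delay of 2 and 1, 2, 3, 4 with a delay of 1.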

Review Data Forwarding

MIPS pipeline data forwarding: FU/MEM => FU

Why not in Tomasulo?
Cycle 2: forward A from FU output to FU input…

But tag broadcasting has a one-cycle delay!

When is it known that A will be ready?
Cycle 1: A is about to be ready
Cycle 2: A and its tag are broadcast

If the tag is broadcast one cycle earlier …

[Figure: FU with a bypass path from its output to its input; REG/ROB; ROB bypass]

Revised Scheduling

RS1: ADD R6,R2,R4
RS2: SUB R10,R0,R6
RS3: ADD R12,R10,R6

ADD(1) has been ready and selected

1. - ADD(1)’s tag is broadcast, and operands are sent to FU;
   - SUB is woken up and selected;
2. - SUB’s tag is broadcast, operands are sent to FU;
   - forwarding logic replaces the 2nd FU operand with the FU output;
   - ADD(2) is woken up, accepts the FU output, and is selected
3. So on and so forth…

RS can be centralized or distributed

[Figure: RS1 to RS5 feeding the SELECT logic and the FU; the tag is broadcast one cycle earlier]

How to address CDB contention?
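The wakeup and select steps listed above can be sketched as a tiny simulation. It is a simplification under stated assumptions, not the lecture's design: one FU, one selection per cycle, single-cycle operations, and destination register names standing in for tags. The function name `schedule` and its arguments are hypothetical.

```python
def schedule(window, ready_tags):
    """Simulate wakeup/select with one FU and one selection per cycle.

    window: list of (name, dest_tag, source_tags) in program order.
    ready_tags: tags whose values are already available.
    Returns {name: cycle in which the instruction is selected}.
    """
    ready = set(ready_tags)
    waiting = list(window)
    selected_at = {}
    cycle = 0
    while waiting:
        cycle += 1
        # Select: pick the oldest entry whose source tags are all ready.
        pick = next((e for e in waiting if set(e[2]) <= ready), None)
        if pick is None:
            raise ValueError("no waiting entry can become ready")
        name, dest, _ = pick
        waiting.remove(pick)
        selected_at[name] = cycle
        # Wakeup: the tag is broadcast in the selection cycle, so a
        # dependent entry can be selected in the very next cycle.
        ready.add(dest)
    return selected_at


window = [
    ("ADD(1)", "R6", ["R2", "R4"]),
    ("SUB", "R10", ["R0", "R6"]),
    ("ADD(2)", "R12", ["R10", "R6"]),
]
print(schedule(window, ready_tags=["R0", "R2", "R4"]))
```

With the three reservation-station entries from the slide, the dependent instructions are selected in consecutive cycles (1, 2, 3), which is the point of broadcasting the tag one cycle earlier.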

How to Handle Variable Latency?

One method: Use a result shift register to track latency and control the tag/data bus

For an FU of k-cycle latency whose instruction is selected at cycle n:
Tag broadcast: cycle n+k-1
Control data bus: cycle n+k

[Figure: RS1 to RS5 feeding the SELECT logic and an FU of k-cycle latency]
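A result shift register can be sketched as a small array of future bus slots. The class below is an illustrative model only: the names `ResultShiftRegister`, `try_reserve`, and `step` are hypothetical, and a single shared result bus is assumed. Slot d holds the tag whose data will use the bus d cycles from now, so the tag is broadcast when the entry reaches slot 1 (cycle n+k-1) and the data bus is driven when it reaches slot 0 (cycle n+k).

```python
class ResultShiftRegister:
    """Track which tag owns the result bus in each upcoming cycle."""

    def __init__(self, max_latency):
        # max_latency must be at least 1 so that slot 1 exists.
        self.slots = [None] * (max_latency + 1)

    def try_reserve(self, tag, latency):
        """Called when an instruction is selected for a `latency`-cycle FU."""
        if self.slots[latency] is not None:
            return False              # bus already taken at that cycle
        self.slots[latency] = tag
        return True

    def step(self):
        """Return (tag broadcast, tag driving data bus) for this cycle, then shift."""
        events = (self.slots[1], self.slots[0])
        self.slots = self.slots[1:] + [None]
        return events


rsr = ResultShiftRegister(max_latency=4)
n, k = 0, 3
reserved = rsr.try_reserve("p1", latency=k)       # selected at cycle n
conflict = not rsr.try_reserve("p2", latency=k)   # same bus slot, must wait
timeline = {cycle: rsr.step() for cycle in range(n, n + k + 1)}
print(reserved, conflict, timeline)
```

In this run the tag `p1` is broadcast at cycle n+k-1 = 2 and its data owns the bus at cycle n+k = 3; the second instruction with the same latency is refused because its result slot is already reserved.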


Revised Pipeline Stages

[Figure: revised pipeline with Fetch, Dispatch/Rename, Reg/ROB, Wakeup/select (RS), FUs with bypass, D-cache, execute, and commit]

• As efficient as MIPS pipeline (instruction throughput)
• With data forwarding and bypassing

SuperScalar Microarchitecture

[Figure: superscalar microarchitecture block diagram: instruction cache and fetch unit; dispatch unit with 8-entry instruction queue; instruction dispatch buses and instruction operation buses; reservation stations in front of the function units (XSU0, XSU1, MCFSU, LSU, FPU, BPU); GP and FP registers with GP/FP operand buses and GP/FP result buses; data cache; completion unit with reorder buffer; result status buses; branch correction; reorder buffer information]


Dynamic Scheduling, Multiple Issue, and Speculation

Overview of Design


Loop: LD R2,0(R1) ;R2=array element

DADDIU R2,R2,#1 ;increment R2

SD R2,0(R1) ;store result

DADDIU R1,R1,#8 ;increment pointer

BNE R2,R3,Loop ;branch if not last element


Example



Example (No Speculation)



Example


Dispatch Stage - Dependence Check Logic

[Figure: dispatch unit with four instruction slots, each holding the fields op, D, S1, S2, connected to comparators (C)]

#comp = 2(k-1) + 2(k-2) + … + 2 = k(k-1)
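The comparator count can be checked by enumerating the comparisons directly. The sketch below assumes, as the formula does, that each instruction has two source fields (S1, S2) and that each source is compared against the destination (D) of every older instruction in the same dispatch group; the function name `dependence_comparators` is hypothetical.

```python
def dependence_comparators(k):
    """Count comparators needed to check a k-instruction dispatch group."""
    count = 0
    for younger in range(k):
        for older in range(younger):
            # Compare S1 and S2 of `younger` against D of `older`.
            count += 2
    return count


for k in (2, 4, 8):
    print(k, dependence_comparators(k), k * (k - 1))
```

The enumeration and the closed form k(k-1) agree, and both grow quadratically with the dispatch width.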


Dispatch Stage – Rename Table


Rethink RS and ROB design

Data broadcasting to all RS stations:
Broadcasting saves the reg-write to reg-read delay
n child instructions can receive data simultaneously

However,
RS and ROB may store duplicate values
Not all n child instructions can execute on an FU in the next cycle

Rethink ROB Design

Does every ROB entry store register output?

One solution: Split ROB into ROB + Rename Registers

[Figure: results from FU go to the rename registers and then to the architectural registers; the ROB holds no register values]

Used in some PowerPC processors

Alternative: Register Mapping and Issue Queue

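[Figure: RS entry (op, Qj, Qk, Vj, Vk) versus IQ entry; ROB entry (i-type, dest, result, PC, valid); physical registers p1, p2, p3, …, p_n with busy bits; architectural registers become virtual registers; the labels "removed" and "replaced" mark the fields that change]

Physical register: Central collection of all register values, architectural or temporary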

Register Mapping Table

Rename architectural registers to physical registers
NO real architectural registers (now virtual registers)
RS => issue queue
Rename stage: allocate issue queue entry, allocate ROB entry, allocate physical register
What is the tag now?

[Figure: mapping table indexed by ra, rb, rc and producing pa, pb, pc; the free list allocates a new physical register; physical registers p1, p2, p3, …, p_n supply vala and valb]

Mis-speculation Recovery

RS+ROB: no changes to arch. registers, so just clear pipeline and re-fetch
Correctness issue: software does not see incorrect register contents

Recovery for mapping approach: Roll back mapping table to the mis-speculation point

Architectural registers => virtual registers

[Figure: physical registers p1, p2, p3, …, p_n; committed mapping, mapping 1, mapping 2; ROB; mapping table status]

How to implement a mapping table supporting recovery?
Alpha 21264: One mapping table per instruction in issue queue, with selective flushing

“Data in ROB” Design (HP PA8000, Pentium Pro, Core2Duo)

• On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch)
• On completion, write to dest field and broadcast to src fields.
• On issue, read from ROB src fields

[Figure: register file holds only committed state; reorder buffer entries t1, t2, …, tn with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; load unit, FUs, and store unit return <t, result>; commit writes to the register file]

Unified Physical Register File (MIPS R10K, Alpha 21264, Pentium 4)

• One regfile for both committed and speculative values (no data in ROB)
• During decode, instruction result allocated new physical register, source regs translated to physical regs through rename table
• Instruction reads data from regfile at start of execute (not in decode)
• Write-back updates reg. busy bits on instructions in ROB (assoc. search)
• Snapshots of rename table taken at every branch to recover mispredicts
• On exception, renaming undone in reverse order of issue (MIPS R10000)

[Figure: rename table (r1 -> ti, r2 -> tj) with snapshots for mispredict recovery; regfile t1, t2, …, tn; load unit, FUs, and store unit return <t, result>; ROB not shown]
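Snapshot-based recovery of the rename table can be sketched as follows. This is a minimal model under stated assumptions, not the mechanism of any particular processor: the class `RenameTable` and its methods are hypothetical names, a snapshot is a full copy of the mapping, and registers allocated after the checkpoint are simply returned to the free list on recovery. Snapshots belonging to squashed younger branches are not cleaned up here.

```python
class RenameTable:
    """Rename table with per-branch snapshots for mispredict recovery."""

    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)
        self.free_list = list(free_list)
        self.allocated = []      # physical regs handed out, in order
        self.snapshots = {}      # branch id -> (mapping copy, allocation count)

    def rename_dest(self, arch_reg):
        phys = self.free_list.pop(0)
        self.allocated.append(phys)
        self.mapping[arch_reg] = phys
        return phys

    def checkpoint(self, branch_id):
        self.snapshots[branch_id] = (dict(self.mapping), len(self.allocated))

    def recover(self, branch_id):
        saved_mapping, count = self.snapshots.pop(branch_id)
        squashed = self.allocated[count:]   # allocated on the wrong path
        del self.allocated[count:]
        self.mapping = saved_mapping
        self.free_list = squashed + self.free_list


table = RenameTable({"r1": "P8", "r3": "P7"}, ["P0", "P1", "P2"])
table.rename_dest("r1")      # before the branch: r1 -> P0
table.checkpoint("br0")      # snapshot taken at the branch
table.rename_dest("r3")      # wrong-path instruction: r3 -> P1
table.rename_dest("r1")      # wrong-path instruction: r1 -> P2
table.recover("br0")         # branch mispredicted
print(table.mapping, table.free_list)
```

After recovery the mapping is back to its state at the branch (r1 -> P0, r3 -> P7) and the two wrong-path registers are free again.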


Pipeline Design with Physical Regfile

[Figure: in-order front end (PC, Fetch, Decode & Rename, branch prediction) feeding the reorder buffer; out-of-order execute (branch unit, ALU, MEM, store buffer, D$) reading the physical reg. file; in-order commit; branch resolution updates the predictors and sends kill signals to the earlier stages]

Lifetime of Physical Registers

Original              Renamed
ld r1, (r3)           ld P1, (Px)
add r3, r1, #4        add P2, P1, #4
sub r6, r7, r9        sub P3, Py, Pz
add r3, r3, r6        add P4, P2, P3
ld r6, (r1)           ld P5, (P1)
add r6, r6, r3        add P6, P5, P4
st r6, (r1)           st P6, (P1)
ld r6, (r11)          ld P7, (Pw)

When can we reuse a physical register?
When next write of same architectural register commits

• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)
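The renaming shown above can be reproduced with a few lines of code. This is a sketch for illustration: the function `rename` and its tuple format are hypothetical, the initial mappings r3 -> Px, r7 -> Py, r9 -> Pz, r11 -> Pw are read off the renamed column, and the free list is assumed to hand out P1 through P7 in order.

```python
def rename(program, mapping, free_list):
    """Rename (op, dest, sources) tuples; dest is None for stores."""
    mapping = dict(mapping)
    free = list(free_list)
    renamed = []
    for op, dest, sources in program:
        # Sources read the newest mapping before the destination is updated.
        # Immediates such as "#4" are not in the table and pass through.
        phys_sources = [mapping.get(s, s) for s in sources]
        phys_dest = None
        if dest is not None:
            phys_dest = free.pop(0)
            mapping[dest] = phys_dest
        renamed.append((op, phys_dest, phys_sources))
    return renamed


program = [
    ("ld", "r1", ["r3"]),
    ("add", "r3", ["r1", "#4"]),
    ("sub", "r6", ["r7", "r9"]),
    ("add", "r3", ["r3", "r6"]),
    ("ld", "r6", ["r1"]),
    ("add", "r6", ["r6", "r3"]),
    ("st", None, ["r6", "r1"]),
    ("ld", "r6", ["r11"]),
]
initial = {"r3": "Px", "r7": "Py", "r9": "Pz", "r11": "Pw"}
free = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
for entry in rename(program, initial, free):
    print(entry)
```

Each write gets a fresh physical register, so the three writes to r6 land in P3, P5, and P7 and no longer conflict with one another.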

Physical Register Management

[Figure: initial state]
Rename table: R1 -> P8, R3 -> P7, R6 -> P5, R7 -> P6
Physical regs: P5 = <R6>, P6 = <R7>, P7 = <R3>, P8 = <R1>, each marked present (p)
Free list: P0, P1, P3, P2, P4
ROB fields: use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

(LPRd requires third read port on Rename Table for each instruction)

Renaming the five instructions, one ROB entry per step (PRd is taken from the head of the free list; LPRd is the previous mapping of Rd; p marks a source that is already present):

use  op   p1  PR1  p2  PR2  Rd  LPRd  PRd
x    ld   p   P7            r1  P8    P0
x    add      P0            r3  P7    P1
x    sub  p   P6   p   P5   r6  P5    P3
x    add      P1       P3   r3  P1    P2
x    ld       P0            r6  P3    P4

Rename table after the five renames: R1 -> P0, R3 -> P2, R6 -> P4, R7 -> P6

Execute & Commit

ld executes and commits: P0 holds <R1> and is marked present; P8 (its LPRd) returns to the free list
add executes and commits: P1 holds <R3> and is marked present; P7 (its LPRd) returns to the free list
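The bookkeeping in this walk-through (allocate PRd from the free list, remember LPRd in the ROB, free LPRd at commit) can be sketched in code. The class `Renamer` and its methods are hypothetical names used only for illustration; the initial rename table and free list are the ones shown on the slides, and immediates are left out of the source lists.

```python
from collections import deque


class Renamer:
    """Rename table, free list, and ROB entries that record LPRd."""

    def __init__(self, rename_table, free_list):
        self.table = dict(rename_table)
        self.free = deque(free_list)
        self.rob = deque()

    def dispatch(self, op, rd, sources):
        prs = [self.table[s] for s in sources]   # source lookups
        lprd = self.table[rd]                    # third read: previous mapping of Rd
        prd = self.free.popleft()                # new physical destination
        self.table[rd] = prd
        entry = {"op": op, "PRs": prs, "Rd": rd, "LPRd": lprd, "PRd": prd}
        self.rob.append(entry)
        return entry

    def commit(self):
        entry = self.rob.popleft()               # oldest instruction commits
        self.free.append(entry["LPRd"])          # old mapping can be reused now
        return entry


r = Renamer({"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"},
            ["P0", "P1", "P3", "P2", "P4"])
r.dispatch("ld", "r1", ["r3"])
r.dispatch("add", "r3", ["r1"])
r.dispatch("sub", "r6", ["r7", "r6"])
r.dispatch("add", "r3", ["r3", "r6"])
r.dispatch("ld", "r6", ["r1"])
r.commit()   # ld commits, P8 is freed
r.commit()   # add commits, P7 is freed
print(r.table, list(r.free))
```

The free list ends up holding P8 and P7, in that order, matching the two commit steps above.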

Reorder Buffer Holds Active Instruction Window

…
ld r1, (r3)
add r3, r1, r2
sub r6, r7, r9
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)
ld r6, (r1)
…

[Figure: the same instruction sequence, from older instructions (top) to newer instructions (bottom), shown at cycle t and at cycle t + 1 with the Commit, Execute, and Fetch pointers marking the window]

Superscalar Register Renaming

• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers

[Figure: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) send read addresses to the rename table read ports and receive read data PSrc1, PSrc2; the register free list supplies PDest and updates the mapping through the write ports]

Does this work?

Superscalar Register Renaming

[Figure: the same rename datapath, with comparators (=?) added between Inst 1's Dest and Inst 2's Src1 and Src2]

Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup.

MIPS R10K renames 4 serially-RAW-dependent insts/cycle
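The intra-group RAW check can be sketched as follows. This is an illustrative model, not the R10K logic: the function `rename_group` is a hypothetical name, every instruction is assumed to have one destination and two register sources, and all table reads return the mapping from the start of the cycle, which is why the comparators are needed.

```python
def rename_group(group, table, free_list):
    """Rename instructions that enter the rename stage in the same cycle.

    group: list of (op, dest, src1, src2) using architectural registers.
    Returns (renamed instructions, updated table, remaining free list).
    """
    if len(free_list) < len(group):
        raise ValueError("not enough free physical registers")
    old = dict(table)                    # what the read ports return this cycle
    pdests = free_list[:len(group)]
    renamed = []
    for i, (op, dest, src1, src2) in enumerate(group):
        psrcs = []
        for src in (src1, src2):
            psrc = old[src]              # table lookup, done in parallel
            for j in range(i):           # =? comparators against older dests
                if group[j][1] == src:
                    psrc = pdests[j]     # youngest older writer wins
            psrcs.append(psrc)
        renamed.append((op, pdests[i], psrcs[0], psrcs[1]))
    new_table = dict(table)
    for (_, dest, _, _), pdest in zip(group, pdests):
        new_table[dest] = pdest          # later writes override earlier ones
    return renamed, new_table, free_list[len(group):]


table = {"r1": "P1", "r2": "P2", "r3": "P3", "r6": "P6", "r7": "P7"}
group = [("add", "r3", "r1", "r2"), ("sub", "r6", "r3", "r7")]
renamed, new_table, remaining = rename_group(group, table, ["P10", "P11"])
print(renamed)
```

Without the comparators, `sub` would read the stale mapping P3 for r3; with them it picks up P10, the register just allocated to `add`.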


Scheduling with Issue Queue

Pipeline stages: FETCH, RENAME, SCHEDULE, REG, EXE/WB, COMMIT

[Figure: fetch unit and IM, decode, rename, issue queue, phy. regfile, FU1 and FU2, load and store buffers (L-buf, S-buf), DM, and ROB]

Alpha 21264 Pipeline


Examples: Intel P6

Use RS + ROB

• 40-entry ROB
• 20-entry RS
• Register Alias Table

[Figure: P6 pipeline stages Fetch, Decode, Rename, ROB Rd, …]

Example: Intel Pentium 4

Use issue queue + phy. regs

Pipeline stages: Alloc, Rename, Rename, Queue, Schd, Schd, Schd, Disp, Disp, Reg, Reg, Ex

[Figure label: 128 entries]

Generic Superscalar Processor Models

Issue queue based: Fetch, Rename, Wakeup/select, Regfile, FUs with bypass and D-cache (execute), commit

Reservation based: Fetch, Rename, Reg/ROB, Wakeup/select, FUs with bypass and D-cache (execute), commit

Source: Palacharla PhD thesis, 1998

Dynamic Scheduling in P6 (Pentium Pro, II, III)

Q: How to pipeline 1 to 17 byte, 80x86 instructions?
P6 doesn’t pipeline 80x86 instructions
P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS)
Sends micro-operations to reorder buffer & reservation stations
Many instructions translate to 1 to 4 micro-operations
Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

14 clocks in total pipeline (~ 3 state machines)

Pentium III Die Photo

EBL/BBL - Bus logic, Front, Back
MOB - Memory Order Buffer
Packed FPU - MMX Fl. Pt. (SSE)
IEU - Integer Execution Unit
FAU - Fl. Pt. Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Fl. Pt.
RS - Reservation Station
BTB - Branch Target Buffer
IFU - Instruction Fetch Unit (+I$)
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer

1st Pentium III, Katmai: 9.5 M transistors, 12.3 x 10.4 mm in 0.25-micron process with 5 layers of aluminum

AMD Athlon

Similar to P6 microarchitecture (Pentium III), but more resources
Transistors: PIII 24M v. Athlon 37M
Die Size: 106 mm2 v. 117 mm2
Power: 30W v. 76W
Cache: 16K/16K/256K v. 64K/64K/256K
Window size: 40 v. 72 uops
Rename registers: 40 v. 36 int + 36 Fl. Pt.
BTB: 512 x 2 v. 4096 x 2
Pipeline: 10-12 stages v. 9-11 stages
Clock rate: 1.0 GHz v. 1.2 GHz
Memory bandwidth: 1.06 GB/s v. 2.12 GB/s

Pentium 4

Still translate from 80x86 to micro-ops
P4 has better branch predictor, more FUs
Instruction cache holds micro-operations vs. 80x86 instructions
  no decode stages of 80x86 on cache hit
  called “trace cache” (TC)
Faster memory bus: 400 MHz v. 133 MHz
Caches
  Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
  Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
  Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
Clock rates: Pentium III 1 GHz v. Pentium 4 1.5 GHz

Pentium 4 features

Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions
  When used by programs?
  Faster Floating Point: execute 2 64-bit FP per clock
  Memory FU: 1 128-bit load, 1 128-bit store /clock to MMX regs
Supporting RAMBUS DRAM
  Bandwidth faster, latency same as SDRAM
  Cost 2X-3X vs. SDRAM
ALUs operate at 2X clock rate for many ops
Pipeline doesn’t stall at this clock rate: uops replay
Rename registers: 40 vs. 128; Window: 40 v. 126
BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)

Basic Pentium 4 Pipeline

1-2 trace cache next instruction pointer
3-4 fetch uops from Trace Cache
5 drive uops to alloc
6 alloc resources (ROB, reg, …)
7-8 rename arch. reg to 128 physical reg
9 put renamed uops into queue
10-12 write uops into scheduler
13-14 move up to 6 uops to FU
15-16 read registers
17 FU execution
18 compute flags e.g. for branch instructions
19 check branch output with branch prediction
20 drive branch check result to frontend

Stage names: TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Queue, Schd, Schd, Schd, Disp, Disp, Reg, Reg, Ex, Flags, Br Chk, Drive

Block Diagram of Pentium 4 Microarchitecture

BTB = Branch Target Buffer (branch predictor)
I-TLB = Instruction TLB, Trace Cache = Instruction cache
RF = Register File; AGU = Address Generation Unit
"Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.s

From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00

Pentium 4 Die Photo

42M Xtors (PIII: 26M)
217 mm2 (PIII: 106 mm2)
L1 Execution Cache: buffers 12,000 micro-ops
8KB data cache
256KB L2$

ARM Cortex-A8 Pipeline


ARM Cortex-A8

Dual-issue, statically scheduled superscalar
Dynamic branch predictor (512-entry, 2-way set-associative BTB; 4K global history buffer, indexed by BHR+PC)
13-stage pipeline
3 cycles – IF, 4 cycles – ID, 5 cycles int pipeline

Copyright © 2011, Elsevier Inc. All rights Reserved.

The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue stage, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.


The five-stage instruction execution of the A8. Multiply operations are always performed in ALU pipeline 0.

ARM Cortex-A8: Neon pipeline


The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. The eon benchmark deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1 and L2 generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.
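The subtraction described in this caption can be written out as a short calculation. The sketch below is only an illustration: the function name `cpi_breakdown` and all input numbers are made up, not measurements from the text, and it assumes that the ideal (base) CPI is also subtracted so that the remainder is pure pipeline stall time.

```python
def cpi_breakdown(measured_cpi, ideal_cpi, l1_misses_per_instr,
                  l1_miss_penalty, l2_misses_per_instr, l2_miss_penalty):
    """Split a measured CPI into ideal CPI, cache stalls, and pipeline stalls."""
    l1_stalls = l1_misses_per_instr * l1_miss_penalty
    l2_stalls = l2_misses_per_instr * l2_miss_penalty
    # Whatever is left after the ideal CPI and the cache stalls is
    # attributed to pipeline stalls (hazards and minor effects).
    pipeline_stalls = measured_cpi - ideal_cpi - l1_stalls - l2_stalls
    return {
        "ideal_cpi": ideal_cpi,
        "l1_stalls": l1_stalls,
        "l2_stalls": l2_stalls,
        "pipeline_stalls": pipeline_stalls,
    }


# Hypothetical inputs chosen only to show the arithmetic.
breakdown = cpi_breakdown(
    measured_cpi=1.8,
    ideal_cpi=0.5,              # a dual-issue machine can at best reach CPI 0.5
    l1_misses_per_instr=0.03,
    l1_miss_penalty=10,
    l2_misses_per_instr=0.005,
    l2_miss_penalty=50,
)
print(breakdown)
```

With these made-up inputs the cache stalls account for about 0.55 cycles per instruction and the remaining 0.75 or so is attributed to pipeline stalls.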



Intel Core i7 Pipeline

The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.

Intel Core i7

Out-of-order speculative µ-architecture.
IF – multilevel branch prediction. RA stack. Misprediction penalty 15 cycles. Fetches 16 bytes from I-cache.
16B in predecode inst buffer. Macro-op fusion fuses inst such as compare followed by a branch into a single op.

Intel Core i7

Predecode breaks 16B chunk into x86 inst.
µ-op decode – 3 simple decoders, one complex decoder reducing x86 inst into up to 4 µ-ops.
Loop-stream detection (micro-fusion) for loops of less than 28 inst or 256B.
Inst issue.

Intel Core i7

i7 uses 36-entry centralized RS shared by 6 FUs. Up to 6 µ-ops can be selected per cycle.
Results sent to RS and reg retirement unit.


The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away.


The CPI for the 19 SPEC CPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of 0.25.

Workstation Microprocessors 3/2001

Source: Microprocessor Report, www.MPRonline.com

Max issue: 4 instructions (many CPUs)
Max rename registers: 128 (Pentium 4)
Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
Max Window Size (OOO): 126 instructions (Pent. 4)
Max Pipeline: 22/24 stages (Pentium 4)

SPEC 2000 Performance 3/2001

Source: Microprocessor Report, www.MPRonline.com

[Figure: SPEC 2000 performance chart with annotated ratios 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]

Benchmarks: Pentium 4 v. PIII v. Athlon

SPECbase2000
  Int: P4 @ 1.5 GHz: 524, PIII @ 1 GHz: 454, AMD Athlon @ 1.2 GHz: ?
  FP: P4 @ 1.5 GHz: 549, PIII @ 1 GHz: 329, AMD Athlon @ 1.2 GHz: 304
WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
  P4: 164, PIII: 167, AMD Athlon: 180
Quake 3 Arena: P4 172, Athlon 151
SYSmark 2000 composite: P4 209, Athlon 221
Office productivity: P4 197, Athlon 209
S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."

Summary of Dynamic Scheduling

Pipeline stages
  Renaming (in-order)
  Schedule
  Commit (in-order)

Two organizations
  Mapping table + phy reg + issue queue + ROB; REN => SCHD => REG
  Reg alias table + RS + ROB, reg in RS and ROB; REN => REG => SCHD

Scheduling methods
  Tag broadcasting vs. scoreboarding (later)

CDC6600: introduces scoreboarding
Tomasulo: introduces renaming and tag broadcasting
Reorder buffer: provides in-order commit

Real OOO processors are very complicated (like a vehicle) and bring implementation variants, but all are rooted in those basic designs
