
Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review

2

Pipelined Design

• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or Throughput = Performance
  – BW = number of tasks / unit time
  – For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases

3

Pipelining Illustrated

[Figure: a block of combinational logic with N gate delays has BW ≈ 1/n; splitting it into two stages of N/2 gate delays gives BW ≈ 2/n; three stages of N/3 gates give BW ≈ 3/n]

4

Performance Model

• Starting from an unpipelined version with propagation delay T and BW = 1/T:

  Perf_pipe = BW_pipe = 1 / (T/k + S)

  where S = latch delay and k = number of stages

[Figure: unpipelined datapath (delay T, one latch of delay S) vs. a k-stage pipeline (k segments of delay T/k, each followed by a latch of delay S)]

5

Hardware Cost Model

• Starting from an unpipelined version with hardware cost G:

  Cost_pipe = G + kL

  where L = latch cost (incl. control) and k = number of stages

[Figure: unpipelined datapath (cost G, one latch of cost L) vs. a k-stage pipeline (k segments of cost G/k, each followed by a latch of cost L)]

6

Cost/Performance Tradeoff

Cost/Performance:

  C/P = (Lk + G) / [1 / (T/k + S)]
      = (Lk + G)(T/k + S)
      = LT + GS + LSk + GT/k

Optimal Cost/Performance: find the minimum of C/P with respect to the choice of k:

  d/dk [(Lk + G)(T/k + S)] = 0 + 0 + LS - GT/k^2 = 0

  k_opt = sqrt(GT / LS)
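As a quick check of the model, here is a minimal sketch (Python; function names are illustrative) that evaluates k_opt and C/P for the two parameter sets plotted on the next slide:

  import math

  def cost_perf(k, G, L, T, S):
      """C/P = (Lk + G)(T/k + S)."""
      return (L * k + G) * (T / k + S)

  def k_opt(G, L, T, S):
      """Optimal pipeline depth: k_opt = sqrt(GT / LS)."""
      return math.sqrt(G * T / (L * S))

  for G, L, T, S in [(175, 41, 400, 22), (175, 21, 400, 11)]:
      k = k_opt(G, L, T, S)
      print(f"G={G}, L={L}, T={T}, S={S}: "
            f"k_opt = {k:.1f}, C/P at k_opt = {cost_perf(k, G, L, T, S):.0f}")

Cheaper latches (smaller L and S) push the optimal depth higher, which matches the two curves on the next slide.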

7

“Optimal” Pipeline Depth: k_opt

[Plot: Cost/Performance Ratio (C/P, ×10^4) vs. pipeline depth k from 0 to 50, for two parameter sets: G=175, L=41, T=400, S=22 and G=175, L=21, T=400, S=11]

8

Cost?

• “Hardware Cost”
  – Transistor/Gate Count
    • Should include the additional logic to control the pipeline
  – Area (related to gate count)
  – Power!
    • More gates → more switching
    • More gates → more leakage
• Many metrics to optimize
• Very difficult to determine what really is “optimal”


9

Pipelining Idealism

• Uniform Suboperations
  – The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations
  – The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of Independent Operations
  – All the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts

Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

10

Instruction Pipeline Design

• Uniform suboperations … NOT!
  – Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
  – Unify instruction types
    • Coalesce instruction types into one “multi-function” pipe
    • Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
  – Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
    • Minimize performance loss


11

The Generic Instruction Cycle

• The “computation” to be pipelined:
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID)
  3. Operand(s) Fetch (OF)
  4. Instruction Execution (EX)
  5. Operand Store (OS), a.k.a. writeback (WB)
  6. Update Program Counter (PC)


12

The Generic Instruction Pipeline

Based on the obvious subcomputations:

  Instruction Fetch   → IF
  Instruction Decode  → ID
  Operand Fetch       → OF/RF
  Instruction Execute → EX
  Operand Store       → OS/WB

13

Balancing Pipeline Stages

T_IF = 6 units
T_ID = 2 units
T_OF = 9 units
T_EX = 5 units
T_OS = 9 units

• Without pipelining: T_cyc = T_IF + T_ID + T_OF + T_EX + T_OS = 31
• Pipelined: T_cyc = max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9

Speedup = 31 / 9 ≈ 3.4

Can we do better in terms of either performance or efficiency?

[Figure: the IF | ID | OF/RF | EX | OS/WB pipeline with stage widths proportional to their latencies]

14

Balancing Pipeline Stages

• Two methods for stage quantization
  – Merging multiple subcomputations into one
  – Subdividing a subcomputation into multiple smaller ones
• Recent/current trends
  – Deeper pipelines (more and more stages)
    • To a certain point: then the cost function takes over
  – Multiple different pipelines/subpipelines
  – Pipelining of memory accesses (tricky)


15

Granularity of Pipeline Stages

Coarser-Grained Machine Cycle: 4 machine cycles / instruction

  T_IF&ID = 8 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units

Finer-Grained Machine Cycle: 11 machine cycles / instruction

  T_cyc = 3 units
  (T_IF, T_ID, T_OF, T_EX, T_OS) = (6, 2, 9, 5, 9)

[Figure: coarse pipeline IF&ID | OF | EX | OS vs. fine pipeline IF IF | ID | OF OF OF | EX EX | OS OS OS]
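The cycle counts above follow from ceiling division of each stage latency by the machine cycle time. A minimal sketch (Python; latencies taken from this slide):

  import math

  def cycles_per_inst(stage_latencies, t_cyc):
      # each stage of latency t occupies ceil(t / t_cyc) machine cycles
      return sum(math.ceil(t / t_cyc) for t in stage_latencies)

  # Coarse: IF and ID merged into one 8-unit stage, clocked at the slowest stage (9)
  print(cycles_per_inst([8, 9, 5, 9], 9), "cycles of 9 units")     # 4
  # Fine: the original five stages, clocked at 3 units
  print(cycles_per_inst([6, 2, 9, 5, 9], 3), "cycles of 3 units")  # 11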

16

Hardware Requirements

• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory accessing ports needed to support all (relevant) stages

[Figure: the coarse and fine pipelines from the previous slide, annotated with the stages that access the register file and memory]

17

Pipeline Examples

MIPS R2000/R3000 (5 stages):
  IF | RD | ALU | MEM | WB

AMDAHL 470V/7 (12 stages):
  PC GEN | Cache Read | Cache Read | Decode | Read REG | Add GEN | Cache Read | Cache Read | EX 1 | EX 2 | Check Result | Write Result

[Figure: both pipelines aligned against the generic IF / ID / OF / EX / OS phases]

18

Instruction Dependencies

• Data Dependence (see the classification sketch below)
  – True Dependence (RAW)
    • An instruction must wait for all of its required input operands
  – Anti-Dependence (WAR)
    • A later write must not clobber a still-pending earlier read
  – Output Dependence (WAW)
    • An earlier write must not clobber an already-finished later write
• Control Dependence (a.k.a. Procedural Dependence)
  – Conditional branches cause uncertainty in instruction sequencing
  – Instructions following a conditional or computed branch depend on the execution of the branch instruction

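A minimal sketch of classifying the register dependence between two instructions from their read/write sets (Python; the sets and the helper name are illustrative, not from the lecture):

  def classify(earlier_reads, earlier_writes, later_reads, later_writes):
      """Return the dependence kinds from an earlier to a later instruction."""
      kinds = []
      if earlier_writes & later_reads:
          kinds.append("RAW (true)")
      if earlier_reads & later_writes:
          kinds.append("WAR (anti)")
      if earlier_writes & later_writes:
          kinds.append("WAW (output)")
      return kinds

  # I1: R1 = R2 + R3 followed by I2: R4 = R1 * R1 -> true dependence on R1
  print(classify({"R2", "R3"}, {"R1"}, {"R1"}, {"R4"}))  # ['RAW (true)']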

19

Example: Quick Sort on MIPS

# for (; (j < high) && (array[j] < array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low

      bge  $10, $9, $36
      mul  $15, $10, 4
      addu $24, $6, $15
      lw   $25, 0($24)
      mul  $13, $8, 4
      addu $14, $6, $13
      lw   $15, 0($14)
      bge  $25, $15, $36

$35:  addu $10, $10, 1
      . . .

$36:  addu $11, $11, -1
      . . .

20

Hardware Dependency Analysis

• Processor must handle
  – Register Data Dependencies: RAW, WAW, WAR
  – Memory Data Dependencies: RAW, WAW, WAR
  – Control Dependencies


21

Terminology

• Pipeline Hazards:
  – Potential violations of program dependencies
  – Must ensure program dependencies are not violated
• Hazard Resolution: stall, flush, or forward
  – Static method: performed at compile time in software
  – Dynamic method: performed at runtime using hardware
• Pipeline Interlock:
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependencies at runtime


22

Pipeline: Steady State

          t0   t1   t2   t3   t4   t5
Instj     IF   ID   RD   ALU  MEM  WB
Instj+1        IF   ID   RD   ALU  MEM
Instj+2             IF   ID   RD   ALU
Instj+3                  IF   ID   RD
Instj+4                       IF   ID

23

Pipeline: Data Hazard

          t0   t1   t2   t3   t4   t5
Instj     IF   ID   RD   ALU  MEM  WB
Instj+1        IF   ID   RD   ALU  MEM
Instj+2             IF   ID   RD   ALU
Instj+3                  IF   ID   RD
Instj+4                       IF   ID

[In the original figure, arrows mark a data dependence from an earlier instruction's result to a later instruction's operand read]

24

Pipeline: Stall on Data Hazard

          t0   t1   t2   t3   t4   t5   ...
Instj     IF   ID   RD   ALU  MEM  WB
Instj+1        IF   ID   RD   ALU  MEM  WB
Instj+2             IF   ID   [stalled in RD]  ALU  MEM  WB
Instj+3                  IF   [stalled in ID]  RD   ALU  MEM  WB
Instj+4                       [stalled in IF]  ID   RD   ALU  MEM

25

Different View

       t0   t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
IF     Ij   Ij+1  Ij+2  Ij+3  Ij+4  Ij+4  Ij+4  Ij+4
ID          Ij    Ij+1  Ij+2  Ij+3  Ij+3  Ij+3  Ij+3  Ij+4
RD                Ij    Ij+1  Ij+2  Ij+2  Ij+2  Ij+2  Ij+3  Ij+4
ALU                     Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3  Ij+4
MEM                           Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3
WB                                  Ij    Ij+1  nop   nop   nop   Ij+2

During t5-t7 the front of the pipeline stalls (IF holds Ij+4, ID holds Ij+3, RD holds Ij+2) while the back end executes nops (bubbles).


26

Pipeline: Forwarding Paths

          t0   t1   t2   t3   t4   t5
Instj     IF   ID   RD   ALU  MEM  WB
Instj+1        IF   ID   RD   ALU  MEM
Instj+2             IF   ID   RD   ALU
Instj+3                  IF   ID   RD
Instj+4                       IF   ID

Many possible forwarding paths between stages.
MEM → ALU: a value loaded in MEM cannot reach an immediately dependent instruction's ALU stage in time, so this case requires stalling even with forwarding paths.

27

ALU Forwarding Paths

A deeper pipeline may require additional forwarding paths.

[Figure: forwarding datapath: the destination (dest) tags of the instructions in the ALU and MEM stages are compared (==) against src1/src2 of the instruction reading the Register File after IF/ID; on a match, the forwarded value is selected instead of the register file output]
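A minimal sketch of the comparator logic in the figure (Python; the function shape and a two-deep forwarding window are assumptions for illustration):

  def select_operand(src, regfile, alu_stage, mem_stage):
      """alu_stage / mem_stage: (dest_reg, value) of the inst in that stage, or None."""
      if alu_stage and alu_stage[0] == src:
          return alu_stage[1]   # forward from the ALU output (youngest write wins)
      if mem_stage and mem_stage[0] == src:
          return mem_stage[1]   # forward from the MEM output
      return regfile[src]       # no tag match: use the register file value

  # r1 is being produced by the inst currently in ALU, so it is forwarded:
  print(select_operand("r1", {"r1": 0}, ("r1", 42), None))  # 42

The ALU stage is checked first because it holds the younger instruction, whose result is the most recent write to src.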

28

Pipeline: Control Hazard

          t0   t1   t2   t3   t4   t5
Insti     IF   ID   RD   ALU  MEM  WB
Insti+1        IF   ID   RD   ALU  MEM
Insti+2             IF   ID   RD   ALU
Insti+3                  IF   ID   RD
Insti+4                       IF   ID

29

Pipeline: Stall on Control Hazard

[Figure: the branch Insti proceeds IF ID RD ALU MEM WB; Insti+1 through Insti+4 are held ("stalled in IF") until the branch resolves, then proceed staggered one cycle apart]

30

Pipeline: Prediction for Control Hazards

[Figure: instructions after the branch are fetched down the predicted path; on a misprediction the wrong-path instructions become nops and the speculative state is cleared, fetch is resteered, and new Insti+2 / Insti+3 / Insti+4 enter the pipeline]

31

Going Beyond Scalar

• A simple pipeline is limited to CPI ≥ 1.0
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  – Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
  – Contrast with Vector, which effectively executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)


32

Architectures for Instruction Parallelism

• Scalar pipeline (baseline)
  – Instruction overlap parallelism = D
  – Operation Latency = 1
  – Peak IPC = 1

[Figure: successive instructions vs. time in cycles (1-12); a D-deep scalar pipeline keeps D different instructions overlapped]

33

Superscalar Machine

• Superscalar (pipelined) Execution
  – Instruction parallelism = D × N
  – Operation Latency = 1
  – Peak IPC = N per cycle

[Figure: successive instructions vs. time in cycles (1-12); an N-wide, D-deep superscalar pipeline keeps D × N different instructions overlapped]

34

Ex. Original Pentium

Pipeline stages:

  Prefetch    (4 × 32-byte buffers)
  Decode1     (decode up to 2 insts)
  Decode2 | Decode2    (read operands, address computation)
  Execute | Execute
  Writeback | Writeback

Asymmetric pipes:
  u-pipe only: shift, rotate, some FP
  v-pipe only: jmp, jcc, call, fxch
  Both: mov, lea, simple ALU, push/pop, test/cmp

35

Pentium Hazards, Stalls

• “Pairing Rules” (when can/can't two insts exec at the same time? see the sketch below)
  – read/flow dependence:
      mov eax, 8
      mov [ebp], eax
  – output dependence:
      mov eax, 8
      mov eax, [ebp]
  – partial register stalls:
      mov al, 1
      mov ah, 0
  – function unit rules
    • some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
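A minimal sketch of a pairing check in the spirit of these rules (Python; encoding an instruction as (reads, writes) register sets is an illustration, not Intel's actual logic, and the partial-register and function-unit rules are omitted):

  def can_pair(inst1, inst2):
      """inst = (reads, writes); inst1 would issue in the u-pipe, inst2 in the v-pipe."""
      reads1, writes1 = inst1
      reads2, writes2 = inst2
      if writes1 & reads2:    # read/flow dependence -> cannot pair
          return False
      if writes1 & writes2:   # output dependence -> cannot pair
          return False
      return True

  # mov eax, 8 ; mov [ebp], eax -> flow dependence on eax, not pairable
  print(can_pair((set(), {"eax"}), ({"eax", "ebp"}, set())))  # False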


36

Limitations of In-Order Pipelines

• CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point
  – i.e., when N approaches the average distance between dependent instructions
  – Forwarding is no longer effective; the machine must stall more often
  – The pipeline may never be full, due to the frequency of dependency stalls


37

N Instruction Limit

Ex.: superscalar degree N = 4. Any dependency among the instructions issued together will cause a stall; a dependent instruction must be at least N = 4 instructions away from its parent.

On average, the parent-child separation is only about 5 instructions! (Franklin and Sohi '92) An average of 5 means there are many cases where the separation is less than 4, and each of those limits parallelism.

Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns.

38

In Search of Parallelism

• “Trivial” parallelism is limited
  – What is trivial parallelism?
    • In-order: sequential instructions that do not have dependencies
    • In all previous examples, all instructions executed either at the same time as or after earlier instructions
  – Previous slides show that superscalar execution quickly hits a ceiling
• So what is “non-trivial” parallelism? …


39

What is Parallelism?

• Work
  – T1: time to complete the computation on a sequential system
• Critical Path
  – T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  – Pavg = T1 / T∞
• For a p-wide system:
  – Tp ≥ max{T1/p, T∞}
  – Pavg >> p ⇒ Tp ≈ T1/p

Example (worked in the sketch below):
  x = a + b; y = b * 2
  z = (x - y) * (x + y)

[Figure: dataflow graph of the example: a + b produces x and b * 2 produces y; x - y and x + y feed the final multiply producing z]
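A worked version of the example, assuming unit latency per operation (Python; the graph encoding and helper name are illustrative):

  import functools

  # operation -> the operations it depends on (the example's dataflow graph)
  deps = {
      "x": [],             # x = a + b
      "y": [],             # y = b * 2
      "x-y": ["x", "y"],
      "x+y": ["x", "y"],
      "z": ["x-y", "x+y"],
  }

  @functools.lru_cache(maxsize=None)
  def depth(op):
      # length of the longest dependence chain ending at op
      return 1 + max((depth(d) for d in deps[op]), default=0)

  T1 = len(deps)                          # total work: 5 operations
  T_inf = max(depth(op) for op in deps)   # critical path: 3 (x -> x-y -> z)
  print(T1, T_inf, T1 / T_inf)            # 5 3 1.666... = Pavg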

40

ILP: Instruction-Level Parallelism

• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = number of instructions / length of the longest dependence path

code1: ILP = 1 (must execute serially); T1 = 3, T∞ = 3
code2: ILP = 3 (can execute at the same time); T1 = 3, T∞ = 1

code1:  r1 ← r2 + 1        code2:  r1 ← r2 + 1
        r3 ← r1 / 17               r3 ← r9 / 17
        r4 ← r0 - r3               r4 ← r0 - r10

41

ILP != IPC

• Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program's dataflow
• IPC is the “real” observed metric: exactly how many instructions are executed per machine cycle, including all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC


42

Scope of ILP Analysis

r1 ← r2 + 1     (ILP = 1)      r11 ← r12 + 1     (ILP = 3)
r3 ← r1 / 17                   r13 ← r19 / 17
r4 ← r0 - r3                   r14 ← r0 - r20

Taken together (6 instructions, longest path = 3): ILP = 2. The ILP you measure depends on the scope over which the analysis is done.

43

DFG Analysis (dependence extraction sketched below)

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

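The register dataflow edges can be extracted by tracking each register's most recent writer. A minimal sketch (Python; the (reads, writes) encoding is illustrative, with stores reading both their data and address registers):

  insts = [
      ("A", {"R2", "R3"}, {"R1"}),
      ("B", {"R5", "R6"}, {"R4"}),
      ("C", {"R1", "R4"}, {"R1"}),
      ("D", {"R1"}, {"R7"}),
      ("E", {"R7"}, set()),
      ("F", {"R7"}, {"R4"}),
      ("G", {"R1"}, {"R1"}),
      ("H", {"R4", "R1"}, set()),
      ("J", {"R1"}, {"R1"}),
      ("K", {"R3", "R1"}, set()),
  ]

  def raw_edges(insts):
      last_writer = {}
      edges = []
      for name, reads, writes in insts:
          for r in sorted(reads):
              if r in last_writer:          # value flows from its most recent producer
                  edges.append((last_writer[r], name))
          for w in writes:
              last_writer[w] = name         # this inst is now the producer of w
      return edges

  print(raw_edges(insts))
  # [('A','C'), ('B','C'), ('C','D'), ('D','E'), ('D','F'),
  #  ('C','G'), ('G','H'), ('F','H'), ('G','J'), ('J','K')]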

44

In-Order Issue, Out-of-Order Completion

Issue = send an instruction to execution

The issue stage needs to check:
  1. Structural Dependence
  2. RAW Hazard
  3. WAW Hazard
  4. WAR Hazard

[Figure: an in-order instruction stream feeds the functional units (INT, Fadd1-Fadd2, Fmul1-Fmul3, Ld/St); execution begins in order, completion is out of order]

45

Example

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

Cycle 1: A B
      2: C
      3: D
      4:
      5:
      6: E F
      7: G H
      8: J K

IPC = 10/8 = 1.25

[Figure: the dataflow graph of A-K alongside the schedule]

46

Example (2)

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]

Cycle 1: A B
      2: C
      3: D
      4:
      5: E F G
      6: H J
      7: K

IPC = 10/7 ≈ 1.43

[Figure: the dataflow graph of the renamed code alongside the schedule]

47

Track with Simple Scoreboarding

• Scoreboard: a bit-array, 1 bit for each GPR (see the sketch below)
  – If the bit is not set: the register has valid data
  – If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
  – If SB[RS] or SB[RT] is set → RAW, stall
  – If SB[RD] is set → WAW, stall
  – Else, dispatch to FU (Fn) and set SB[RD]
• Complete out-of-order
  – Update GPR[RD], clear SB[RD]

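A minimal sketch of this 1-bit scoreboard (Python; the class shape is illustrative):

  class Scoreboard:
      def __init__(self, num_regs=32):
          self.sb = [False] * num_regs     # SB[r] set = result still outstanding

      def try_issue(self, rd, rs, rt):
          """Issue RD <- Fn(RS, RT) in order; returns False to signal a stall."""
          if self.sb[rs] or self.sb[rt]:   # RAW: a source is stale -> stall
              return False
          if self.sb[rd]:                  # WAW: an earlier write is pending -> stall
              return False
          self.sb[rd] = True               # dispatch to the FU; RD is now stale
          return True

      def complete(self, rd):
          # out-of-order completion: update GPR[rd], then mark it valid again
          self.sb[rd] = False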

48

Out-of-Order Issue

[Figure: the in-order instruction stream now passes through extra dispatch buffers (DR) that perform dependency resolution before the FUs (INT, Fadd1-Fadd2, Fmul1-Fmul3, Ld/St); execution proceeds out of program order, with out-of-order completion]

Need an extra stage/buffers for dependency resolution.

49

OOO Scoreboarding

• Similar to in-order scoreboarding
  – Need new tables to track the status of individual instructions and functional units
  – Still enforce dependencies
    • Stall dispatch on WAW
    • Stall issue on RAW
    • Stall completion on WAR
• Limitations of scoreboarding?
  – Hints:
    • No structural hazards
    • Can always write a RAW-free code sequence:
        Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
    • Think about the x86 ISA with only 8 registers

The finite number of registers in any ISA will force you to reuse register names at some point → WAR, WAW stalls.

50

Lessons thus Far

• More out-of-orderness → more ILP exposed
  – But more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)


51

Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]

[Figure: the IBM 360/91 floating-point unit: the Floating Operand Stack (FLOS) and decoder feed an Adder and a Multiply/Divide unit, each fronted by reservation stations holding (Sink Tag, Sink) and (Source Tag, Source) entries plus control; operands come from the Floating Point Registers (FLR 0/2/4/8, with busy bits and tags) and the Floating Point Buffers (FLB 1-6, fed by the storage bus/instruction unit); results broadcast on the Common Data Bus (CDB) to the FLR, the reservation stations, and the Store Data Buffers (SDB 1-3)]

52

FYI: Historical Note

• Tomasulo's algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
  – The ideas got buried due to internal politics, changing project goals, etc.
  – But it's still the first (as far as I know)


53

Modern Enhancements to Tomasulo’s Algorithm

                    Tomasulo                  Modern
Machine Width       Peak IPC = 1              Peak IPC = 6+
Structural Deps     2 FP FUs, single CDB      6-10+ FUs, many forwarding buses
Anti-Deps           Operand copying           Renamed registers
Output-Deps         RS Tag                    Renamed registers
True Deps           Tag-based forwarding      Tag-based forwarding
Exceptions          Imprecise                 Precise (requires ROB)
