instruction-level parallelism (ilp): speculation, reorder ...midor/ese545/ca_ilp...

1ILP: speculation and multiple-instruction issue

ESE 545 Computer Architecture

Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

Computer Architecture

2

Review from Last Lecture Leverage Implicit Parallelism for Performance:

Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP Dynamic HW exploiting ILP

Works when can’t know dependence at compile time

Can hide L1 cache misses Code for one machine runs well on another

ILP: speculation and multiple-instruction issue

3

Review from Last lecture Reservations stations: renaming to larger set of registers + buffering

source operands Prevents registers as bottleneck Avoids WAR, WAW hazards Allows loop unrolling in HW

Not limited to basic blocks (integer units gets ahead, beyond branches)

Helps cache misses as well Lasting Contributions

Dynamic scheduling Register renaming Load/store disambiguation

360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

…


4

Speculation to Greater ILP Greater ILP: Overcome control dependence

by hardware speculating on outcome of branches and executing program as if guesses were correct Speculation fetch, issue, and execute

instructions as if branch predictions were always correct

Dynamic scheduling only fetches and issues instructions

Essentially a data flow execution model: Operations execute as soon as their operands are available


5

Speculation to Greater ILP 3 components of HW-based speculation:1. Dynamic branch prediction to choose which

instructions to execute 2. Speculation to allow execution of instructions

before control dependences are resolved + ability to undo effects of incorrectly

speculated sequence 3. Dynamic scheduling to deal with scheduling of

different combinations of basic blocks


6

Adding Speculation to Tomasulo Must separate execution from allowing instruction

to finish or “commit” This additional step called instruction commit When an instruction is no longer speculative, allow

it to update the register file or memory Requires additional set of buffers to hold results of

instructions that have finished execution but have not committed

This reorder buffer (ROB) is also used to pass results among instructions that may be speculated


7

Phases of Instruction ExecutionFetch: Instruction bits retrieved from instruction cache.I‐cache

Fetch Buffer

Issue Buffer

Functional Units

ArchitecturalState

Execute: Instructions and operands issued to functional units. When execution completes, all results and exception flags are available.

Decode: Instructions dispatched to appropriate issue buffer

Reorder BufferCommit: Instruction irrevocably updates architectural state (aka “graduation”), or takes precise trap/interrupt.

PC

Commit

Decode/Rename


8

Reorder Buffer (ROB) In Tomasulo’s algorithm, once an instruction writes its

result, any subsequently issued instructions will find result in the register file

With speculation, the register file is not updated until the instruction commits (we know definitively that the instruction should

execute) Thus, the ROB supplies operands in interval between

completion of instruction execution and instruction commit ROB is a source of operands for instructions, just as

reservation stations (RS) provide operands in Tomasulo’s algorithm

ROB extends architectured registers like RSILP: speculation and multiple-instruction issue

9

Separating Completion from Commit Re-order buffer holds register results from

completion until commit Entries allocated in program order during decode Buffers completed values and exception state until in-order

commit point Completed values can be used by dependents before committed

(bypassing) Each entry holds program counter, instruction type, destination

register specifier and value if any, and exception status (info often compressed to save hardware)

Memory reordering needs special data structures Speculative store address and data buffers Speculative load address and data buffers


10

Reorder Buffer Entry Each entry in the ROB contains four fields: 1. Instruction type

• a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)

2. Destination• Register number (for loads and ALU operations) or

memory address (for stores) where the instruction result should be written

3. Value• Value of instruction result until the instruction commits

4. Ready• Indicates that instruction has completed execution, and the value

is ready


11

Reorder Buffer Operation Holds instructions in FIFO order, exactly as issued When instructions complete, results placed into ROB

Supplies operands to other instruction between execution complete & commit more registers like RS

Tag results with ROB buffer number instead of reservation station Instructions commit values at head of ROB placed in registers As a result, easy to undo

speculated instructions on mispredicted branches or on exceptions Reorder

BufferFPOp

Queue

FP Adder FP AdderRes Stations Res Stations

FP Regs

Commit path


12

4 Steps of Speculative Tomasulo Algorithm1. Issue—get instruction from FP Op Queue

If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution—operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instrfrom reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)


13

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1F0 LD F0,10(R2) N

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers


14

2 ADDD R(F4),ROB12 ADDD R(F4),ROB1


ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F10F10F0

ADDD F10,F4,F0LD F0,10(R2)

NN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers


15

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1


ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers


16

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)


ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F0 ADDD F0,F4,F6 NF4 LD F4,0(R3) N-- BNE F2,<…> NF2F10F10F0


NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

5 0+R3


17

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)


ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

ROB5 ST 0(R3),F4ADDD F0,F4,F6

NN

F4 LD F4,0(R3) N-- BNE F2,<…> NF2F10F10F0


NNN

Done?

Dest Dest

Oldest

Newest

from Memory

Dest

Reorder Buffer

Registers

1 10+R25 0+R3


18

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)


ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

M[10] ST 0(R3),F4ADDD F0,F4,F6

YN

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> NF2F10F10F0


NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD M[10],R(F6)6 ADDD M[10],R(F6)


19



ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

M[10]<val2>

ST 0(R3),F4ADDD F0,F4,F6

YEx

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> NF2F10F10F0


NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers


20

--F0

M[10]<val2>

ST 0(R3),F4ADDD F0,F4,F6

YEx

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> N



ToMemory



FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F10F10F0


NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

What about memoryhazards???


21

Avoiding Memory Hazards WAW and WAR hazards through memory are eliminated

with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending

RAW hazards through memory are maintained by two restrictions: 1. not allowing a load to initiate the second step of its

execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and

2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.

these restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data


22

Exceptions and Interrupts Interrupts and Exceptions either interrupt the current instruction or

happen between instructions Possibly large quantities of state must be saved before interrupting

Machines with precise exceptions provide one single point in the program to restart execution All instructions before that point have completed No instructions after or including that point have completed

IBM 360/91 invented “imprecise interrupts” Computer stopped at this PC; its likely close to this address Not so popular with programmers Also, what about Virtual Memory? (Not in IBM 360)

Technique for both precise interrupts/exceptions and speculation: in-order issue, and in-order commit If we speculate and are wrong, need to back up and restart execution to

point at which we predicted incorrectly This is exactly same as need to do with precise exceptions

Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB If a speculated instruction raises an exception, the exception is recorded in

the ROB This is why reorder buffers are used in all new processors!


23

“Data-in-ROB” Design(HP PA8000, Pentium Pro, Core2Duo, Nehalem)

Managed as circular buffer in program order, new instructions dispatched to free slots, oldest instruction committed/reclaimed when done (“p” bit set on result)

Tag is given by index in ROB (Free pointer value) In dispatch, non-busy source operands read from architectural register

file and copied to Src1 and Src2 with presence bit “p” set. Busy operands copy tag of producer and clear “p” bit.

Set valid bit “v” on dispatch, set issued bit “i” on issue On completion, search source tags, set “p” bit and copy data into src

on tag match. Write result and exception flags to ROB. On commit, check exception status, and copy result into architectural

register file if no trap. On trap, flush machine and ROB, set free=oldest, jump to handler

Tagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv Opcode

Oldest

Free

24

Managing Rename for Data-in-ROB

If “p” bit set, then use value in architectural register file Else, tag field indicates instruction that will/has produced value For dispatch, read source operands <p,tag,value> from arch.

regfile, then also read <p,result> from producing instruction in ROB at tag index, bypassing as needed. Copy operands to ROB.

Write destination arch. register entry with <0,Free,_>, to assign tag to ROB index of this instruction

On commit, update arch. regfile with <1, _, Result> On trap, reset table (All p=1)

Tagp ValueTagp ValueTagp Value

Tagp Value

One entry per arch. register

Rename table associated with architectural registers, managed in decode/dispatch


25

ROB

Data Movement in Data-in-ROB DesignArchitectural Register

FileRead operands during decode

Read operands at issue

Functional Units

Read results for commit

Bypass newer values at dispatch

Result Data

Write results at completion

Write results at commit

Source Operands

Write sources in dispatch


26

Unified Physical Register File(MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy/Ivy Bridge) Rename all architectural registers into a single physical

register file during decode, no register values read Functional units read and write from single unified

register file holding committed and temporary registers in execute

Commit only updates mapping of architectural register to physical register, no data movement

Unified Physical Register File

Read operands at issue

Functional Units

Write results at completion

Committed Register Mapping

Decode Stage Register Mapping


27

Lifetime of Physical Registers

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

When can we reuse a physical register?When next writer of same architectural register commits

• Physical regfile holds committed and speculative values• Physical registers decoupled from ROB entries (no data in ROB)


28

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p


29


op p1 PR1 p2 PR2exuse Rd PRdLPRdROB


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8


30




Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1


31




Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3


32




Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2


33




Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4


34



x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x


35



x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7


36

Review: Improving the 5-stage Pipeline Performance


• Higher clock frequency (lower CCT): deeper pipelines – Overlap more instructions

• Higher CPIideal: wider pipelines – Insert multiple instruction in parallel in the pipeline • Lower CPIstall:

– Diversified pipelines for different functional units – Out-of-order execution

• Balance conflicting goals – Deeper & wider pipelines -> more control hazards – Branch prediction

• It is all works because of instruction-level parallelism (ILP)

37

Instruction-level Parallelism (ILP)


38

Deeper Pipelines


39

Deeper Pipelines Summary Advantages: higher clock frequency

– The workhorse behind multi-GHz processors – Opteron: 11, UltraSparc: 14, Power5: 17, Pentium4: 22/34

Cost – Complexity: more forwarding & stall cases

Disadvantages – More overlapping -> more dependencies -> more stalls

• CPIstall grows due to data and control hazards – Clock overhead becomes increasingly important – Power consumption


40

Limits to Pipeline Depth Each pipeline stage introduces some overhead (O)

- Delay of pipeline registers – Inequalities in work per stage

• Cannot break up work into stages at arbitrary points – Clock skew

• Clocks to different registers may not be perfectly aligned

If original CCT was T, with N stages CCT is T/N+O – If N -> ∞, speedup = T / (T/N+O) -> T/O

• Assuming that IC and CPI stay constant – Eventually overhead dominates and deeper pipelines

have diminishing returns


41

Wide or Superscalar Pipelines


42

Wide Pipelines Summary Advantages: lower CPIideal (1/N)

– Opteron: 3, UltraSparc: 4, Power5: 4, Pentium4: 3 Cost

– Need wider path to instruction cache– Need more ALUs, more register file ports, …– Complexity: more forwarding & stall cases to check

Disadvantages– Parallel execution -> more dependencies ->more stalls

• CPIstall grows due to data and control hazards


43

Diversified Pipelines w/o Reorder Buffers


44

A Modern Superscalar Out-of-Order Processor


45

Branch Penalty in Superscalar OOO Processors


46

The Challenges for Superscalar Processors Clock frequency: getting close to pipelining limits?

– Clocking overheads, CPI degradation Branch prediction & memory latency

– Limit the practical benefits of out-of-order execution Power consumption

– Gets worse with higher clock & more OOO logic Design complexity

– Grows exponentially with issue width Limited ILP?

• Alternative: single-chip (multicore) multiprocessors


47

Multiple-instruction Issue:Getting CPI below 1 CPI ≥ 1 if issue only 1 instruction every clock cycle Multiple-issue processors come in 3 flavors:

1. statically-scheduled superscalar processors,2. dynamically-scheduled superscalar processors, and 3. VLIW (very long instruction word) processors

2 types of superscalar processors issue varying numbers of instructions per clock use in-order execution if they are statically

scheduled (e.g., Cell SPU), or out-of-order execution if they are dynamically

scheduled (Intel Pentium) VLIW processors, in contrast, issue a fixed number of

instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)


48

VLIW: Very Large Instruction Word Each “instruction” has explicit coding for multiple operations

In IA-64, grouping called a “packet” In Transmeta, grouping called a “molecule” (with “atoms” as ops)

Tradeoff instruction space for simple decoding The long instruction word has room for many operations By definition, all the operations the compiler puts in the long

instruction word are independent => execute in parallel E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch

16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide

Need compiling technique that schedules across several branches


49

Recall: Unrolled Loop that Minimizes Stalls for Scalar1 Loop: L.D F0,0(R1)2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

L.D to ADD.D: 1 CycleADD.D to S.D: 2 Cycles


50

Loop Unrolling in VLIWMemory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchL.D F0,0(R1) L.D F6,-8(R1) 1L.D F10,-16(R1) L.D F14,-24(R1) 2L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4

ADD.D F20,F18,F2 ADD.D F24,F22,F2 5S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6S.D -16(R1),F12 S.D -24(R1),F16 7S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8S.D -0(R1),F28 BNEZ R1,LOOP 9

Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)Average: 2.5 ops per clock, 50% efficiencyNote: Need more registers in VLIW (15 vs. 6 in SS)


51

Problems with 1st Generation VLIW Increase in code size

generating enough operations in a straight-line code fragment requires ambitiously unrolling loops

whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding

Operated in lock-step; no hazard detection HW a stall in any functional unit pipeline caused entire processor to

stall, since all functional units must be kept synchronized Compiler might prediction function units, but caches hard to predict

Binary code compatibility Pure VLIW => different numbers of functional units and unit

latencies require different versions of the code


52

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)” IA-64: instruction set architecture 128 64-bit integer regs + 128 82-bit floating point regs

Not separate register files per functional unit as in old VLIW Hardware checks dependencies

(interlocks => binary compatibility over time) Predicated execution (select 1 out of 64 1-bit flags)

=> fewer mispredictions? Itanium™ was first implementation (2001)

Highly parallel and deeply pipelined hardware at 800Mhz 6-wide, 10-stage pipeline at 800Mhz on 0.18 µ process

Itanium 2™ is name of 2nd implementation (2005) 6-wide, 8-stage pipeline at 1666Mhz on 0.13 µ process Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3


53

Branch Hints

Memory Hints

InstructionCache& BranchPredictors

FetchMemory

Subsystem

Three levels of cache:L1, L2, L3

Register Stack & Rotation

Explicit Parallelism

128 GR &128 FR,RegisterRemap&Stack Engine

Register Handling

Fast, Simple 6-Issue

Issue Control

Micro-architecture Features in hardware:

Itanium™ EPIC Design (Copyright: Intel at Hotchips ’00)

Architecture Features programmed by compiler:

Predication Data & ControlSpeculation

Bypasses & Dependencies

Parallel Resources

4 Integer + 4 MMX Units

2 FMACs (4 for SSE)

2 LD/ST units

32 entry ALAT

Speculation Deferral ManagementILP: speculation and multiple-instruction issue

54

Increasing Instruction Fetch Bandwidth

• Predicts next instruct address, sends it out before decoding instruction

• PC of branch sent to BTB• When match is found, Predicted

PC is returned• If branch predicted taken,

instruction fetch continues at Predicted PC


55

IF BW: Return Address Predictor Small buffer of return

addresses acts as a stack

Caches most recent return addresses

Call Push a return address on stack

Return Pop an address off stack & predict as new PC

0%

10%

20%

30%

40%

50%

60%

70%

0 1 2 4 8 16Return address buffer entries

Mis

pre

dic

tio

n f

req

ue

ncy

gom88ksimcc1compressxlispijpegperlvortex


56

More Instruction Fetch Bandwidth Integrated branch prediction branch predictor is part of

instruction fetch unit and is constantly predicting branches Instruction prefetch Instruction fetch units prefetch to

deliver multiple instruct. per clock, integrating it with branch prediction

Instruction memory access and buffering Fetching multiple instructions per cycle:

May require accessing multiple cache blocks (prefetchto hide cost of crossing cache blocks)

Provides buffering, acting as on-demand unit to provide instructions to issue stage as needed and in quantity needed


57

Speculation: Register Renaming vs. ROB Alternative to ROB is a larger physical set of

registers combined with register renaming Extended registers replace function of both ROB and reservation

stations

Instruction issue maps names of architectural registers to physical register numbers in extended register set On issue, allocates a new unused register for the destination

(which avoids WAW and WAR hazards) Speculation recovery easy because a physical register holding an

instruction destination does not become the architectural register until the instruction commits

Most Out-of-Order processors today use extended registers with renaming


58

Register Renaming with Extended Set of Physical Registers


59

First Pentium-4: Willamette

Die Photo Heat Sink


60

The Intel Pentium 4 Microarchitecture


61

Pentium-4 Pipeline

20 Pipeline Stages Drive Wire Delay Trace-Cache: caching paths through the code for quick decoding. Renaming: similar to Tomasulo architecture Branch prediction and DATA prefetching!

Pentium (Original 586)Pentium-II (and III) (Original 686)


62

32nm Intel Quad-Core Sandy Bridge (2011)


63

Value Prediction Attempts to predict value produced by instruction

E.g., Loads a value that changes infrequently Value prediction is useful only if it significantly

increases ILP Focus of research has been on loads; so-so

results, no processor uses value prediction Related topic is address aliasing prediction

RAW for load and store or WAW for 2 stores Address alias prediction is both more stable and

simpler since need not actually predict the address values, only whether such values conflict Has been used by a few processors


64

Perspective Interest in multiple-issue because wanted to improve

performance without affecting uniprocessor programming model

Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice

Conservative in ideas, just faster clock and bigger Recent processors (Intel Pentium, IBM Power 5, AMD

Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 Clocks 10 to 20X faster, caches 5 to 20X bigger, 2 to

4X as many renaming registers, and 2X as many load-store units performance 8 to 16X

Peak v. delivered performance gap increasing


65

Limits to ILP Conflicting studies of amount

Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication

How much ILP is available using existing mechanisms with increasing HW budgets?

Do we need to invent new HW/SW mechanisms to keep on processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc.


66

Limits to ILPInitial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:

1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided2. Branch prediction – perfect; no mispredictions3. Jump prediction – all jumps perfectly predicted (returns, case statements)2 & 3 no control dependencies; perfect speculation & an unbounded buffer of instructions available4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW

Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle;


67

Model Power 5Instructions Issued per clock

Infinite 4

Instruction Window Size

Infinite 200

Renaming Registers

Infinite 48 integer + 40 Fl. Pt.

Branch Prediction Perfect 2% to 6% misprediction(Tournament Branch Predictor)

Cache Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias Analysis

Perfect ??

Limits to ILP HW Model Comparison


68

Upper Limit to ILP: Ideal Machine

Programs

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

Integer: 18 - 60

FP: 75 - 150

Inst

ruct

ions

Per

Clo

ck


69

New Model Model Power 5Instructions Issued per clock

Infinite Infinite 4


Infinite, 2K, 512, 128, 32

Infinite 200

Renaming Registers

Infinite Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

Perfect Perfect 2% to 6% misprediction(Tournament Branch Predictor)

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

Perfect Perfect ??



70

5563

18

75

119

150

36 41

15

61 59 60

10 15 12

49

16

45

10 13 11

35

15

34

8 8 914

914

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doduc tomcatv

Inst

ruct

ions

Per

Clo

ck

Inf inite 2048 512 128 32

More Realistic HW: Window ImpactChange from Infinite window 2048, 512, 128, 32 FP: 9 - 150

Integer: 8 - 63

IPC


71ILP: speculation and multiple-instruction issue


64 Infinite 4


2048 Infinite 200

Renaming Registers

Infinite Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none

Perfect 2% to 6% misprediction(Tournament Branch Predictor)


Memory Alias

Perfect Perfect ??


72

35

41

16

6158

60

9

1210

48

15

67 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

0

10

20

30

40

50

60

gcc espresso li fpppp doducd tomcatvProgram

Perfect Selective predictor Standard 2-bit Static None

More Realistic HW: Branch ImpactChange from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle

ProfileBHT (512)TournamentPerfect No prediction

FP: 15 - 45

Integer: 6 - 12

IPC


73

Misprediction Rates

1%

5%

14%12%

14%12%

1%

16%18%

23%

18%

30%

0%3% 2% 2%

4%6%

0%

5%

10%

15%

20%

25%

30%

35%

tomcatv doduc fpppp li espresso gcc

Mis

pred

ictio

n Ra

te

Profile-based 2-bit counter Tournament


74


64 Infinite 4


2048 Infinite 200

Renaming Registers

Infinite v. 256, 128, 64, 32, none

Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

8K 2-bit Perfect Tournament Branch Predictor


Memory Alias

Perfect Perfect Perfect



75

11

15

12

29

54

10

15

12

49

16

1013

12

35

15

44

9 10 11

20

11

28

5 5 6 5 57

4 45

45 5

59

45

0

10

20

30

40

50

60

70

gcc espresso li fpppp doducd tomcatvProgram

Infinite 256 128 64 32 None

More Realistic HW: Renaming Register Impact (N int + N fp)

Change 2048 instrwindow, 64 instr issue, 8K 2 level Prediction

64 None256Infinite 32128

Integer: 5 - 15

FP: 11 - 45

IPC


76

New Model Model Power 5

Instructions Issued per clock

64 Infinite 4


2048 Infinite 200

Renaming Registers

256 Int + 256 FP Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

8K 2-bit Perfect Tournament


Memory Alias

Perfect v. Stack v. Inspect v. none

Perfect Perfect



77

Program

0

5

10

15

20

25

30

35

40

45

50

gcc espresso li fpppp doducd tomcatv

10

15

12

49

16

45

7 79

49

16

4 5 4 46 5

35

3 3 4 4

45

Perfect Global/stack Perfect Inspection None

More Realistic HW: Memory Address Alias Impact

Change 2048 instrwindow, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

NoneGlobal/Stack perf;heap conflicts

Perfect Inspec.Assem.

FP: 4 - 45(Fortran,no heap)

Integer: 4 - 9

IPC


78

New Model Model Power 5

Instructions Issued per clock

64 (no restrictions)

Infinite 4


Infinite vs. 256, 128, 64, 32

Infinite 200

Renaming Registers

64 Int + 64 FP Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

1K 2-bit Perfect Tournament


Memory Alias

HW disambiguation

Perfect Perfect



79

Program

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 5 46

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Realistic HW: Window ImpactPerfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

64 16256Infinite 32128 8 4

Integer: 6 - 12

FP: 8 - 45

IPC


80

HW vs. SW to Increase ILP Memory disambiguation: HW best

WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage Can get conflicts via allocation of stack frames as a called procedure reuses

the memory addresses of a previous frame on the stack

Speculation: HW best when dynamic branch prediction better than compile

time prediction Exceptions easier for HW HW doesn’t need bookkeeping code or compensation code Very complicated to get right

Scheduling: SW can look ahead to schedule better

Compiler independence: does not require new compiler, recompilation to run well


81

In Conclusion Limits to ILP (power efficiency, compilers,

dependencies …) seem to limit to 3 to 6 issue for practical options

These are not laws of physics; just practical limits for today, and perhaps overcome via research

Compiler and ISA advances could change results


82

Acknowledgements These slides contain material developed and copyright

by: Morgan Kauffmann (Elsevier, Inc.) Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) Christos Kozyrakis (Stanford) David Patterson (UCB)


instruction-level parallelism (ilp): speculation, reorder ...midor/ese545/ca_ilp...

Documents