instruction-level parallelism (ilp): speculation, reorder ...midor/ese545/ca_ilp...

82
1 ILP: speculation and multiple-instruction issue ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW Computer Architecture

Upload: others

Post on 06-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

1ILP: speculation and multiple-instruction issue

ESE 545 Computer Architecture

Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

Computer Architecture

Page 2: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

2

Review from Last Lecture Leverage Implicit Parallelism for Performance:

Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP Dynamic HW exploiting ILP

Works when can’t know dependence at compile time

Can hide L1 cache misses Code for one machine runs well on another

ILP: speculation and multiple-instruction issue

Page 3: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

3

Review from Last lecture Reservations stations: renaming to larger set of registers + buffering

source operands Prevents registers as bottleneck Avoids WAR, WAW hazards Allows loop unrolling in HW

Not limited to basic blocks (integer units gets ahead, beyond branches)

Helps cache misses as well Lasting Contributions

Dynamic scheduling Register renaming Load/store disambiguation

360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

ILP: speculation and multiple-instruction issue

Page 4: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

4

Speculation to Greater ILP Greater ILP: Overcome control dependence

by hardware speculating on outcome of branches and executing program as if guesses were correct Speculation fetch, issue, and execute

instructions as if branch predictions were always correct

Dynamic scheduling only fetches and issues instructions

Essentially a data flow execution model: Operations execute as soon as their operands are available

ILP: speculation and multiple-instruction issue

Page 5: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

5

Speculation to Greater ILP 3 components of HW-based speculation:1. Dynamic branch prediction to choose which

instructions to execute 2. Speculation to allow execution of instructions

before control dependences are resolved + ability to undo effects of incorrectly

speculated sequence 3. Dynamic scheduling to deal with scheduling of

different combinations of basic blocks

ILP: speculation and multiple-instruction issue

Page 6: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

6

Adding Speculation to Tomasulo Must separate execution from allowing instruction

to finish or “commit” This additional step called instruction commit When an instruction is no longer speculative, allow

it to update the register file or memory Requires additional set of buffers to hold results of

instructions that have finished execution but have not committed

This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

ILP: speculation and multiple-instruction issue

Page 7: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

7

Phases of Instruction ExecutionFetch: Instruction bits retrieved from instruction cache.I‐cache

Fetch Buffer

Issue Buffer

Functional Units

ArchitecturalState

Execute: Instructions and operands issued to functional units. When execution completes, all results and exception flags are available.

Decode: Instructions dispatched to appropriate issue buffer

Reorder BufferCommit: Instruction irrevocably updates architectural state (aka “graduation”), or takes precise trap/interrupt.

PC

Commit

Decode/Rename

ILP: speculation and multiple-instruction issue

Page 8: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

8

Reorder Buffer (ROB) In Tomasulo’s algorithm, once an instruction writes its

result, any subsequently issued instructions will find result in the register file

With speculation, the register file is not updated until the instruction commits (we know definitively that the instruction should

execute) Thus, the ROB supplies operands in interval between

completion of instruction execution and instruction commit ROB is a source of operands for instructions, just as

reservation stations (RS) provide operands in Tomasulo’s algorithm

ROB extends architectured registers like RSILP: speculation and multiple-instruction issue

Page 9: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

9

Separating Completion from Commit Re-order buffer holds register results from

completion until commit Entries allocated in program order during decode Buffers completed values and exception state until in-order

commit point Completed values can be used by dependents before committed

(bypassing) Each entry holds program counter, instruction type, destination

register specifier and value if any, and exception status (info often compressed to save hardware)

Memory reordering needs special data structures Speculative store address and data buffers Speculative load address and data buffers

ILP: speculation and multiple-instruction issue

Page 10: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

10

Reorder Buffer Entry Each entry in the ROB contains four fields: 1. Instruction type

• a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)

2. Destination• Register number (for loads and ALU operations) or

memory address (for stores) where the instruction result should be written

3. Value• Value of instruction result until the instruction commits

4. Ready• Indicates that instruction has completed execution, and the value

is ready

ILP: speculation and multiple-instruction issue

Page 11: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

11

Reorder Buffer Operation Holds instructions in FIFO order, exactly as issued When instructions complete, results placed into ROB

Supplies operands to other instruction between execution complete & commit more registers like RS

Tag results with ROB buffer number instead of reservation station Instructions commit values at head of ROB placed in registers As a result, easy to undo

speculated instructions on mispredicted branches or on exceptions Reorder

BufferFPOp

Queue

FP Adder FP AdderRes Stations Res Stations

FP Regs

Commit path

ILP: speculation and multiple-instruction issue

Page 12: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

12

4 Steps of Speculative Tomasulo Algorithm1. Issue—get instruction from FP Op Queue

If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution—operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instrfrom reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)

ILP: speculation and multiple-instruction issue

Page 13: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

13

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1F0 LD F0,10(R2) N

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

ILP: speculation and multiple-instruction issue

Page 14: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

14

2 ADDD R(F4),ROB12 ADDD R(F4),ROB1

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F10F10F0

ADDD F10,F4,F0LD F0,10(R2)

NN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

ILP: speculation and multiple-instruction issue

Page 15: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

15

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

ILP: speculation and multiple-instruction issue

Page 16: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

16

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F0 ADDD F0,F4,F6 NF4 LD F4,0(R3) N-- BNE F2,<…> NF2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

5 0+R3

ILP: speculation and multiple-instruction issue

Page 17: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

17

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

ROB5 ST 0(R3),F4ADDD F0,F4,F6

NN

F4 LD F4,0(R3) N-- BNE F2,<…> NF2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

Dest

Reorder Buffer

Registers

1 10+R25 0+R3

ILP: speculation and multiple-instruction issue

Page 18: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

18

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

M[10] ST 0(R3),F4ADDD F0,F4,F6

YN

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> NF2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD M[10],R(F6)6 ADDD M[10],R(F6)

ILP: speculation and multiple-instruction issue

Page 19: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

19

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

--F0

M[10]<val2>

ST 0(R3),F4ADDD F0,F4,F6

YEx

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> NF2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

ILP: speculation and multiple-instruction issue

Page 20: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

20

--F0

M[10]<val2>

ST 0(R3),F4ADDD F0,F4,F6

YEx

F4 M[10] LD F4,0(R3) Y-- BNE F2,<…> N

3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1

Tomasulo With Reorder Buffer:

ToMemory

FP addersFP adders FP multipliersFP multipliers

Reservation Stations

FP OpQueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F2F10F10F0

DIVD F2,F10,F6ADDD F10,F4,F0LD F0,10(R2)

NNN

Done?

Dest Dest

Oldest

Newest

from Memory

1 10+R2Dest

Reorder Buffer

Registers

What about memoryhazards???

ILP: speculation and multiple-instruction issue

Page 21: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

21

Avoiding Memory Hazards WAW and WAR hazards through memory are eliminated

with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending

RAW hazards through memory are maintained by two restrictions: 1. not allowing a load to initiate the second step of its

execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and

2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.

these restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data

ILP: speculation and multiple-instruction issue

Page 22: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

22

Exceptions and Interrupts Interrupts and Exceptions either interrupt the current instruction or

happen between instructions Possibly large quantities of state must be saved before interrupting

Machines with precise exceptions provide one single point in the program to restart execution All instructions before that point have completed No instructions after or including that point have completed

IBM 360/91 invented “imprecise interrupts” Computer stopped at this PC; its likely close to this address Not so popular with programmers Also, what about Virtual Memory? (Not in IBM 360)

Technique for both precise interrupts/exceptions and speculation: in-order issue, and in-order commit If we speculate and are wrong, need to back up and restart execution to

point at which we predicted incorrectly This is exactly same as need to do with precise exceptions

Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB If a speculated instruction raises an exception, the exception is recorded in

the ROB This is why reorder buffers are used in all new processors!

ILP: speculation and multiple-instruction issue

Page 23: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

23

“Data-in-ROB” Design(HP PA8000, Pentium Pro, Core2Duo, Nehalem)

Managed as circular buffer in program order, new instructions dispatched to free slots, oldest instruction committed/reclaimed when done (“p” bit set on result)

Tag is given by index in ROB (Free pointer value) In dispatch, non-busy source operands read from architectural register

file and copied to Src1 and Src2 with presence bit “p” set. Busy operands copy tag of producer and clear “p” bit.

Set valid bit “v” on dispatch, set issued bit “i” on issue On completion, search source tags, set “p” bit and copy data into src

on tag match. Write result and exception flags to ROB. On commit, check exception status, and copy result into architectural

register file if no trap. On trap, flush machine and ROB, set free=oldest, jump to handler

Tagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv Opcode

Oldest

Free

Page 24: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

24

Managing Rename for Data-in-ROB

If “p” bit set, then use value in architectural register file Else, tag field indicates instruction that will/has produced value For dispatch, read source operands <p,tag,value> from arch.

regfile, then also read <p,result> from producing instruction in ROB at tag index, bypassing as needed. Copy operands to ROB.

Write destination arch. register entry with <0,Free,_>, to assign tag to ROB index of this instruction

On commit, update arch. regfile with <1, _, Result> On trap, reset table (All p=1)

Tagp ValueTagp ValueTagp Value

Tagp Value

One entry per arch. register

Rename table associated with architectural registers, managed in decode/dispatch

ILP: speculation and multiple-instruction issue

Page 25: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

25

ROB

Data Movement in Data-in-ROB DesignArchitectural Register 

FileRead operands during decode

Read operands at issue

Functional Units

Read results for commit

Bypass newer values at dispatch

Result Data

Write results at completion

Write results at commit

Source Operands

Write sources in dispatch

ILP: speculation and multiple-instruction issue

Page 26: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

26

Unified Physical Register File(MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy/Ivy Bridge) Rename all architectural registers into a single physical

register file during decode, no register values read Functional units read and write from single unified

register file holding committed and temporary registers in execute

Commit only updates mapping of architectural register to physical register, no data movement

Unified Physical Register File

Read operands at issue

Functional Units

Write results at completion

Committed Register Mapping

Decode Stage Register Mapping

ILP: speculation and multiple-instruction issue

Page 27: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

27

Lifetime of Physical Registers

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

When can we reuse a physical register?When next writer of same architectural register commits

• Physical regfile holds committed and speculative values• Physical registers decoupled from ROB entries (no data in ROB)

ILP: speculation and multiple-instruction issue

Page 28: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

28

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p

ILP: speculation and multiple-instruction issue

Page 29: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

29

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8

ILP: speculation and multiple-instruction issue

Page 30: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

30

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1

ILP: speculation and multiple-instruction issue

Page 31: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

31

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3

ILP: speculation and multiple-instruction issue

Page 32: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

32

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2

ILP: speculation and multiple-instruction issue

Page 33: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

33

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

ILP: speculation and multiple-instruction issue

Page 34: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

34

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

ILP: speculation and multiple-instruction issue

Page 35: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

35

Physical Register Management

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7

ILP: speculation and multiple-instruction issue

Page 36: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

36

Review: Improving the 5-stage Pipeline Performance

ILP: speculation and multiple-instruction issue

• Higher clock frequency (lower CCT): deeper pipelines – Overlap more instructions

• Higher CPIideal: wider pipelines – Insert multiple instruction in parallel in the pipeline • Lower CPIstall:

– Diversified pipelines for different functional units – Out-of-order execution

• Balance conflicting goals – Deeper & wider pipelines -> more control hazards – Branch prediction

• It is all works because of instruction-level parallelism (ILP)

Page 37: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

37

Instruction-level Parallelism (ILP)

ILP: speculation and multiple-instruction issue

Page 38: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

38

Deeper Pipelines

ILP: speculation and multiple-instruction issue

Page 39: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

39

Deeper Pipelines Summary Advantages: higher clock frequency

– The workhorse behind multi-GHz processors – Opteron: 11, UltraSparc: 14, Power5: 17, Pentium4: 22/34

Cost – Complexity: more forwarding & stall cases

Disadvantages – More overlapping -> more dependencies -> more stalls

• CPIstall grows due to data and control hazards – Clock overhead becomes increasingly important – Power consumption

ILP: speculation and multiple-instruction issue

Page 40: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

40

Limits to Pipeline Depth Each pipeline stage introduces some overhead (O)

- Delay of pipeline registers – Inequalities in work per stage

• Cannot break up work into stages at arbitrary points – Clock skew

• Clocks to different registers may not be perfectly aligned

If original CCT was T, with N stages CCT is T/N+O – If N -> ∞, speedup = T / (T/N+O) -> T/O

• Assuming that IC and CPI stay constant – Eventually overhead dominates and deeper pipelines

have diminishing returns

ILP: speculation and multiple-instruction issue

Page 41: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

41

Wide or Superscalar Pipelines

ILP: speculation and multiple-instruction issue

Page 42: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

42

Wide Pipelines Summary Advantages: lower CPIideal (1/N)

– Opteron: 3, UltraSparc: 4, Power5: 4, Pentium4: 3 Cost

– Need wider path to instruction cache– Need more ALUs, more register file ports, …– Complexity: more forwarding & stall cases to check

Disadvantages– Parallel execution -> more dependencies ->more stalls

• CPIstall grows due to data and control hazards

ILP: speculation and multiple-instruction issue

Page 43: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

43

Diversified Pipelines w/o Reorder Buffers

ILP: speculation and multiple-instruction issue

Page 44: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

44

A Modern Superscalar Out-of-Order Processor

ILP: speculation and multiple-instruction issue

Page 45: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

45

Branch Penalty in Superscalar OOO Processors

ILP: speculation and multiple-instruction issue

Page 46: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

46

The Challenges for Superscalar Processors Clock frequency: getting close to pipelining limits?

– Clocking overheads, CPI degradation Branch prediction & memory latency

– Limit the practical benefits of out-of-order execution Power consumption

– Gets worse with higher clock & more OOO logic Design complexity

– Grows exponentially with issue width Limited ILP?

• Alternative: single-chip (multicore) multiprocessors

ILP: speculation and multiple-instruction issue

Page 47: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

47

Multiple-instruction Issue:Getting CPI below 1 CPI ≥ 1 if issue only 1 instruction every clock cycle Multiple-issue processors come in 3 flavors:

1. statically-scheduled superscalar processors,2. dynamically-scheduled superscalar processors, and 3. VLIW (very long instruction word) processors

2 types of superscalar processors issue varying numbers of instructions per clock use in-order execution if they are statically

scheduled (e.g., Cell SPU), or out-of-order execution if they are dynamically

scheduled (Intel Pentium) VLIW processors, in contrast, issue a fixed number of

instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)

ILP: speculation and multiple-instruction issue

Page 48: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

48

VLIW: Very Large Instruction Word Each “instruction” has explicit coding for multiple operations

In IA-64, grouping called a “packet” In Transmeta, grouping called a “molecule” (with “atoms” as ops)

Tradeoff instruction space for simple decoding The long instruction word has room for many operations By definition, all the operations the compiler puts in the long

instruction word are independent => execute in parallel E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch

16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide

Need compiling technique that schedules across several branches

ILP: speculation and multiple-instruction issue

Page 49: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

49

Recall: Unrolled Loop that Minimizes Stalls for Scalar1 Loop: L.D F0,0(R1)2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

L.D to ADD.D: 1 CycleADD.D to S.D: 2 Cycles

ILP: speculation and multiple-instruction issue

Page 50: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

50

Loop Unrolling in VLIWMemory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchL.D F0,0(R1) L.D F6,-8(R1) 1L.D F10,-16(R1) L.D F14,-24(R1) 2L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4

ADD.D F20,F18,F2 ADD.D F24,F22,F2 5S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6S.D -16(R1),F12 S.D -24(R1),F16 7S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8S.D -0(R1),F28 BNEZ R1,LOOP 9

Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)Average: 2.5 ops per clock, 50% efficiencyNote: Need more registers in VLIW (15 vs. 6 in SS)

ILP: speculation and multiple-instruction issue

Page 51: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

51

Problems with 1st Generation VLIW Increase in code size

generating enough operations in a straight-line code fragment requires ambitiously unrolling loops

whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding

Operated in lock-step; no hazard detection HW a stall in any functional unit pipeline caused entire processor to

stall, since all functional units must be kept synchronized Compiler might prediction function units, but caches hard to predict

Binary code compatibility Pure VLIW => different numbers of functional units and unit

latencies require different versions of the code

ILP: speculation and multiple-instruction issue

Page 52: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

52

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)” IA-64: instruction set architecture 128 64-bit integer regs + 128 82-bit floating point regs

Not separate register files per functional unit as in old VLIW Hardware checks dependencies

(interlocks => binary compatibility over time) Predicated execution (select 1 out of 64 1-bit flags)

=> fewer mispredictions? Itanium™ was first implementation (2001)

Highly parallel and deeply pipelined hardware at 800Mhz 6-wide, 10-stage pipeline at 800Mhz on 0.18 µ process

Itanium 2™ is name of 2nd implementation (2005) 6-wide, 8-stage pipeline at 1666Mhz on 0.13 µ process Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

ILP: speculation and multiple-instruction issue

Page 53: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

53

Branch Hints

Memory Hints

InstructionCache& BranchPredictors

FetchMemory

Subsystem

Three levels of cache:L1, L2, L3

Register Stack & Rotation

Explicit Parallelism

128 GR &128 FR,RegisterRemap&Stack Engine

Register Handling

Fast, Simple 6-Issue

Issue Control

Micro-architecture Features in hardware:

Itanium™ EPIC Design (Copyright: Intel at Hotchips ’00)

Architecture Features programmed by compiler:

Predication Data & ControlSpeculation

Bypasses & Dependencies

Parallel Resources

4 Integer + 4 MMX Units

2 FMACs (4 for SSE)

2 LD/ST units

32 entry ALAT

Speculation Deferral ManagementILP: speculation and multiple-instruction issue

Page 54: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

54

Increasing Instruction Fetch Bandwidth

• Predicts next instruct address, sends it out before decoding instruction

• PC of branch sent to BTB• When match is found, Predicted

PC is returned• If branch predicted taken,

instruction fetch continues at Predicted PC

ILP: speculation and multiple-instruction issue

Page 55: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

55

IF BW: Return Address Predictor Small buffer of return

addresses acts as a stack

Caches most recent return addresses

Call Push a return address on stack

Return Pop an address off stack & predict as new PC

0%

10%

20%

30%

40%

50%

60%

70%

0 1 2 4 8 16Return address buffer entries

Mis

pre

dic

tio

n f

req

ue

ncy

gom88ksimcc1compressxlispijpegperlvortex

ILP: speculation and multiple-instruction issue

Page 56: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

56

More Instruction Fetch Bandwidth Integrated branch prediction branch predictor is part of

instruction fetch unit and is constantly predicting branches Instruction prefetch Instruction fetch units prefetch to

deliver multiple instruct. per clock, integrating it with branch prediction

Instruction memory access and buffering Fetching multiple instructions per cycle:

May require accessing multiple cache blocks (prefetchto hide cost of crossing cache blocks)

Provides buffering, acting as on-demand unit to provide instructions to issue stage as needed and in quantity needed

ILP: speculation and multiple-instruction issue

Page 57: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

57

Speculation: Register Renaming vs. ROB Alternative to ROB is a larger physical set of

registers combined with register renaming Extended registers replace function of both ROB and reservation

stations

Instruction issue maps names of architectural registers to physical register numbers in extended register set On issue, allocates a new unused register for the destination

(which avoids WAW and WAR hazards) Speculation recovery easy because a physical register holding an

instruction destination does not become the architectural register until the instruction commits

Most Out-of-Order processors today use extended registers with renaming

ILP: speculation and multiple-instruction issue

Page 58: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

58

Register Renaming with Extended Set of Physical Registers

ILP: speculation and multiple-instruction issue

Page 59: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

59

First Pentium-4: Willamette

Die Photo Heat Sink

ILP: speculation and multiple-instruction issue

Page 60: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

60

The Intel Pentium 4 Microarchitecture

ILP: speculation and multiple-instruction issue

Page 61: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

61

Pentium-4 Pipeline

20 Pipeline Stages Drive Wire Delay Trace-Cache: caching paths through the code for quick decoding. Renaming: similar to Tomasulo architecture Branch prediction and DATA prefetching!

Pentium (Original 586)Pentium-II (and III) (Original 686)

ILP: speculation and multiple-instruction issue

Page 62: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

62

32nm Intel Quad-Core Sandy Bridge (2011)

ILP: speculation and multiple-instruction issue

Page 63: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

63

Value Prediction Attempts to predict value produced by instruction

E.g., Loads a value that changes infrequently Value prediction is useful only if it significantly

increases ILP Focus of research has been on loads; so-so

results, no processor uses value prediction Related topic is address aliasing prediction

RAW for load and store or WAW for 2 stores Address alias prediction is both more stable and

simpler since need not actually predict the address values, only whether such values conflict Has been used by a few processors

ILP: speculation and multiple-instruction issue

Page 64: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

64

Perspective Interest in multiple-issue because wanted to improve

performance without affecting uniprocessor programming model

Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice

Conservative in ideas, just faster clock and bigger Recent processors (Intel Pentium, IBM Power 5, AMD

Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995 Clocks 10 to 20X faster, caches 5 to 20X bigger, 2 to

4X as many renaming registers, and 2X as many load-store units performance 8 to 16X

Peak v. delivered performance gap increasing

ILP: speculation and multiple-instruction issue

Page 65: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

65

Limits to ILP Conflicting studies of amount

Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication

How much ILP is available using existing mechanisms with increasing HW budgets?

Do we need to invent new HW/SW mechanisms to keep on processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc.

ILP: speculation and multiple-instruction issue

Page 66: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

66

Limits to ILPInitial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start:

1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided2. Branch prediction – perfect; no mispredictions3. Jump prediction – all jumps perfectly predicted (returns, case statements)2 & 3 no control dependencies; perfect speculation & an unbounded buffer of instructions available4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW

Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle;

ILP: speculation and multiple-instruction issue

Page 67: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

67

Model Power 5Instructions Issued per clock

Infinite 4

Instruction Window Size

Infinite 200

Renaming Registers

Infinite 48 integer + 40 Fl. Pt.

Branch Prediction Perfect 2% to 6% misprediction(Tournament Branch Predictor)

Cache Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias Analysis

Perfect ??

Limits to ILP HW Model Comparison

ILP: speculation and multiple-instruction issue

Page 68: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

68

Upper Limit to ILP: Ideal Machine

Programs

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

Integer: 18 - 60

FP: 75 - 150

Inst

ruct

ions

Per

Clo

ck

ILP: speculation and multiple-instruction issue

Page 69: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

69

New Model Model Power 5Instructions Issued per clock

Infinite Infinite 4

Instruction Window Size

Infinite, 2K, 512, 128, 32

Infinite 200

Renaming Registers

Infinite Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

Perfect Perfect 2% to 6% misprediction(Tournament Branch Predictor)

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

Perfect Perfect ??

Limits to ILP HW Model Comparison

ILP: speculation and multiple-instruction issue

Page 70: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

70

5563

18

75

119

150

36 41

15

61 59 60

10 15 12

49

16

45

10 13 11

35

15

34

8 8 914

914

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doduc tomcatv

Inst

ruct

ions

Per

Clo

ck

Inf inite 2048 512 128 32

More Realistic HW: Window ImpactChange from Infinite window 2048, 512, 128, 32 FP: 9 - 150

Integer: 8 - 63

IPC

ILP: speculation and multiple-instruction issue

Page 71: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

71ILP: speculation and multiple-instruction issue

New Model Model Power 5Instructions Issued per clock

64 Infinite 4

Instruction Window Size

2048 Infinite 200

Renaming Registers

Infinite Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none

Perfect 2% to 6% misprediction(Tournament Branch Predictor)

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

Perfect Perfect ??

Limits to ILP HW Model Comparison

Page 72: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

72

35

41

16

6158

60

9

1210

48

15

67 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

0

10

20

30

40

50

60

gcc espresso li fpppp doducd tomcatvProgram

Perfect Selective predictor Standard 2-bit Static None

More Realistic HW: Branch ImpactChange from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle

ProfileBHT (512)TournamentPerfect No prediction

FP: 15 - 45

Integer: 6 - 12

IPC

ILP: speculation and multiple-instruction issue

Page 73: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

73

Misprediction Rates

1%

5%

14%12%

14%12%

1%

16%18%

23%

18%

30%

0%3% 2% 2%

4%6%

0%

5%

10%

15%

20%

25%

30%

35%

tomcatv doduc fpppp li espresso gcc

Mis

pred

ictio

n Ra

te

Profile-based 2-bit counter Tournament

ILP: speculation and multiple-instruction issue

Page 74: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

74

New Model Model Power 5Instructions Issued per clock

64 Infinite 4

Instruction Window Size

2048 Infinite 200

Renaming Registers

Infinite v. 256, 128, 64, 32, none

Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

8K 2-bit Perfect Tournament Branch Predictor

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

Perfect Perfect Perfect

Limits to ILP HW Model Comparison

ILP: speculation and multiple-instruction issue

Page 75: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

75

11

15

12

29

54

10

15

12

49

16

1013

12

35

15

44

9 10 11

20

11

28

5 5 6 5 57

4 45

45 5

59

45

0

10

20

30

40

50

60

70

gcc espresso li fpppp doducd tomcatvProgram

Infinite 256 128 64 32 None

More Realistic HW: Renaming Register Impact (N int + N fp)

Change 2048 instrwindow, 64 instr issue, 8K 2 level Prediction

64 None256Infinite 32128

Integer: 5 - 15

FP: 11 - 45

IPC

ILP: speculation and multiple-instruction issue

Page 76: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

76

New Model Model Power 5

Instructions Issued per clock

64 Infinite 4

Instruction Window Size

2048 Infinite 200

Renaming Registers

256 Int + 256 FP Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

8K 2-bit Perfect Tournament

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

Perfect v. Stack v. Inspect v. none

Perfect Perfect

Limits to ILP HW Model Comparison

ILP: speculation and multiple-instruction issue

Page 77: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

77

Program

0

5

10

15

20

25

30

35

40

45

50

gcc espresso li fpppp doducd tomcatv

10

15

12

49

16

45

7 79

49

16

4 5 4 46 5

35

3 3 4 4

45

Perfect Global/stack Perfect Inspection None

More Realistic HW: Memory Address Alias Impact

Change 2048 instrwindow, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

NoneGlobal/Stack perf;heap conflicts

Perfect Inspec.Assem.

FP: 4 - 45(Fortran,no heap)

Integer: 4 - 9

IPC

ILP: speculation and multiple-instruction issue

Page 78: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

78

New Model Model Power 5

Instructions Issued per clock

64 (no restrictions)

Infinite 4

Instruction Window Size

Infinite vs. 256, 128, 64, 32

Infinite 200

Renaming Registers

64 Int + 64 FP Infinite 48 integer + 40 Fl. Pt.

Branch Prediction

1K 2-bit Perfect Tournament

Cache Perfect Perfect 64KI, 32KD, 1.92MB L2, 36 MB L3

Memory Alias

HW disambiguation

Perfect Perfect

Limits to ILP HW Model Comparison

ILP: speculation and multiple-instruction issue

Page 79: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

79

Program

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 5 46

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Realistic HW: Window ImpactPerfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

64 16256Infinite 32128 8 4

Integer: 6 - 12

FP: 8 - 45

IPC

ILP: speculation and multiple-instruction issue

Page 80: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

80

HW vs. SW to Increase ILP Memory disambiguation: HW best

WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage Can get conflicts via allocation of stack frames as a called procedure reuses

the memory addresses of a previous frame on the stack

Speculation: HW best when dynamic branch prediction better than compile

time prediction Exceptions easier for HW HW doesn’t need bookkeeping code or compensation code Very complicated to get right

Scheduling: SW can look ahead to schedule better

Compiler independence: does not require new compiler, recompilation to run well

ILP: speculation and multiple-instruction issue

Page 81: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

81

In Conclusion Limits to ILP (power efficiency, compilers,

dependencies …) seem to limit to 3 to 6 issue for practical options

These are not laws of physics; just practical limits for today, and perhaps overcome via research

Compiler and ISA advances could change results

ILP: speculation and multiple-instruction issue

Page 82: Instruction-Level Parallelism (ILP): Speculation, Reorder ...midor/ESE545/CA_ILP speculation...Instruction Level Parallelism ... 360/91 descendants are Pentium 4, Power 5, AMD Athlon/Opteron,

82

Acknowledgements These slides contain material developed and copyright

by: Morgan Kauffmann (Elsevier, Inc.) Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) Christos Kozyrakis (Stanford) David Patterson (UCB)

ILP: speculation and multiple-instruction issue