instruction level parallelism and dynamic execution #3:

71
CPSC614 Lec 4.1 Instruction Level Parallelism and Dynamic Execution #3: Based on Prof. David A. Patterson

Upload: luz

Post on 15-Jan-2016

46 views

Category:

Documents


0 download

DESCRIPTION

Instruction Level Parallelism and Dynamic Execution #3:. Based on Prof. David A. Patterson. Example: Evaluating Branch Alternatives. Assume: Conditional & Unconditional = 14%, 65% change PC SchedulingBranchCPIspeedup v. scheme penaltystall Stall pipeline31.421.0 - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.1

Instruction Level Parallelism and Dynamic Execution #3:

Based on

Prof. David A. Patterson

Page 2: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.2

Example: Evaluating Branch Alternatives

Assume: Conditional & Unconditional = 14%, 65% change PC

Scheduling Branch CPIspeedup v. scheme penalty stall

Stall pipeline 3 1.42 1.0Predict taken 1 1.14 1.26Predict not taken 1 1.09 1.29Delayed branch 0.5 1.07 1.31

Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty

Page 3: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.3

Three Stages of Tomasulo Algorithm

1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard),

control issues instr & sends operands (renames registers).

2. Execute—operate on operands (EX) When both operands ready then execute;

if not ready, watch Common Data Bus for result

3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units;

mark reservation station available

• Normal data bus: data + destination (“go to” bus)• Common data bus: data + source (“come from” bus)

– 64 bits of data + 4 bits of Functional Unit source address– Write if matches expected Functional Unit (produces result)– Does the broadcast

• Example speed: 3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /

Page 4: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.4

Tomasulo ExampleInstruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F300 FU

Clock cycle counter

FU countdown

Instruction stream

3 Load/Buffers

3 FP Adder R.S.2 FP Mult R.S.

Page 5: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.5

Tomasulo Example Cycle 1Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1

Page 6: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.6

Tomasulo Example Cycle 2Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1

Note: Can have multiple loads outstanding

Page 7: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.7

Tomasulo Example Cycle 3Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1

• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued

• Load1 completing; what is waiting for Load1?

Page 8: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.8

Tomasulo Example Cycle 4Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(A1) Add1

• Load2 completing; what is waiting for Load2?

Page 9: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.9

Tomasulo Example Cycle 5Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(A2) M(A1) Add1 Mult2

• Timer starts down for Add1, Mult1

Page 10: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.10

Tomasulo Example Cycle 6Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(A2) Add2 Add1 Mult2

• Issue ADDD here despite name dependency on F6?

Page 11: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.11

Tomasulo Example Cycle 7Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(A2) Add2 Add1 Mult2

• Add1 (SUBD) completing; what is waiting for it?

Page 12: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.12

Tomasulo Example Cycle 8Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(A2) Add2 (M-M) Mult2

Page 13: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.13

Tomasulo Example Cycle 9Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No1 Add2 Yes ADDD (M-M) M(A2)

Add3 No6 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 M(A2) Add2 (M-M) Mult2

Page 14: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.14

Tomasulo Example Cycle 10Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No0 Add2 Yes ADDD (M-M) M(A2)

Add3 No5 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?

Page 15: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.15

Tomasulo Example Cycle 11Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Write result of ADDD here?• All quick instructions complete in this cycle!

Page 16: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.16

Tomasulo Example Cycle 12Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

3 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 17: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.17

Tomasulo Example Cycle 13Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

2 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 18: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.18

Tomasulo Example Cycle 14Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

1 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

Page 19: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.19

Tomasulo Example Cycle 15Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it?

Page 20: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.20

Tomasulo Example Cycle 16Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Just waiting for Mult2 (DIVD) to complete

Page 21: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.21

Faster than light computation(skip a couple of cycles)

Page 22: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.22

Tomasulo Example Cycle 55Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

Page 23: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.23

Tomasulo Example Cycle 56Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

• Mult2 (DIVD) is completing; what is waiting for it?

Page 24: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.24

Tomasulo Example Cycle 57Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Result

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Page 25: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.25

Tomasulo Drawbacks

• Complexity– delays of 360/91, MIPS 10000, Alpha 21264,

IBM PPC 620 in CA:AQA 2/e, but not in silicon!

• Many associative stores (CDB) at high speed

• Performance limited by Common Data Bus– Each CDB must go to multiple functional units

high capacitance, high wiring density– Number of functional units that can complete per

cycle limited to one!» Multiple CDBs more FU logic for parallel assoc

stores

• Non-precise interrupts!– We will address this later

Page 26: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.26

Tomasulo Loop Example

Loop: LD F0 0 R1

MULTD F4 F0 F2

SD F4 0 R1

SUBI R1 R1 #8

BNEZ R1 Loop

• This time assume Multiply takes 4 clocks• Assume 1st load takes 8 clocks

(L1 cache miss), 2nd load takes 1 clock (hit)• To be clear, will show clocks for SUBI, BNEZ

– Reality: integer instructions ahead of Fl. Pt. Instructions

• Show 2 iterations

Page 27: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.27

Loop ExampleInstruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F300 80 Fu

Added Store Buffers

Value of Register used for address, iteration control

Instruction Loop

Iter-ationCount

Page 28: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.28

Loop Example Cycle 1Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 80

Load2 NoLoad3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Fu Load1

Page 29: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.29

Loop Example Cycle 2Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No

Load3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Fu Load1 Mult1

Page 30: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.30

Loop Example Cycle 3Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F303 80 Fu Load1 Mult1

• Implicit renaming sets up data flow graph

Page 31: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.31

Loop Example Cycle 4Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F304 80 Fu Load1 Mult1

• Dispatching SUBI Instruction (not in FP queue)

Page 32: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.32

Loop Example Cycle 5Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F305 72 Fu Load1 Mult1

• And, BNEZ instruction (not in FP queue)

Page 33: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.33

Loop Example Cycle 6Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1

Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F306 72 Fu Load2 Mult1

• Notice that F0 never sees Load from location 80

Page 34: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.34

Loop Example Cycle 7Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No

Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F307 72 Fu Load2 Mult2

• Register file completely detached from computation• First and Second iteration completely overlapped

Page 35: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.35

Loop Example Cycle 8Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F308 72 Fu Load2 Mult2

Page 36: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.36

Loop Example Cycle 9Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F309 72 Fu Load2 Mult2

• Load1 completing: who is waiting?• Note: Dispatching SUBI

Page 37: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.37

Loop Example Cycle 10Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3010 64 Fu Load2 Mult2

• Load2 completing: who is waiting?• Note: Dispatching BNEZ

Page 38: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.38

Loop Example Cycle 11Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3011 64 Fu Load3 Mult2

• Next load in sequence

Page 39: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.39

Loop Example Cycle 12Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3012 64 Fu Load3 Mult2

• Why not issue third multiply?

Page 40: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.40

Loop Example Cycle 13Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3013 64 Fu Load3 Mult2

• Why not issue third store?

Page 41: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.41

Loop Example Cycle 14Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3014 64 Fu Load3 Mult2

• Mult1 completing. Who is waiting?

Page 42: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.42

Loop Example Cycle 15Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3015 64 Fu Load3 Mult2

• Mult2 completing. Who is waiting?

Page 43: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.43

Loop Example Cycle 16Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3016 64 Fu Load3 Mult1

Page 44: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.44

Loop Example Cycle 17Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3017 64 Fu Load3 Mult1

Page 45: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.45

Loop Example Cycle 18Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3018 64 Fu Load3 Mult1

Page 46: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.46

Loop Example Cycle 19Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3019 56 Fu Load3 Mult1

Page 47: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.47

Loop Example Cycle 20Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 Yes 561 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3020 56 Fu Load1 Mult1

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Page 48: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.48

Why can Tomasulo overlap iterations of loops?

• Register renaming– Multiple iterations use different physical destinations

for registers (dynamic loop unrolling).

• Reservation stations – Permit instruction issue to advance past integer control

flow operations– Also buffer old values of registers - totally avoiding the

WAR stall that we saw in the scoreboard.

• Other perspective: Tomasulo building data flow dependency graph on the fly.

Page 49: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.49

Tomasulo’s scheme offers 2 major advantages

(1)the distribution of the hazard detection logic– distributed reservation stations and the CDB– If multiple instructions waiting on single result,

& each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB

– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards

Page 50: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.50

Dynamic Hardware Branch Prediction

• As the amount of ILP grows, control dependences rapidly become the limiting factor.

• Basic Schemes (Static)– Prediction Not Taken– Delayed Branch– Action taken does not depend on the dynamic behavior of

the branch.

Page 51: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.51

Dynamic Hardware Prediction

• Goal– Allow the processor to resolve the

outcome of a branch early, thus preventing control dependences from causing stalls.

Page 52: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.52

Branch-Prediction Buffer

• Branch history buffer• Small memory indexed by the lower

portion of the address of the branch instruction

• 1 bit says whether the branch was taken or not

• Simplest

Page 53: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.53

Example

• Consider a loop branch whose behavior is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch?

Even if a branch is almost always taken, we predict incorrectly twice when it is not taken.

Page 54: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.54

2-bit Prediction Scheme

• A prediction must miss twice before it is changed.

Page 55: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.55

n-bit Saturating Counter

• Values: 0 ~ 2n-1

• When the counter is greater than or equal to one-half of its maximum value, the branch is predicted as taken. Otherwise, not taken.

• Studies have shown that the 2-bit predictors do almost do well, and thus most systems rely on 2-bit branch predictors.

Page 56: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.56

4096-entry 2-bitprediction buffer

Page 57: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.57

Page 58: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.58

Correlating Branch Predictor

• It may possible to improve the accuracy if we look at the behavior of other branches.

if (aa == 2)aa = 0;

if (bb == 0)bb = 0;

if (aa != bb)

The behavior of b3 is correlated with the behavior of b1 and b2.

Page 59: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.59

Correlating Predictors

• Two-level predictors

if (d == 0)d = 1;

if (d == 1)

Page 60: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.60

initial value of d

b1 value of d before b2

b2

0

1

2

Page 61: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.61

1-bit Predictor (Initialized to NT)

d b1 predic

b1 action

new b1 pr

b2 predic

b2 action

new b2 pr

2

0

2

0

Page 62: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.62

(1,1) Predictor

• Every branch has two separate prediction bits.– First bit: the prediction if the last branch in the

program is not taken.– Second bit: the prediction if the last branch in

the program is taken.

• Write the pair of prediction bits together.

Page 63: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.63

Combinations & Meaning

Prediction bits Prediction if not taken

Prediction if taken

Page 64: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.64

(m,n) Predictor

• Uses the last m branches to choose from 2m branch predictors, each of which is an n-bit predictor.

• Yields higher prediction rates than 2-bit scheme

• Requires a trivial amount of additional hardware

• The global history of the most recent m branches are recorded in an m-bit shift register.

Page 65: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.65

Page 66: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.66

(m,n) Predictor

• Total number of bits:

= 2m x n x #prediction entries selected by the branch address

• Examples

Page 67: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.67

Page 68: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.68

Tournament Predictors

• Most popular multilevel branch predictors

Page 69: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.69

Tournament Predictors

• By using multiple predictors (one based on global information, one based on local information, and combining them with a selector), it can select the right predictor for the right branch.

• Alpha 21264– Uses most sophisticated branch predictor as of

2001.

Page 70: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.70

Page 71: Instruction Level Parallelism and Dynamic Execution #3:

CPSC614Lec 4.71