
Upload: rose-copeland

Post on 02-Jan-2016


Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91)

Please recall:
Multicycle instructions lead to the requirement of out-of-order execution.
Control-flow scheduling: performed centrally at decode time. ==> Scoreboarding technique, implemented in the CDC 6600.
Dataflow scheduling: performed in a distributed manner by the FUs themselves at execute time. Instructions are decoded and issued to reservation stations, where they await their operands. ==> Tomasulo scheme in the IBM System/360 Model 91, which is the basis of modern superscalar processors.


Scoreboard Summary

Main advantage:
  managing multiple FUs
  out-of-order execution of multi-cycle operations
  maintaining all data dependences (RAW, WAW, WAR)

Scoreboard limitations:
  single-issue scheme (however, the scheme is extendable to multiple issue)
  in-order issue
  no renaming: antidependences and output dependences may lead to WAR and WAW stalls
  no forwarding hardware: all results go through the registers

General limitations (not only valid for scoreboarding):
  number and types of FUs, since contention for FUs leads to structural hazards
  the amount of parallelism available in code (dependences lead to stalls)


The Tomasulo scheme removes some of the scoreboard limitations by forwarding and renaming hardware, but is still single issue and in-order issue.


Register Renaming

A name dependence occurs when two instructions Inst1 and Inst2 use the same register (or memory location), but there is no data transmitted between Inst1 and Inst2.

If the register is renamed so that Inst1 and Inst2 do not conflict, the two instructions can execute simultaneously or be reordered.

The technique that eliminates name dependences in registers, and thereby avoids WAR and WAW hazards, is called register renaming.

Register renaming can be done statically (= by compiler) or dynamically (= by hardware).

Tomasulo's algorithm performs register renaming in hardware!

Dynamic renaming in memory is much harder to perform!

Why?

Pointer aliasing problems.
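A minimal sketch of the renaming idea, assuming a hypothetical tuple format (op, dest, src1, src2) and hypothetical physical names P0, P1, ... that do not appear in the lecture. It is not Tomasulo's tag mechanism, which renames through reservation station tags; it only illustrates how giving every write a fresh name removes WAR and WAW conflicts while keeping RAW dependences. The four instructions are the sequence used in the scheduling example of this lecture.

```python
from itertools import count

def rename(program, arch_regs):
    """Rename every destination to a fresh physical name (hypothetical P0, P1, ...)."""
    fresh = (f"P{i}" for i in count())
    # Each architectural register starts out mapped to its own physical name.
    mapping = {r: next(fresh) for r in arch_regs}
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources read the current mapping, which preserves true (RAW) dependences.
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a new name, so WAR and WAW name conflicts disappear.
        mapping[dest] = next(fresh)
        renamed.append((op, mapping[dest], s1, s2))
    return renamed

program = [
    ("mul", "Reg1", "Reg3", "Reg5"),
    ("sub", "Reg2", "Reg4", "Reg3"),
    ("div", "Reg6", "Reg1", "Reg4"),
    ("add", "Reg4", "Reg2", "Reg3"),
]
regs = [f"Reg{i}" for i in range(1, 7)]
for instr in rename(program, regs):
    print(instr)
```

After renaming, div reads the old name of Reg4 while add writes a new one, so the two no longer conflict and could be reordered.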


Tomasulo Algorithm

Developed for the IBM 360/91 in 1967 (about 3 years after the CDC 6600).
Hazard detection and execution control are distributed among the functional units (vs. centralized in the scoreboard).
Reservation stations at each functional unit control when an instruction can begin execution at that unit.
The Common Data Bus broadcasts results to all reservation stations (of all FUs).
Loads and stores are treated as FUs as well.
Each register has additional flags.


Tomasulo Organization

[Figure: Tomasulo organization. Blocks shown: Instruction Unit (instructions from memory), Control, Registers, Load Buffers, Load/Store Reservation Stations, Memory, Reservation Stations, Operand Bus, Functional Units, Common Data Bus (CDB).]


Reservation Station Components

Each FU has one or more reservation stations.
The reservation station holds:
  instructions that have been issued and are awaiting execution at a functional unit,
  the operands for that instruction if they have already been computed (or the source of the operands otherwise),
  the information needed to control the instruction once it has begun execution.

The reservation stations buffer the operands of instructions waiting to be dispatched, eliminating the need to get the operands from registers (similar to forwarding).

The source operand fields store register values (scoreboarding: only pointers to the registers!) or pointers to the reservation stations that will produce the result.

WAR hazards are avoided because an operand is already stored in the reservation station even when a write to the same register is performed out of order.

WAW hazards are avoided because pointers to reservation stations, instead of register pointers, are used as tags on the CDB.


Reservation Station Entries

Empty: Indicates whether the reservation station is empty
InFU: Indicates the instruction is executed in the FU, remains until completion
Op: Operation to perform in the unit (e.g., + or –)
Dest: Tag of the Reservation
Src1, Src2: Value of source operands
RS1, RS2: Tag of the reservation stations producing source registers
Vld1, Vld2: Valid flags indicating whether the values are available
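The entry layout above could be written down as a record, as in the following sketch. The class name, field types, and the ready() helper are illustrative assumptions, not part of the lecture; Dest is assumed to hold the destination register number, as it does in the scheduling example later in this lecture, and a tag of 0 is assumed to mean "no producing station".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    tag: int                      # station number, used as the tag on the CDB
    empty: bool = True            # Empty
    in_fu: bool = False           # InFU
    op: Optional[str] = None      # Op
    dest: Optional[int] = None    # Dest (assumed: destination register number)
    src1: Optional[str] = None    # Src1: operand value, if already available
    vld1: bool = False            # Vld1
    rs1: int = 0                  # RS1: tag of the producing station, 0 = none
    src2: Optional[str] = None    # Src2
    vld2: bool = False            # Vld2
    rs2: int = 0                  # RS2

    def ready(self) -> bool:
        """True when the instruction can be dispatched to its FU."""
        return (not self.empty) and (not self.in_fu) and self.vld1 and self.vld2

# div waits for station 3 (mul) to produce its first operand; mul has both operands.
div = ReservationStation(tag=4, empty=False, op="div", dest=6,
                         vld1=False, rs1=3, src2="(R4)", vld2=True)
mul = ReservationStation(tag=3, empty=False, op="mul", dest=1,
                         src1="(R3)", vld1=True, src2="(R5)", vld2=True)
print(div.ready(), mul.ready())
```

Keeping the operand value and the producer tag side by side is what lets a station either start immediately or wait for exactly one identified result.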


Tomasulo Organization

[Figure: Tomasulo organization, data structures. Reservation stations (RS status) with the fields Empty, InFU, Op, Dest, Src1, Vld1, RS1, Src2, Vld2, RS2; registers (register status) with the fields R, Value, Vld, RS.]


CDB and Reservation Stations

After completion of the instruction from RS, a result token is formed and passed on the common data bus (CDB) to the register file and, by snooping, directly to all RSs (thus eliminating the need to get the operand value from a register).

The traffic passing on the CDB is continually monitored. A result on the CDB is copied into all RSs awaiting it. The CDB allows all units that are waiting for an operand to be loaded simultaneously. Hence, the RS fetches and buffers an operand as soon as it becomes available (dataflow principle).

The load buffers and load/store reservation stations hold data or addresses coming from and going to memory.

Register result status in register set: Indicates which reservation station will write each register, if one exists. Blank when there is no pending instruction that will write that register.
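The snooping described above can be sketched as a single broadcast step. The function name and the dictionary layout are assumptions made for illustration, and a tag of 0 is assumed to stand for "blank"; the sample state corresponds to the moment in the lecture's scheduling example when station 3 (mul) puts (R3)*(R5) on the CDB.

```python
def broadcast(tag, data, stations, registers):
    """Copy a result token (tag, data) into every station and register waiting for it."""
    for rs in stations:
        if rs["empty"]:
            continue
        for n in ("1", "2"):
            # A source is waiting if it is not valid and names the producing station.
            if not rs["vld" + n] and rs["rs" + n] == tag:
                rs["src" + n] = data
                rs["vld" + n] = True
                rs["rs" + n] = 0
    for reg in registers.values():
        # Register result status: only a register still tagged with this station is written.
        if reg["rs"] == tag:
            reg["value"], reg["vld"], reg["rs"] = data, True, 0

# div (station 4) waits for station 3; Reg1 is tagged 3, Reg6 is tagged 4.
div_station = {"empty": False, "src1": None, "vld1": False, "rs1": 3,
               "src2": "(R4)", "vld2": True, "rs2": 0}
registers = {1: {"value": None, "vld": False, "rs": 3},
             6: {"value": None, "vld": False, "rs": 4}}
broadcast(3, "(R3)*(R5)", [div_station], registers)
print(div_station["src1"], registers[1]["value"], registers[6]["vld"])
```

Because matching is done on station tags rather than register numbers, a register that has since been retagged by a later instruction is left untouched.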


Three Stages of Tomasulo Algorithm

1. Issue: get instruction from the instruction queue.
   If a reservation station is free, the Tomasulo algorithm issues the instruction and fetches operands from registers if possible. In-order issue!

2. Execution: operate on operands (EX).
   When both operands are ready, dispatch to the FU and execute; if not ready, watch the CDB for the result (check for RAWs). Out-of-order dispatch and execution!

3. Write result: finish execution (WB).
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

We assume: mul and div need 4 EX cycles, sub and add need 1 EX cycle.

Cycle 0    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  -               1    0
  2  -               1    0
  3  (R3)            1    0
  4  (R4)            1    0
  5  (R5)            1    0
  6  -               1    0

Reservation stations (RS status):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
  Sadd 1   1
  Sadd 2   1
  Smul 3   1
  Sdiv 4   1


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 1    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  -               1    0
  3  (R3)            1    0
  4  (R4)            1    0
  5  (R5)            1    0
  6  -               1    0

Reservation stations (RS status):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
  Sadd 1   1
  Sadd 2   1
  Smul 3   0      0     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   1


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 2    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  -               0    1
  3  (R3)            1    0
  4  (R4)            1    0
  5  (R5)            1    0
  6  -               1    0

Reservation stations (RS status):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2
  Sadd 1   0      0     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1
  Smul 3   0      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   1


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 3    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  -               0    1
  3  (R3)            1    0
  4  (R4)            1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   0      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1
  Smul 3   0      1     mul  1     (R3)       1     0    (R5)  1     0    3
  Sdiv 4   0      0     div  6     -          0     3    (R4)  1     0


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 4    token.tag: 1    token.data: (R4)-(R3)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  -               0    2
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   0      0     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   0      1     mul  1     (R3)       1     0    (R5)  1     0    2
  Sdiv 4   0      0     div  6     -          0     3    (R4)  1     0

sub writes its result on the CDB and frees its RS; add is issued to RS 2 and gets the result from the CDB in the same cycle.


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 5    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  -               0    2
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   0      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   0      1     mul  1     (R3)       1     0    (R5)  1     0    1
  Sdiv 4   0      0     div  6     -          0     3    (R4)  1     0


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 6    token.tag: 2    token.data: (R4)-(R3)+(R3)

Register status:
  R  Value           Vld  RS
  1  -               0    3
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   0      1     mul  1     (R3)       1     0    (R5)  1     0    0
  Sdiv 4   0      0     div  6     -          0     3    (R4)  1     0

add and mul complete in the same cycle and compete for the CDB; add gets the CDB, mul is deferred.

Please note the WAR hazard, which is resolved automatically: add updates Reg4 before div starts executing; however, div has already stored the previous value in its reservation station (only works with in-order issue!).


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 7    token.tag: 3    token.data: (R3)*(R5)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   0      0     div  6     (R3)*(R5)  1     0    (R4)  1     0


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 8    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 9    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    3


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 10    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    2


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 11    token.tag: (empty)    token.data: (empty)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  -               0    4

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   0      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    1


Tomasulo Scheduling

Program (in issue order): mul Reg1, Reg3, Reg5; sub Reg2, Reg4, Reg3; div Reg6, Reg1, Reg4; add Reg4, Reg2, Reg3

Cycle 12    token.tag: 4    token.data: (R3)*(R5)/(R4)

Register status:
  R  Value           Vld  RS
  1  (R3)*(R5)       1    0
  2  (R4)-(R3)       1    0
  3  (R3)            1    0
  4  (R4)-(R3)+(R3)  1    0
  5  (R5)            1    0
  6  (R3)*(R5)/(R4)  1    0

Reservation stations (RS status; Rem = remaining cycles in FU):
  Station  Empty  InFU  Op   Dest  Src1       Vld1  RS1  Src2  Vld2  RS2  Rem
  Sadd 1   1      1     sub  2     (R4)       1     0    (R3)  1     0
  Sadd 2   1      1     add  4     (R4)-(R3)  1     0    (R3)  1     0
  Smul 3   1      1     mul  1     (R3)       1     0    (R5)  1     0
  Sdiv 4   1      1     div  6     (R3)*(R5)  1     0    (R4)  1     0    0


Comment on the Original Tomasulo Scheme

In the original Tomasulo scheme, the CDB is reserved at least two cycles in advance:
  each instruction stays at least two cycles in the EX phase,
  CDB resource conflicts are solved at CDB reservation time (before execution).
In contrast, we assume CDB resource conflict resolution in the WB stage (see cycle 6 in the example).

What happens when an instruction is issued and one of its operands is on the CDB in the same cycle? Uncertain in the original Tomasulo paper! We assume the instruction snoops the CDB already in the issue phase (see cycle 4 in the example).
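The two assumptions above can be checked against the worked example with a small timing model. This is only a sketch that tracks cycle numbers, not operand values. Several of its choices are assumptions not stated in the lecture: the names simulate, LATENCY and POOL, the rule that the lowest station tag wins the CDB (chosen because it matches cycle 6 of the example), and the omission of any contention for the FUs themselves.

```python
from collections import deque

LATENCY = {"mul": 4, "div": 4, "sub": 1, "add": 1}                # EX cycles from the example
POOL = {"sub": (1, 2), "add": (1, 2), "mul": (3,), "div": (4,)}  # station tags per operation

def simulate(program):
    queue = deque(enumerate(program))
    reg_tag = {}    # register -> tag of the station that will write it
    busy = {}       # station tag -> state of the instruction it holds
    timeline = {}   # instruction index -> cycles of issue, EX start, write-back
    cycle = 0
    while queue or busy:
        cycle += 1
        # WB arbitration: lowest tag among finished stations wins (assumption).
        finished = [t for t, s in busy.items()
                    if s["done"] is not None and s["done"] <= cycle]
        writer = min(finished) if finished else None
        # Issue: in order, one per cycle, into a station free at the start of the cycle.
        if queue:
            idx, (op, dest, src1, src2) = queue[0]
            free = [t for t in POOL[op] if t not in busy]
            if free:
                queue.popleft()
                wait = {reg_tag[s] for s in (src1, src2) if s in reg_tag}
                busy[free[0]] = {"idx": idx, "op": op, "wait": wait, "done": None}
                reg_tag[dest] = free[0]
                timeline[idx] = {"issue": cycle}
        # CDB broadcast: also reaches the instruction issued in this same cycle.
        if writer is not None:
            timeline[busy[writer]["idx"]]["wb"] = cycle
            del busy[writer]
            for s in busy.values():
                s["wait"].discard(writer)
            for r in [r for r, t in reg_tag.items() if t == writer]:
                del reg_tag[r]
        # Dispatch: a station with all operands starts EX in the next cycle.
        for s in busy.values():
            if s["done"] is None and not s["wait"]:
                timeline[s["idx"]]["ex"] = cycle + 1
                s["done"] = cycle + 1 + LATENCY[s["op"]]
    return timeline

program = [("mul", "Reg1", "Reg3", "Reg5"), ("sub", "Reg2", "Reg4", "Reg3"),
           ("div", "Reg6", "Reg1", "Reg4"), ("add", "Reg4", "Reg2", "Reg3")]
for idx, times in sorted(simulate(program).items()):
    print(program[idx][0], times)
```

Under these assumptions the model writes back sub in cycle 4, add in cycle 6, mul in cycle 7 (deferred by one cycle) and div in cycle 12, in line with the tables of the example.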


Tomasulo Summary

Prevents the register file from becoming a bottleneck (forwarding from the CDB to the reservation stations).
Avoids WAR and WAW hazards.
Not limited to basic blocks (provided branch prediction).
Lasting contributions:
  dynamic scheduling
  register renaming in reservation stations

However: single-issue scheme, in-order issue scheme!

Implementation in IBM 360/91


IBM 360/91

Belongs to the IBM System/360 family, whose members all share the same ISA.
The IBM System/360 Model 91 was deeply pipelined (overall pipeline length was 20 stages).
Floating-point execution unit: two separate, fully pipelined floating-point FUs, the adder and the multiplier/divider. The FUs could be used concurrently.
Addition took two cycles, multiplication three cycles, and division eleven cycles.
Three reservation stations (RS) were associated with the adder, and two with the multiplier/divider.
A speculative branch prediction was used: the branch was predicted taken when the branch target instruction was within the last eight instructions.
Memory had a 10-cycle access; it was fully buffered and 32-way interleaved.
The processor could have up to 32 memory accesses pending to reduce latency. But no cache.


IBM 360/91

[Figure: IBM 360/91 floating-point unit. Blocks shown: Floating-Point Buffers (FLB), Floating-Point Operating Stack, Floating-Point Registers (FLR), Decoder, Reservation Stations, Add Unit, Multiply/Divide Unit, Common Data Bus (CDB); connections labeled From Instruction Unit, From Store Unit, To Store Unit.]


IBM 360/91 Implementation Details

The processor had about 120 000 gates implemented in ECL technology with a 60 ns basic CPU clock.

IBM produced about 12 of the IBM System/360 Model 91 and perhaps twice that number of Model 195 (which was based on Model 91 but had a faster cycle and incorporated a cache).


Lessons Learned from CISC

Modern processors use ideas from both the RISC and the CISC approach.
Out-of-order execution is not a new concept: it existed twenty-five years ago on CISC machines, on the CDC 6600 as scoreboarding and on the IBM System/360 Model 91 as the Tomasulo scheme.

Out-of-order scheduling is quite similar to dataflow and is referred to as micro dataflow by microprocessor researchers.

Next: Chapter 4: Multiple-issue (Superscalar Processors)