ece 4750 computer architecture, fall 2019 t02 fundamental ...€¦ · t02 fundamental processor...

ECE 4750 Computer Architecture, Fall 2019

T02 Fundamental Processor Microarchitecture

School of Electrical and Computer EngineeringCornell University

revision: 2019-09-10-15-43

1 Processor Microarchitectural Design Patterns 4

1.1. Transactions and Steps . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2. Microarchitecture: Control/Datapath Split . . . . . . . . . . . . . . 5

2 TinyRV1 Single-Cycle Processor 6

2.1. High-Level Idea for Single-Cycle Processors . . . . . . . . . . . . . . 7

2.2. Single-Cycle Processor Datapath . . . . . . . . . . . . . . . . . . . . 8

2.3. Single-Cycle Processor Control Unit . . . . . . . . . . . . . . . . . . 14

2.4. Analyzing Performance . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 TinyRV1 FSM Processor 17

3.1. High-Level Idea for FSM Processors . . . . . . . . . . . . . . . . . . 18

3.2. FSM Processor Datapath . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3. FSM Processor Control Unit . . . . . . . . . . . . . . . . . . . . . . . 25

3.4. Analyzing Performance . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 TinyRV1 Pipelined Processor 31

4.1. High-Level Idea for Pipelined Processors . . . . . . . . . . . . . . . 32

4.2. Pipelined Processor Datapath and Control Unit . . . . . . . . . . . . 34

5 Pipeline Hazards: RAW Data Hazards 39

5.1. Expose in Instruction Set Architecture . . . . . . . . . . . . . . . . . 41

5.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.3. Hardware Bypassing/Forwarding . . . . . . . . . . . . . . . . . . . 43

5.4. RAW Data Hazards Through Memory . . . . . . . . . . . . . . . . . 47

6 Pipeline Hazards: Control Hazards 48

6.2. Hardware Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.3. Interrupts and Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 54

7 Pipeline Hazards: Structural Hazards 59

7.3. Hardware Duplication . . . . . . . . . . . . . . . . . . . . . . . . . . 62

8 Pipeline Hazards: WAW and WAR Name Hazards 62

8.1. Software Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

9 Summary of Processor Performance 65

10 Case Study: Transition from CISC to RISC 69

10.1. Example CISC: IBM 360/M30 . . . . . . . . . . . . . . . . . . . . . . 70

10.2. Example RISC: MIPS R2K . . . . . . . . . . . . . . . . . . . . . . . . 73

been updated in 2017-2019 by Prof. Christina Delimitrou. Download and use of this hand-out is permitted for individual educational non-commercial purposes only. Redistributioneither in part or in whole via both commercial or non-commercial means requires writtenpermission.

1. Processor Microarchitectural Design Patterns 2.0. Transactions and Steps

1. Processor Microarchitectural Design Patterns

TimeProgram

=Instructions

Program× Avg Cycles

Instruction× Time

• Instructions / program depends on source code, compiler, ISA• Avg cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation

Microarchitecture CPI Cycle Time

Single-Cycle Processor 1 longFSM Processor >1 shortPipelined Processor ≈1 short

1.1. Transactions and Steps

• We can think of each instruction as a transaction• Executing a transaction involves a sequence of steps

add addi mul lw sw jal jr bne

Fetch Instruction 3 3 3 3 3 3 3 3

Decode Instruction 3 3 3 3 3 3 3 3

Read Registers 3 3 3 3 3 3 3

Register Arithmetic 3 3 3 3 3 3

Read Memory 3

Write Memory 3

Write Registers 3 3 3 3 3

Update PC 3 3 3 3 3 3 3 3

1. Processor Microarchitectural Design Patterns1.2. Microarchitecture: Control/Datapath Split

1.2. Microarchitecture: Control/Datapath Split

Control Signals Status Signals

Control Unit

Datapath

imemreq_val

imemreq

imemresp

dmemreq_val

dmemreq

dmemresp

Memory

2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors

2. TinyRV1 Single-Cycle Processor

TimeProgram

=Instructions

Instruction× Time

Technology Constraints

• Assume technology wherelogic is not too expensive, sowe do not need to overlyminimize the number ofregisters and combinationallogic

• Assume multi-ported registerfile with a reasonable numberof ports is feasible

• Assume a dual-portedcombinational memory

Control Status

Control Unit

Datapath

<1 cyclecombinational

Memory

regfile

imemreq

imemresp

dmemreq

dmemresp

2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors

2.1. High-Level Idea for Single-Cycle Processors

Read Memory 3

Write Memory 3

Update PC 3 3 3 3 3 3 3 3

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePC

Single-Cycle

WriteReg

2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath

2.2. Single-Cycle Processor Datapath

regfile(read)

regfile(write)

0000000 000rs1rs2 rd 0110011

067111214151920242531ADD

add rd, rs1, rs2

R[rd] ← R[rs1] + R[rs2]PC ← PC + 4

regfile(read)

regfile(write)

000rs1 rd 0010011

06711121415192031

addi rd, rs1, imm

R[rd] ← R[rs1] + sext(imm)PC ← PC + 4

Implementing ADD and ADDI Instructions

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

0000001 000rs1rs2 rd 0110011

067111214151920242531MUL

mul rd, rs1, rs2

R[rd] ← R[rs1] × R[rs2]PC ← PC + 4

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

010rs1 rd 0000011

06711121415192031

lw rd, imm(rs1)

R[rd] ← M[ R[rs1] + sext(imm) ]PC ← PC + 4

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

010rs1 imm 0100011

06711121415192031

imm rs2

2425SW

sw rs2, imm(rs1)

M[ R[rs1] + sext(imm) ] ← R[rs2]PC ← PC + 4

imm = { inst[31:25], inst[11:7] }

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

+pc_sel

jalbr_targ

alu_func

rd 1101111

067111231

jal rd, imm

R[rd] ← PC + 4PC ← PC + sext(imm)

imm = { inst[31], inst[19:12],inst[20], inst[30:21], 0 }

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

pc_sel

jalbr_targ

alu_func

jr_targ

000rs1 1100111

06711121415192031

00000000000000000

jr rs1

PC ← R[rs1]

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

+pc_sel

jalbr_targ

alu_func

jr_targ

001rs1 1100011

06711121415192031

immimm

bne rs1, rs2, imm

if ( R[rs1] != R[rs2] ) PC ← PC + sext(imm)else PC ← PC + 4

imm = { inst[31], inst[7],inst[30:25], inst[11:8], 0 }

Adding a New Auto-Incrementing Load Instruction

Draw on the datapath diagram what paths we need to use as well asany new paths we will need to add in order to implement the followingauto-incrementing load instruction.

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

pc_sel

jalbr_targ

alu_func

jr_targ

000rs1 rd 0001011

06711121415192031

lw.ai rd, imm(rs1)

R[rd] ← M[ R[rs1] + sext(imm) ]R[rs1] ← R[rs1] + 4PC ← PC + 4

2. TinyRV1 Single-Cycle Processor 2.3. Single-Cycle Processor Control Unit

2.3. Single-Cycle Processor Control Unit

imem dmempc imm op1 alu wb rf req req

inst sel type sel func sel wen val val

add pc+4 – rf + alu 1 1 0

mul pc+4 – – – mul 1 1 0

lw pc+4 i imm + mem 1 1 1

jr jr i – – – 0 1 0

Need to factor eq status signal into pc_sel signal for BNE!

2.4. Analyzing Performance

TimeProgram

=Instructions

Program× Cycles

Instruction× Time

Cycles

• Instructions / program depends on source code, compiler, ISA• Cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation

2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance

Estimating cycle time

There are many paths through the design that start at a state elementand end at a state element. The “critical path” is the longest path acrossall of these paths. We can usually use a simple first-order static timingestimate to estimate the cycle time (i.e., the clock period and thus alsothe clock frequency).

regfile(read)

regfile(write)

pc_plus4

ir[24:20]

ir[19:15]

imemreq.addr

To control unit

imemresp.data

ir[11:7]

op2_sel

immgenir[31:7]

wb_sel

dmemreq.addr

dmemresp.data

dmemreq.data

imm_type

rf_wen

pc_sel

jalbr_targ

alu_func

jr_targ

• register read = 1τ

• register write = 1τ

• regfile read = 10τ

• regfile write = 10τ

• memory read = 20τ

• memory write = 20τ

• +4 unit = 4τ

• immgen = 2τ

• mux = 3τ

• multiplier = 20τ

• alu = 10τ

• adder = 8τ

2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance

Estimating execution time

Using our first-order equation for processor performance, how long innanoseconds will it take to execute the vector-vector add exampleassuming n is 64?

loop:lw x5, 0(x13)lw x6, 0(x14)add x7, x5, x6sw x7, 0(x12)addi x13, x12, 4addi x14, x14, 4addi x12, x12, 4addi x15, x15, -1bne x15, x0, loopjr x1

Using our first-order equation for processor performance, how long innanoseconds will it take to execute the mystery program assuming n is64 and that we find a match on the last element.

addi x5, x0, 0loop:lw x6, 0(x12)bne x6, x14, fooaddi x10, x5, 0jr x1

foo:addi x12, x12, 4addi x5, x5, 1bne x5, x13, loopaddi x10, x0, -1jr x1

3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors

3. TinyRV1 FSM Processor

TimeProgram

=Instructions

Instruction× Time

• Assume legacy technologywhere logic is expensive, sowe want to minimize thenumber of registers andcombinational logic

• Assume an (unrealistic)combinational memory

• Assume multi-ported registerfiles and memories are tooexpensive, these structurescan only have a singleread/write port

Control Status

Control Unit

Datapath

Memory

regfile

memreq

memresp

3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors

3.1. High-Level Idea for FSM Processors

Read Memory 3

Write Memory 3

Update PC 3 3 3 3 3 3 3 3

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePC

Single-Cycle

WriteReg

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePCF

WriteReg

3.2. FSM Processor Datapath

Implementing an FSM datapath requires thinking about the requiredFSM states, but we will defer discussion of how to implement thecontrol logic to the next section.

3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath

Implementing Fetch Sequence

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

(pseudo-control-signal syntax)

Implementing ADD Instruction

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

(pseudo-control-signal syntax)add rd, rs1, rs2

Full Datapath for TinyRV1 FSM Processor

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:

alu_bus_en

A − 4−4:

JR0 B0

ADDI Pseudo-Control-SignalFragment

addi rd, rs1, imm

MUL Instruction

mul rd, rs1, rs2

M0: A← RF[x0]M1: B← RF[rs1]M2: C← RF[rs2]M3: A← A +? B;

B← B << 1; C← C >> 1M4: A← A +? B;

B← B << 1; C← C >> 1...

M35: RF[rd]← A +? B; goto F0

LW Instruction

lw rd, imm(rs1)

L0: A← RF[rs1]L1: B← sext(imm_i)L2: memreq.addr← A + BL3: RF[rd]← RD; goto F0

SW Instruction

sw rs2, imm(rs1)

S0: WD← RF[rs2]S1: A← RF[rs1]S2: B← sext(imm_s)S3: memreq.addr← A + B; goto F0

JAL Instruction

jal rd, imm

JA0: RF[rd]← PCJA1: B← sext(imm_j)JA2: PC← A + B; goto F0

JR Instruction

jr rs1

JR0: PC← RF[rs1]; goto F0

BNE Instruction

bne rs1, rs2, imm

B0: A← RF[rs1]B1: B← RF[rs2]B2: B← sext(imm_b);

if A == B goto F0B3: A← PCB4: A← A − 4B5: PC← A + B; goto F0

Adding a Complex Instruction

FSM processors simplify adding complex instructions. New instructionsusually do not require datapath modifications, only additional states.

add.mm rd, rs1, rs2

M[ R[rd] ]←M[ R[rs1] ] + M[ R[rs2] ]

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

alu_bus_en

A − 4−4:

Adding a New Auto-Incrementing Load Instruction

Implement the following auto-incrementing load instruction usingpseudo-control-signal syntax. Modify the datapath if necessary.

lw.ai rd, imm(rs1)

R[rd]←M[ R[rs1] + sext(imm_i) ]; R[rs1]← R[rs1] + 4

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

alu_bus_en

A − 4−4:

3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit

3.3. FSM Processor Control Unit

JR0 B0

We will study three techniquesfor implementing FSM controlunits:

• Hardwired control units arehigh-performance, butinflexible

• Horizontal µcodingincreases flexibility, requireslarge control store

• Vertical µcoding is anintermediate design point

Hardwired FSM

ControlSignalLogic

StateTransition

Control Signals(22)

Status Signals(1)

Control signal output table for hardwired control unit

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

alu_bus_en

A − 4−4:

F0: memreq.addr← PC; A← PCF1: IR← RDF2: PC← A + 4; goto inst

A0: A← RF[rs1]A1: B← RF[rs2]A2: RF[rd]← A + B; goto F0

Bus Enables Register Enables Mux Func RF MReq

state pc ig alu rf rd pc ir a b c wd b c ig alu sel wen val op

F0 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 1 r

F1 0 0 0 0 1 0 1 0 0 0 0 – – – – – 0 0 –

F2 0 0 1 0 0 1 0 0 0 0 0 – – – +4 – 0 0 –

Vertically Microcoded FSM

ControlSignals

NextState

decoderopcodeF0

Next State Encoding

n : increment uPC by oned : dispatch based on opcodef : goto state F0b : goto state F0 if A == B

Control Signals (22)

muxsel

Status Signals(1)

• Use memory array (called the control store) instead of random logicto encode both the control signal logic and the state transition logic

• Enables a more systematic approach to implementing complexmulti-cycle instructions

• Microcoding can produce good performance if accessing the controlstore is much faster than accessing main memory

• Read-only control stores might be replaceable enabling in-fieldupdates, while read-write control stores can simplify diagnosticsand microcode patches

Control signal store for microcoded control unit

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

alu_bus_en

A − 4−4:

B0: A← RF[rs1]B1: B← RF[rs2]B2: B← sext(imm_b); if A == B goto F0

B3: A← PCB4: A← A − 4B5: PC← A + B; goto F0

Bus Enables Register Enables Mux Func RF MReq

state pc ig alu rf rd pc ir a b c wd b c ig alu sel wen val op next

B0 0 0 0 1 0 0 0 1 0 0 0 – – – – rs1 0 0 –

B1 0 0 0 1 0 0 0 0 1 0 0 b – – – rs2 0 0 –

B2 0 1 0 0 0 0 0 0 1 0 0 b – b cmp – 0 0 –

B3 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 0 –

B4 0 0 1 0 0 0 0 1 0 0 0 – – – −4 – 0 0 –

B5 0 0 1 0 0 1 0 0 0 0 0 – – – + – 0 0 –

3. TinyRV1 FSM Processor 3.4. Analyzing Performance

3.4. Analyzing Performance

TimeProgram

=Instructions

Program× Cycles

Instruction× Time

Cycles

Estimating cycle time

ir_enpc_en

pc_bus_en

alu_bus_en

memresp.data

memreq.addr

rd_bus_en

To control unit

RFrf_wen

rf_bus_en

alu_func

alu func

rf_addr_sel

rs1rs2rd

A + 4+4:

A + B+:

sextimm

immgen_bus_en

c_selc_en

A +? B+?:

memreq.data

imm_type

imm gen

immgen

A == Bcmp:

alu_bus_en

A − 4−4:

• register read/write = 1τ

• regfile read/write = 10τ

• mem read/write = 20τ

• immgen = 2τ

• mux = 3τ

• alu = 10τ

• 1b shifter = 1τ

• tri-state buf = 1τ

3. TinyRV1 FSM Processor 3.4. Analyzing Performance

Using our first-order equation for processor performance, how long inunits of τ will it take to execute the vector-vector add exampleassuming n is 64?

Using our first-order equation for processor performance, how long inunits of τ will it take to execute the mystery program assuming n is 64and that we find a match on the last element.

4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors

4. TinyRV1 Pipelined Processor

TimeProgram

=Instructions

Instruction× Time

• Assume modern technologywhere logic is cheap and fast(e.g., fast integer ALU)

• Assume multi-ported registerfiles with a reasonablenumber of ports are feasible

• Assume small amount of veryfast memory (caches) backedby large, slower memory

Control Status

Control Unit

Datapath

Memory

regfile

imemreq

imemresp

dmemreq

dmemresp

4.1. High-Level Idea for Pipelined Processors

• Anne, Brian, Cathy, and Dave each have one load of clothes• Washing, drying, folding, and storing each take 30 minutes

7pm 8pm 9pm 10pm 11pm 12am 1am 2am 3am

Anne'sLoad

Ben'sLoad

Cathy'sLoad

Dave'sLoad

Pipelined Laundry with Slow Dryers

Anne'sLoad

7pm 8pm 9pm

Ben'sLoad

Cathy'sLoad

Dave'sLoad

7pm 8pm 9pm 10pm 11pm

Anne'sLoad

Ben'sLoad

Cathy'sLoad

Dave'sLoad

Fixed Time-Slot Laundry

Pipelined Laundry10pm 12am

Pipelining lessons

• Multiple transactions operate simultaneously using different resources• Pipelining does not help the transaction latency• Pipelining does help the transaction throughput• Potential speedup is proportional to the number of pipeline stages• Potential speedup is limited by the slowest pipeline stage• Potential speedup is reduced by time to fill the pipeline

Applying pipelining to processors

Read Memory 3

Write Memory 3

Update PC 3 3 3 3 3 3 3 3

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePC

Single-Cycle

WriteReg

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePCF

WriteReg

FetchInst

DecodeInst

RegArith

ReadMem

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

RegArith

WriteReg

ReadReg

UpdatePC

FetchInst

DecodeInst

UpdatePC

Pipelined

WriteReg

4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit

4.2. Pipelined Processor Datapath and Control Unit• Incrementally develop an unpipelined datapath• Keep data flowing from left to right• Position control signal table early in the diagram• Divide datapath/control into stages by inserting pipeline registers• Keep the pipeline stages roughly balanced• Forward arrows should avoid “skipping” pipeline registers• Backward arrows will need careful consideration

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

regfile(read)ir[24:20]

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgenalu

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

imemreq.addr

imemresp.data

dmemreq.addr

dmemreq.data

dmemresp.data

F D X M W

addi x1, x2, 1

addi x3, x4, 1

addi x5, x6, 1

Adding a new auto-incrementing load instruction

Draw on the above datapath diagram what paths we need to use as wellas any new paths we will need to add in order to implement thefollowing auto-incrementing load instruction.

lw.ai rd, imm(rs1)

R[rd]←M[ R[rs1] + sext(imm) ]; R[rs1]← R[rs1] + 4

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgenalu

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

F Stage D Stage X Stage M Stage W Stage

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

val_DX val_XM val_MWval_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

Pipeline diagrams

addi x1, x2, 1

addi x3, x4, 1

addi x5, x6, 1

What would be the total execution time if these three instructions wererepeated 10 times?

Hazards occur when instructions interact with each other in pipeline

• RAW Data Hazards: An instruction depends on a data valueproduced by an earlier instruction

• Control Hazards: Whether or not an instruction should be executeddepends on a control decision made by an earlier instruction

• Structural Hazards: An instruction in the pipeline needs a resourcebeing used by another instruction in the pipeline

• WAW and WAR Name Hazards: An instruction in the pipeline iswriting a register that an earlier instruction in the pipeline is eitherwriting or reading

Stalling and squashing instructions

• Stalling: An instruction originates a stall due to a hazard, causing allinstructions earlier in the pipeline to also stall. When the hazard isresolved, the instruction no longer needs to stall and the pipelinestarts flowing again.

• Squashing: An instruction originates a squash due to a hazard, andsquashes all previous instructions in the pipeline (but not itself). Werestart the pipeline to begin executing a new instruction sequence.

Control logic with no stalling and no squashing

Stage ADatapathLogic

Stage BDatapathLogic

Stage CDatapathLogic

Stage AControlLogic

Stage BControlLogic

Stage CControlLogic

always_ff @( posedge clk )if ( reset )

val_B <= 0else

val_B <= next_val_A

next_val_B = val_B

Control logic with stalling and no squashing

Stage AControlLogic

Stage ADatapath

Stage BControlLogic

Stage CControlLogic

Stage BDatapath

Stage CDatapath

control, ostall signals

reg_en_B

val_Bnext_val_B

reg_en_B = !stall_B

val_B <= 0else if ( reg_en_B )

val_B <= next_val_A

ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )

stall_B = val_B && ( ostall_B || ostall_C || ... )

next_val_B = val_B && !stall_B

ostall_B Originating stall due to hazards detected in B stage.

stall_B Should we actually stall B stage? Factors in ostalls due to hazardsand ostalls from later pipeline stages.

next_val_B Only send transaction to next stage if transaction in B stage is validand we are not stalling B stage.

Control logic with stalling and squashing

Stage AControlLogic

Stage ADatapath

Stage BControlLogic

Stage CControlLogic

Stage BDatapath

Stage CDatapath

control, ostall signals

reg_en_B

val_Bnext_val_B

control, ostall, osquash signals

reg_en_B = !stall_B|| squash_B

val_B <= 0else if ( reg_en_B )

val_B <= next_val_A

ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )

stall_B = val_B && ( ostall_B || ostall_C || ... )

osquash_B = val_B && !stall_B && ( osquash_hazard1_B || ... )

squash_B = val_B && ( osquash_C || ... )

next_val_B = val_B && !stall_B && !squash_B

ostall_B Originating stall due to hazards detected in B stage.

stall_B Should we actually stall B stage? Factors in ostalls due to hazardsand ostalls from later pipeline stages.

osquash_B Originating squash due to hazards detected in B stage. If this stageis stalling, do not originate a squash.

squash_B Should we squash B stage? Factors in the originating squashesfrom later pipeline stages. An originating squash from B stagemeans to squash all stages earlier than B, so osquash_B is notfactored into squash_B.

next_val_B Only send transaction to next stage if transaction in B stage is validand we are not stalling or squashing B stage.

5. Pipeline Hazards: RAW Data Hazards

RAW data hazards occur when one instruction depends on a data valueproduced by a preceding instruction still in the pipeline. We usearchitectural dependency arrows to illustrate RAW dependencies inassembly code sequences.

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

Using pipeline diagrams to illustrate RAW hazards

We use microarchitectural dependency arrows to illustrate RAWhazards on pipeline diagrams.

F D X M W

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

5. Pipeline Hazards: RAW Data Hazards

Approaches to resolving data hazards

• Expose in Instruction Set Architecture: Expose data hazards in ISAforcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)

• Hardware Scheduling: Hardware dynamically schedulesinstructions to avoid RAW hazards, potentially allowinginstructions to execute out of order

• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished producingdata value; software scheduling can still be used to avoid stalling(i.e., software scheduling for performance)

• Hardware Bypassing/Forwarding: Hardware allows values to besent from an earlier instruction to a later instruction before theearlier instruction has left the pipeline (sometimes called forwarding)

• Hardware Speculation: Hardware guesses that there is no hazardand allows later instructions to potentially read invalid data; detectswhen there is a problem, squashes and then re-executes instructionsthat operated on invalid data

5. Pipeline Hazards: RAW Data Hazards 5.1. Expose in Instruction Set Architecture

5.1. Expose in Instruction Set Architecture

Insert nops to delay read of earlierwrite. These nops count as realinstructions increasinginstructions per program.

addi x1, x2, 1nopnopnopaddi x3, x1, 1nopnopnopaddi x4, x3, 1

Insert independent instructions todelay read of earlier write, andonly use nops if there is notenough useful work

addi x1, x2, 1addi x6, x7, 1addi x8, x9, 1nopaddi x3, x1, 1nopnopnopaddi x4, x3, 1

Pipeline diagram showing exposing RAW data hazards in the ISA

addi x1, x2, 1

addi x6, x7, 1

addi x8, x9, 1

addi x3, x1, 1

addi x4, x3, 1

Note: If hazard is exposed in ISA, software scheduling is required forcorrectness! A scheduling mistake can cause undefined behavior.

5. Pipeline Hazards: RAW Data Hazards 5.2. Hardware Stalling

5.2. Hardware Stalling

Hardware includes control logic that freezes later instructions (in frontof pipeline) until earlier instruction (in back of pipeline) has finishedproducing data value.

Pipeline diagram showing hardware stalling for RAW data hazards

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

Note: Software scheduling is not required for correctness, but canimprove performance! Programmer or compiler schedules independentinstructions to reduce the number of cycles spent stalling.

Modifications to datapath/control to support hardware stalling

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

imemreq.addr

imemresp.data

dmemreq.addr

dmemreq.data

dmemresp.data

reg_en_D

reg_en_F

CSig TableStall Logic

5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding

Deriving the stall signal

rs1_en

rs2_en

rf_wen

ostall_waddr_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X

&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)

ostall_waddr_M_rs1_D =val_D && rs1_en_D && val_M && rf_wen_M

&& (inst_rs1_D == rf_waddr_M) && (rf_waddr_M != 0)

ostall_waddr_W_rs1_D =val_D && rs1_en_D && val_W && rf_wen_W

&& (inst_rs1_D == rf_waddr_W) && (rf_waddr_W != 0)

... similar for ostall signals for rs2 source register ...

ostall_D = val_D&& ( ostall_waddr_X_rs1_D || ostall_waddr_X_rs2_D

|| ostall_waddr_M_rs1_D || ostall_waddr_M_rs2_D|| ostall_waddr_W_rs1_D || ostall_waddr_W_rs2_D )

5.3. Hardware Bypassing/Forwarding

Hardware allows values to be sent from an earlier instruction (in backof pipeline) to a later instruction (in front of pipeline) before the earlierinstruction has left the pipeline. Sometimes called “forwarding”.

Pipeline diagram showing hardware bypassing for RAW data hazards

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

Adding single bypass path to support limited hardware bypassing

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgenalu

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

reg_en_D

reg_en_F

CSig TableStall & Bypass

op1_byp_sel_D

Deriving the bypass and stall signals

ostall_waddr_X_rs1_D = 0bypass_waddr_X_rs1_D =

val_D && rs1_en_D && val_X && rf_wen_X&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)

Pipeline diagram showing multiple hardware bypass paths

addi x2, x10, 1

addi x2, x11, 1

addi x1, x2, 1

addi x3, x4, 1

addi x5, x3, 1

add x6, x1, x3

sw x5, 0(x1)

Adding all bypass path to support full hardware bypassing

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

reg_en_D

reg_en_F

op1_byp_sel_D

bypass_from_Xbypass_from_Mbypass_from_W

op2_byp_sel_D

Handling load-use RAW dependencies

ALU-use latency is only one cycle, but load-use latency is two cycles.

lw x1, 0(x2)

addi x3, x1, 1

lw x1, 0(x2)

addi x3, x1, 1

ostall_load_use_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X

&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X == lw)

ostall_load_use_X_rs2_D =val_D && rs2_en_D && val_X && rf_wen_X

&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X == lw)

ostall_D =val_D && ( ostall_load_use_X_rs1_D || ostall_load_use_X_rs2_D )

bypass_waddr_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X

&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X != lw)

bypass_waddr_X_rs2_D =val_D && rs2_en_D && val_X && rf_wen_X

&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X != lw)

5. Pipeline Hazards: RAW Data Hazards 5.4. RAW Data Hazards Through Memory

Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assemblysequence would execute on a fully bypassed pipelined TinyRV1processor. Include microarchitectural dependency arrows to illustratehow data is transferred along various bypass paths.

lw x1, 0(x2)

lw x3, 0(x4)

add x5, x1, x3

sw x5, 0(x6)

addi x2, x2, 4

addi x4, x4, 4

addi x6, x6, 4

addi x7, x7, -1

bne x7, x0, loop

5.4. RAW Data Hazards Through Memory

So far we have only studied RAW data hazards through registers, butwe must also carefully consider RAW data hazards through memory.

sw x1, 0(x2)lw x3, 0(x4) # RAW dependency occurs if R[x2] == R[x4]

sw x1, 0(x2)

lw x3, 0(x4)

6. Pipeline Hazards: Control Hazards

Control hazards occur when whether or not an instruction should beexecuted depends on a control decision made by an earlier instructionWe use architectural dependency arrows to illustrate controldependencies in assembly code sequences.

Static Instr Sequence

addi x1, x0, 1jal x0, fooopAopB

foo: addi x2, x3, 1bne x0, x1, baropCopDopE

bar: addi x4, x5, 1

Dynamic Instr Sequence

addi x1, x0, 1jal x0, fooaddi x2, x3, 1bne x0, x1, baraddi x4, x5, 1

Using pipeline diagrams to illustrate control hazards

We use microarchitectural dependency arrows to illustrate controlhazards on pipeline diagrams.

addi x1, x0, 1

jal x0, foo

addi x2, x3, 1

bne x0, x1, bar

addi x4, x5, 1

6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture

The jump resolution latency and branch resolution latency are thenumber of cycles we need to delay the fetch of the next instruction inorder to avoid any kind of control hazard. Jump resolution latency istwo cycles, and branch resolution latency is three cycles.

addi x1, x0, 1

jal x0, foo

addi x2, x3, 1

bne x0, x1, bar

addi x4, x5, 1

Approaches to resolving control hazards

• Expose in Instruction Set Architecture: Expose control hazards inISA forcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)

• Software Predication: Programmer or compiler converts controlflow into data flow by using instructions that conditionally executebased on a data value

• Hardware Speculation: Hardware guesses which way the controlflow will go and potentially fetches incorrect instructions; detectswhen there is a problem and re-executes instructions that are alongthe correct control flow

• Software Hints: Programmer or compiler provides hints aboutwhether a conditional branch will be taken or not taken, andhardware can use these hints for more efficient hardware speculation

6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture

6.1. Expose in Instruction Set Architecture

Expose branch delay slots as part of the instruction set. Branch delayslots are instructions that follow a jump or branch and are alwaysexecuted regardless of whether a jump or branch is taken or not taken.Compiler tries to insert useful instructions, otherwise inserts nops.

addi x1, x0, 1jal x0, foonopopAopB

foo: addi x2, x3, 1bne x0, x1, barnopnopopCopDopE

bar: addi x4, x5, 1

Assume we modify the TinyRV1instruction set to specify that JAL,and JR instructions have asingle-instruction branch delayslot (i.e., one instruction after aJAL and JR is always executed)and the BNE instruction has atwo-instruction branch delay slot(i.e., two instructions after a BNEare always executed).

Pipeline diagram showing using branch delay slots for control hazards

addi x1, x0, 1

jal x0, foo

addi x2, x3, 1

bne x0, x1, bar

addi x4, x5, 1

6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation

6.2. Hardware Speculation

Hardware guesses which way the control flow will go and potentiallyfetches incorrect instructions; detects when there is a problem andre-executes instructions the instructions that are along the correctcontrol flow. For now, we will only consider a simple branch predictionscheme where the hardware always predicts not taken.

Pipeline diagram when branch is not taken

addi x1, x0, 1

jal x0, foo

addi x2, x3, 1

bne x0, x1, bar

Pipeline diagram when branch is taken

addi x1, x0, 1

jal x0, foo

addi x2, x3, 1

bne x0, x1, bar

addi x4, x5, 1

Modifications to datapath/control to support hardware speculation

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D+

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

reg_en_D

reg_en_F

op1_byp_sel_D

op2_byp_sel_D

pc_sel_F

CSig TableStall, Bypass, &Squash Logic

Deriving the squash signals

osquash_j_D = (op_D == jal) || (op_D == jr)osquash_br_X = (op_X == bne) && !eq_X

Our generic stall/squash scheme gives priority to squashes over stalls.

osquash_D = val_D && !stall_D && osquash_j_Dsquash_D = val_D && osquash_X

osquash_X = val_D && !stall_X && osquash_br_Xsquash_X = 0

Important: PC select logic must give priority to older instructions(i.e., prioritize branches over jumps)! Good quiz question?

Pipeline diagram for simple assembly sequence

Draw a pipeline diagram illustrating how the following assemblysequence would execute on a fully bypassed pipelined TinyRV1processor that uses hardware speculation which always predictsnot-taken. Unlike the “standard” TinyRV1 processor, you should alsoassume that we add a single-instruction branch delay slot to theinstruction set. So this processor will partially expose the controlhazard in the instruction, but also use hardware speculation. Includemicroarchitectural dependency arrows to illustrate both data andcontrol flow.

addi x1, x2, 1bne x0, x3, foo # assume R[rs] != 0addi x4, x5, 1 # instruction is in branch delay slotaddi x6, x7, 1...

foo:add x8, x1, x4addi x9, x1, 1

6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions

6.3. Interrupts and Exceptions

Interrupts and exceptions alter the normal control flow of the program.They are caused by an external or internal event that needs to beprocessed by the system, and these events are usually unexpected orrare from the program’s point of view.

• Asynchronous Interrupts

– Input/output device needs to be serviced– Timer has expired– Power distruption or hardware failure

• Synchronous Exceptions

– Undefined opcode, privileged instruction– Arithmetic overflow, floating-point exception– Misaligned memory access for instruction fetch or data access– Memory protection violation– Virtual memory page faults– System calls (traps) to jump into the operating system kernel

Interrupts and Exception Semantics

• Interrupts are asynchronous with respect to the program, so themicroarchitecture can decide when to service the interrupt

• Exceptions are synchronous with respect to the program, so theymust be handled immediately

• To handle an interrupt or exception the hardware/software must:

– Stop program at current instruction (I), ensure previous insts finished– Save cause of interrupt or exception in privileged arch state– Save the PC of the instruction I in a special register (EPC)– Switch to privileged mode– Set the PC to the address of either the interrupt or the exception handler– Disable interrupts– Save the user architectural state– Check the type of interrupt or exception

– Handle the interrupt or exception

– Enable interrupts– Switch to user mode– Set the PC to EPC if I should be restarted– Potentially set PC to EPC+4 if we should skip I

Handling a misaligned data address and syscall exceptions

Static Instr Sequence

addi x1, x0, 0x2001lw x2, 0(x1)syscallopBopC...

exception_hander:opD # disable interruptsopE # save user registersopF # check exception typeopG # handle exceptionopH # enable interruptsaddi EPC, EPC, 4eret

Dynamic Instr Sequence

addi x1, x0, 0x2001lw x2, 0(x1) (excep)opDopEopFopGopHaddi EPC, EPC, 4eretsyscall (excep)opDopEopF

Interrupts and Exceptions in a RISC-V Pipelined Processor

F D X M W

Inst AddressExceptions

IllegalInstruction

ArithmeticOverflow

Data AddressExceptions

• How should we handle a single instruction which generatesmultiple exceptions in different stages as it goes down the pipeline?

– Exceptions in earlier pipeline stages override later exceptions for a giveninstruction

• How should we handle multiple instructions generating exceptionsin different stages at the same or different times?

– We always want the execution to appear as if we have completelyexecuted one instruction before going onto the next instruction

– So we want to process the exception corresponding to the earliestinstruction in program order first

– Hold exception flags in pipeline until commit point

– Commit point is after all exceptions could be generated but before anyarchitectural state has been updated

– To handle an exception at the commit point: update cause and EPC,squash all stages before the commit point, and set PC to exception handler

• How and where to handle external asynchronous interrupts?

– Inject asynchronous interrupts at the commit point

– Asynchronous interrupts will then naturally override exceptions causedby instructions earlier in the pipeline

Modifications to datapath/control to support exceptions

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D

alu_fn_Xop2_sel_D

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

btargjtarg

reg_en_D

reg_en_F

op1_byp_sel_D

bypass_from_X

bypass_from_M

bypass_from_W

op2_byp_sel_D

pc_sel_F

rf_wen1_W

rf_waddr1_W

xcept_XMxcept_DXxcept_FD

ExceptionHandling Logic

xcepthandler

Deriving the squash signals

osquash_j_D = (op_D == jal) || (op_D == jr)osquash_br_X = (op_X == bne) && !eq_Xosquash_xcept_M = exception_M

Control logic needs to redirect the front end of the pipeline just like for ajump or branch. Again, squashes take priority over stalls, and PC selectlogic must give priority to older instructions (i.e., priortize exceptions,over branches, over jumps)!

Pipeline diagram of exception handling

addi x1, x0, 0x2001lw x2, 0(x1) # assume causes misaligned address exceptionsyscall # causes a syscall exceptionopBopC...

exception_hander:opD # disable interruptsopE # save user registersopF # check exception typeopG # handle exceptionopH # enable interruptsaddi EPC, EPC, 4eret

7. Pipeline Hazards: Structural Hazards

Structural hazards occur when an instruction in the pipeline needs aresource being used by another instruction in the pipeline. The TinyRV1processor pipeline is specifically designed to avoid structural hazards.

Let’s introduce a structural hazard by allowing instructions that do notdo any real work in the M stage (i.e., non-memory instructions) toeffectively skip that stage. This would require adding an extra pathwhich “skips over” the pipeline reigster between the X and M stagesand connects directly to the writeback mux at the end of the M stage.For non-memory instructions we set wb_sel_M to choose the value fromthe end of the X stage, while for memory instructions we set wb_sel_Mto choose the value coming back from memory.

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

reg_en_D

reg_en_F

op1_byp_sel_D

op2_byp_sel_D

pc_sel_F

7. Pipeline Hazards: Structural Hazards 7.1. Expose in Instruction Set Architecture

Using pipeline diagrams to illustrate structural hazards

We use structural dependency arrows to illustrate structural hazards.

addi x1, x2, 1

addi x3, x4, 1

lw x5, 0(x6)

addi x7, x8, 1

Note that the key shared resources that are causing the structuralhazard are the pipeline registers at the end of the M stage. We cannotwrite these pipeline registers with the transaction that is in the X stageand also the transaction that is the M stage at the same time.

Approaches to resolving structural hazards

• Expose in Instruction Set Architecture: Expose structural hazards inISA forcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)

• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished using theshared resource; software scheduling can still be used to avoidstalling (i.e., software scheduling for performance)

• Hardware Duplication: Add more hardware so that each instructioncan access separate resources at the same time

7.1. Expose in Instruction Set ArchitectureInsert independent instructions or nops to delay non-memoryinstructions if they follow a LW or SW instruction.

7. Pipeline Hazards: Structural Hazards 7.2. Hardware Stalling

Pipeline diagram showing exposing structural hazards in the ISA

addi x1, x2, 1

addi x3, x4, 1

lw x5, 0(x6)

addi x7, x8, 1

7.2. Hardware StallingHardware includes control logic that stalls a non-memory instruction ifit follows a LW or SW instruction.

Pipeline diagram showing hardware stalling for structural hazards

addi x1, x2, 1

addi x3, x4, 1

lw x5, 0(x6)

addi x7, x8, 1

ostall_wport_hazard_D = val_D && !mem_inst_D && val_X && mem_inst_X

where mem_inst is true for a LW or SW instruction and false otherwise.Stall far before the structural hazard actually occurs, because we knowexactly how instructions move down the pipeline. Also possible to usedynamic arbitration in the back-end of the pipeline.

7. Pipeline Hazards: Structural Hazards 7.3. Hardware Duplication

7.3. Hardware Duplication

Add more pipeline registers at the end of M stage and a second writeport so that non-memory and memory instructions can writeback to theregister file at the same time.

pc_plus4

result_sel_X

ir[31:0]

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

lop1_sel_D

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

mem_result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

non_mem_result_MW

reg_en_D

reg_en_F

op1_byp_sel_D

op2_byp_sel_D

pc_sel_F

rf_wen1_W

rf_waddr1_W

Does allowing early writeback help performance in the first place?

addi x1, x2, 1

addi x3, x1, 1

addi x4, x3, 1

addi x5, x4, 1

addi x6, x5, 1

addi x7, x6, 1

8. Pipeline Hazards: WAW and WAR Name Hazards

WAW dependencies occur when an instruction overwrites a registerthan an earlier instruction has already written. WAR dependenciesoccur when an instruction writes a register than an earlier instructionneeds to read. We use architectural dependency arrows to illustrateWAW and WAR dependencies in assembly code sequences.

mul x1, x2, x3

addi x4, x1, 1

addi x1, x5, 1

WAW name hazards occur when an instruction in the pipeline writes aregister before an earlier instruction (in back of the pipeline) has had achance to write that same register.

WAR name hazards occur when an instruction in the pipeline writes aregister before an earlier instuction (in back of pipeline) has had achance to read that same register.

The TinyRV1 processor pipeline is specifically designed to avoid anyWAW or WAR name hazards. Instructions always write the registerfilein-order in the same stage, and instructions always read registers in thefront of the pipeline and write registers in the back of the pipeline.

Let’s introduce a WAW name hazard by using an iterative variablelatency multiplier, and allowing other instructions to continueexecuting while the multiplier is working.

8. Pipeline Hazards: WAW and WAR Name Hazards 8.1. Software Renaming

Using pipeline diagrams to illustrate WAW name hazards

We use microarchitectural dependency arrows to illustrate WAWhazards on pipeline diagrams.

mul x1, x2, x3

addi x4, x6, 1

addi x1, x5, 1

Approaches to resolving structural hazards

• Software Renaming: Programmer or compiler changes the registernames to avoid creating name hazards

• Hardware Renaming: Hardware dynamically changes the registernames to avoid creating name hazards

• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished either writingor reading the problematic register name

8.1. Software Renaming

As long as we have enough architectural registers, renaming registers insoftware is easy. WAW and WAR dependencies occur because we havea finite number of architectural registers.

mul x1, x2, x3addi x4, x1, 1addi x7, x5, 1

9. Summary of Processor Performance 8.2. Hardware Stalling

8.2. Hardware Stalling

Simplest approach is to add stall logic in the decode stage similar towhat the approach used to resolve other hazards.

mul x1, x2, x3

addi x4, x6, 1

addi x1, x5, 1

ostall_struct_hazard_D = val_D && (op_D == MUL) && !imul_rdy_D

ostall_waw_hazard_D =val_D && rf_wen_D && val_Z && rf_wen_Z

&& (rf_waddr_D == rf_waddr_Z) && (rf_waddr_Z != 0)

9. Summary of Processor Performance

TimeProgram

=Instructions

Program× Cycles

Instruction× Time

Cycles

Results for vector-vector add example

Microarchitecture Inst CPI Cycle Time Exec Time

Single-Cycle Processor 576 1.0 74 τ 43 kτ

FSM Processor 576 6.7 40 τ 154 kτ

Pipelined Processor 576

Estimating cycle time for pipelined processor

pc_plus4

result_sel_X

ir[31:0]

wb_sel_M

pc_F +4

ir[19:15]

regfile(write)

rf_wen_W

rf_waddr_W

regfile(write)

Control SignalTable

imemreqaddr

imemrespdata

dmemreqaddr

dmemreqdata

dmemrespdata

imm_type

immgen

op1_sel_D+

alu_fn_Xop2_sel_D

pc_sel_F

jbtarg

result_XM

cs_DX cs_XM cs_MW

result_XM

result_MW

pc_F pc_FD

ControlLogic

op1_DX

btarg_DX

op2_DX

always pc_plus4

btargjtarg

reg_en_D

reg_en_F

op1_byp_sel_D

op2_byp_sel_D

pc_sel_F

• register read = 1τ

• register write = 1τ

• regfile read = 10τ

• regfile write = 10τ

• memory read = 20τ

• memory write = 20τ

• +4 unit = 4τ

• immgen = 2τ

• mux = 3τ

• multiplier = 20τ

• alu = 10τ

• adder = 8τ

Using our first-order equation for processor performance, how long in τwill it take to execute the vvadd example assuming n is 64?

Using our first-order equation for processor performance, how long in τwill it take to execute the mystery program assuming n is 64 and thatwe find a match on the last element.

10. Case Study: Transition from CISC to RISC

• Microcoding thrived in the 1970’s

– ROMs significantly faster than DRAMs– For complex instruction sets, microcode was cheaper and simpler– New instructions supported without modifying datapath– Fixing bugs in controller is easier– ISA compatibility across models relatively straight-forward

— Maurice Wilkes, 1954

10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30

10.1. Example CISC: IBM 360/M30

M30 M40 M50 M65

Datapath width (bits) 8 16 32 64µinst width (bits) 50 52 85 87µcode size (1K µinsts) 4 4 2.75 2.75µstore technology CCROS TROS BCROS BCROSµstore cycle (ns) 750 625 500 200Memory cycle (ns) 1500 2500 2000 750Rental fee ($K/month) 4 7 15 35

TROS = transformer read-only storage (magnetic storage)BCROS = balanced capacitor read-only storage (capacitive storage)

CCROS = card capacitor read-only storage (metal punch cards, replace in field)

Only the fastest models (75,95) were hardwired

IBM 360/M30 microprogram for register-register logical OR

Fetch first byteof operands

Instruction Fetch

Writeback Result

Prepare for Next Byte

IBM 360/M30 microprogram for register-register binary ADD

Analyzing Microcoded Machines

• John Cocke and group at IBM

– Working on a simple pipelined processor, 801, and advanced compilers

– Ported experimental PL8 compiler to IBM 370, and only used simpleregister-register and load/store instructions similar to 801

– Code ran faster than other existing compilers that used all 370instructions! (up to 6 MIPS, whereas 2 MIPS considered good before)

• Joel Emer and Douglas Clark at DEC

– Measured VAX-11/780 using external hardware– Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine– 20% of VAX instrs = 60% of µcode, but only 0.2% of the dynamic execution

• VAX 8800, high-end VAX in 1984

– Control store: 16K×147b RAM, Unified Cache: 64K×8b RAM– 4.5×more microstore RAM than cache RAM!

From CISC to RISC

• Key changes in technology constraints

– Logic, RAM, ROM all implemented with MOS transistors– RAM ≈ same speed as ROM

• Use fast RAM to build fast instruction cache of user-visibleinstructions, not fixed hardware microfragments

– Change contents of fast instruction memory to fit what app needs

• Use simple ISA to enable hardwired pipelined implementation

– Most compiled code only used a few of CISC instructions– Simpler encoding allowed pipelined implementations– Load/Store Reg-Reg ISA as opposed to Mem-Mem ISA

• Further benefit with integration

– Early 1980’s→ fit 32-bit datapath, small caches on single chip– No chip crossing in common case allows faster operation

ROM for

μInst

Decoder

User PC

RAM for

Instr Cache

"Larger"

Decoder

Vertical μCode

ControllerRISC

Controller

10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K

10.2. Example RISC: MIPS R2K

• MIPS R2K is one of the first popularpipelined RISC processors

• MIPS R2K implements the MIPS Iinstruction set

• MIPS = Microprocessor withoutInterlocked Pipeline Stages

MIPSII

MIPSIII

MIPSIV

MIPS64

MIPS32

MIPS16

• MIPS I used software scheduling to avoid some RAW hazards byincluding a single-instruction load-use delay slot

• MIPS I used software scheduling to avoid some control hazards byincluding a single-instruction branch delay slot

One-Instr Branch Delay Slot

addiu r1, r2, 1j fooaddiu r3, r4, 1 # BDS...

foo:addiu r5, r6, 1bne r7, r8, baraddiu r9, r10, 1 # BDS...

Present in all MIPS instructionsets; not possible to depricate andstill enable legacy code to executeon new microarchitectures

One-Instr Load-Use Delay Slot

lw r1, 0(r2)lw r3, 0(r4)addiu r2, r2, 4 # LDSaddu r5, r1, r3

Deprecated in MIPS II instructionset; legacy code can still executeon new microarchitectures, butcode using the MIPS II instructionset can rely in hardware stalling

MIPS R2K Microarchitecture

The pipelined datapath and controlwere located on a single die. Cachecontrol and memory management unitwere also integrated on-die, but theactual tag and data storage for thecache was located off-chip.

Control Unit Datapath

I$ Controller D$ Controller

I$ (4-32KB) D$ (4-32KB)

Used two-phase clocking to enable five pipeline stages to fit into fourclock cycles. This avoided the need for explicit bypassing from the Wstage to the end of the D stage. 1.2 The MIPS Five-Stage Pipeline 5

Instruction 1

Instruction 2

Instruction 3

RDIF ALU WB

RD ALU MEM WB

Instruction sequence

IFfrom

I-cache

MEM from

D-cache

RD from

registerfile

FIGURE 1.2 MIPS five-stage pipeline.

fetch data from the cache; a cache miss is a relatively rare event and we can juststop the CPU when it happens (though cleverer CPUs find more useful thingsto do).

The MIPS architecture was planned with separate instruction and datacaches, so it can fetch an instruction and read or write a memory variable simul-taneously.

CISC architectures have caches too, but they’re most often afterthoughts,fitted in as a feature of the memory system. A RISC architecture makes moresense if you regard the caches as very much part of the CPU and tied firmly intothe pipeline.

1.2 The MIPS Five-Stage Pipeline

The MIPS architecture is made for pipelining, and Figure 1.2 is close to theearliest MIPS CPUs and typical of many. So long as the CPU runs from thecache, the execution of every MIPS instruction is divided into five phases, calledpipestages, with each pipestage taking a fixed amount of time. The fixed amountof time is usually a processor clock cycle (though some actions take only halfa clock, so the MIPS five-stage pipeline actually occupies only four clockcycles).

All instructions are rigidly defined so they can follow the same sequenceof pipestages, even where the instruction does nothing at some stage. The netresult is that, so long as it keeps hitting the cache, the CPU starts an instructionevery clock cycle.

Two-phase clocking enabled a single-cycle branch resolution latencysince register read, branch address generation, and branch comparisoncan fit in a single cycle.

1.5 MIPS Compared with CISC Architectures 27

IF RD ALU MEM WB

WBMEMALURDIF

RD MEM WBBranch

instruction

Branchdelay

Branchtarget

Branchaddress

FIGURE 1.3 The pipeline and branch delays.

on a MIPS CPU without using some registers). For a program runningin any system that takes interrupts or traps, the values of these registersmay change at any time, so you’d better not use them.

1.5.4 Programmer-Visible Pipeline Effects

So far, this has all been what you might expect from a simplified CPU. However,making the instruction set pipeline friendly has some stranger effects as well,and to understand them we’re going to draw some pictures.

Delayed branches: The pipeline structure of the MIPS CPU (Figure 1.3)means that when a jump/branch instruction reaches the execute phaseand a new program counter is generated, the instruction after the jumpwill already have been started. Rather than discard this potentially usefulwork, the architecture dictates that the instruction after a branch mustalways be executed before the instruction at the target of the branch. Theinstruction position following any branch is called the branch delay slot.If nothing special was done by the hardware, the decision to branch ornot, together with the branch target address, would emerge at the endof the ALU pipestage—by which time, as Figure 1.3 shows, you’re toolate to present an address for an instruction in even the next-but-onepipeline slot.But branches are important enough to justify special treatment, and youcan see from Figure 1.3 that a special path is provided through the ALU tomake the branch address available half a clock cycle early. Together withthe odd half-clock-cycle shift of the instruction fetch stage, that meansthat the branch target can be fetched in time to become the next but one,

MIPS R2K VLSI Design

Process: 2 µm, two metal layersClock Frequency: 8–15 MHzSize: 110K transistors, 80 mm2

Ratio of

-- H&P, Appendix J, from Bhandarkar and Clark, 1991

Performance Ratio

Instructions Executed Ratio

CPI Ratio

eqntott li

2x more instr

6x lower CPI

2-4x higher perf

CISC/RISC Convergence

by Linley Gwennap

Not to be left out in the move to thenext generation of RISC, MIPS Tech-nologies (MTI) unveiled the design ofthe R10000, also known as T5. As thespiritual successor to the R4000, thenew design will be the basis of high-end

MIPS processors for some time, at least until 1997. Byswapping superpipelining for an aggressively out-of-order superscalar design, the R10000 has the potentialto deliver high performance throughout that period.

The new processor uses deep queues decouple theinstruction fetch logic from the execution units. Instruc-tions that are ready to execute can jump ahead of thosewaiting for operands, increasing the utilization of the ex-ecution units. This technique, known as out-of-order ex-ecution, has been used in PowerPC processors for sometime (see 081402.PDF ), but the new MIPS design is themost aggressive implementation yet, allowing more in-structions to be queued than any of its competitors.

Taking advantage of its experience with the 200-MHz R4400, MTI was able to streamline the design andexpects it to run at a high clock rate. Speaking at theMicroprocessor Forum, MTI’s Chris Rowen said that thefirst R10000 processors will reach a speed of 200 MHz,50% faster than the PowerPC 620. At this speed, he ex-pects performance in excess of 300 SPECint92 and 600SPECfp92, challenging Digital’s 21164 for the perfor-mance lead. Due to schedule slips, however, the R10000has not yet taped out; we do not expect volume ship-ments until 4Q95, by which time Digital may enhancethe performance of its processor.

Speculative Execution Beyond BranchesThe front end of the processor is responsible for

maintaining a continuous flow of instructions into thequeues, despite problems caused by branches and cachemisses. As Figure 1 shows, the chip uses a two-way set-associative instruction cache of 32K. Like other highlysuperscalar designs, the R10000 predecodes instructionsas they are loaded into this cache, which holds four extra

bits per instruction. These bits reducethe time needed to determine the ap-propriate queue for each instruction.

The processor fetches four instruc-tions per cycle from the cache and de-codes them. If a branch is discovered, itis immediately predicted; if it is pre-dicted taken, the target address is sentto the instruction cache, redirecting thefetch stream. Because of the one cycleneeded to decode the branch, takenbranches create a “bubble” in the fetchstream; the deep queues, however, gen-erally prevent this bubble from delay-ing the execution pipeline.

The sequential instructions thatare loaded during this extra cycle arenot discarded but are saved in a “re-sume” cache. If the branch is later de-termined to have been mispredicted, thesequential instructions are reloadedfrom the resume cache, reducing themispredicted branch penalty by onecycle. The resume cache has four entriesof four instructions each, allowing spec-ulative execution beyond four branches.

The R10000 design uses the stan-dard two-bit Smith method to predict

M I C R O P R O C E S S O R R E P O R T

MIPS R10000 Uses Decoupled ArchitectureHigh-Performance Core Will Drive MIPS High-End for Years

1 9 9 4

FORUMMICROPROCESSOR

Figure 1. The R10000 uses deep instruction queues to decouple the instruction fetch logicfrom the five function units.

Instruction Cache32K, two-way associative

Predecode

ITLB8 entry

Decode, Map,

DispatchActive

ListMapTable

Main TLB64 entries

Data Cache32K, two-way associative

4 instr

MemoryQueue

16 entries

IntegerQueue16 entries

FPQueue16 entries

ALU2 FP

Mult÷!

virtualaddr

phys addr

DataSRAM

512K-16M

nche B

ache Inte

m Inte

TagSRAM

BHT512 x 2

Resume

Address

Integer Registers64 ! 64 bits

FP Registers64 ! 64 bits

MIPS R10K uses sophisticatedout-of-order engine; branch

delay slot not useful

– Gwennap, MPR, 1994

Intel Nehalem frontend breaks x86 CISCinto smaller RISC-like µops; µcode engine

handles rarely used complex instr

– Kanter, Real World Technologies, 2009

Microprogamming Today

• Microprogramming is far from extinct

• Played a crucial role in microprocessors of the 1980s(DEC VAX, Motorola 68K series, Intel 386/486)

• Microprogramming plays assisting role in many modern processors(AMD Phenom, Intel Nehalem, Intel Atom, IBM Z196)

– 761 Z196 instructions executed with hardwired control– 219 Z196 “complex” instructions always executed with microcode– 24 Z196 instructions conditionally executed with microcode

• Patchable microcode common for post-fabrication bug fixes (Intelprocessors load µcode patches at bootup)

ece 4750 computer architecture, fall 2019 t02 fundamental ...€¦ · t02 fundamental processor...

Documents

computer fundamental - (chapter 01 - introduction to...

computer introduction (fundamental computer)

computer networking fundamental

computer fundamental notes

computer fundamental introduction_and_types

computer fundamental mba

computer fundamental assignment

fundamental of computer

computer fundamental -...

computer fundamental pdf

computer fundamental

คอมพิวเตอร์พื้นฐาน...

fundamental computer

computer fundamental (presentation)

ece 4750 computer architecture, fall 2014 t02 fsm...

computer fundamental mcq

computer fundamental & networking

ece 4750 computer architecture, fall 2019 t02 ......ece 4750...

t02 shadow

1. computer fundamental