ece 4750 computer architecture, fall 2019 t02 fundamental ...€¦ · t02 fundamental processor...
Post on 13-Jun-2020
10 Views
Preview:
TRANSCRIPT
ECE 4750 Computer Architecture, Fall 2019
T02 Fundamental Processor Microarchitecture
School of Electrical and Computer EngineeringCornell University
revision: 2019-09-10-15-43
1 Processor Microarchitectural Design Patterns 4
1.1. Transactions and Steps . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Microarchitecture: Control/Datapath Split . . . . . . . . . . . . . . 5
2 TinyRV1 Single-Cycle Processor 6
2.1. High-Level Idea for Single-Cycle Processors . . . . . . . . . . . . . . 7
2.2. Single-Cycle Processor Datapath . . . . . . . . . . . . . . . . . . . . 8
2.3. Single-Cycle Processor Control Unit . . . . . . . . . . . . . . . . . . 14
2.4. Analyzing Performance . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 TinyRV1 FSM Processor 17
3.1. High-Level Idea for FSM Processors . . . . . . . . . . . . . . . . . . 18
3.2. FSM Processor Datapath . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3. FSM Processor Control Unit . . . . . . . . . . . . . . . . . . . . . . . 25
3.4. Analyzing Performance . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 TinyRV1 Pipelined Processor 31
4.1. High-Level Idea for Pipelined Processors . . . . . . . . . . . . . . . 32
4.2. Pipelined Processor Datapath and Control Unit . . . . . . . . . . . . 34
1
5 Pipeline Hazards: RAW Data Hazards 39
5.1. Expose in Instruction Set Architecture . . . . . . . . . . . . . . . . . 41
5.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3. Hardware Bypassing/Forwarding . . . . . . . . . . . . . . . . . . . 43
5.4. RAW Data Hazards Through Memory . . . . . . . . . . . . . . . . . 47
6 Pipeline Hazards: Control Hazards 48
6.1. Expose in Instruction Set Architecture . . . . . . . . . . . . . . . . . 50
6.2. Hardware Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3. Interrupts and Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 54
7 Pipeline Hazards: Structural Hazards 59
7.1. Expose in Instruction Set Architecture . . . . . . . . . . . . . . . . . 60
7.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3. Hardware Duplication . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8 Pipeline Hazards: WAW and WAR Name Hazards 62
8.1. Software Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
9 Summary of Processor Performance 65
10 Case Study: Transition from CISC to RISC 69
10.1. Example CISC: IBM 360/M30 . . . . . . . . . . . . . . . . . . . . . . 70
10.2. Example RISC: MIPS R2K . . . . . . . . . . . . . . . . . . . . . . . . 73
Copyright © 2016 Christopher Batten. All rights reserved. This handout was originallyprepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has
2
been updated in 2017-2019 by Prof. Christina Delimitrou. Download and use of this hand-out is permitted for individual educational non-commercial purposes only. Redistributioneither in part or in whole via both commercial or non-commercial means requires writtenpermission.
3
1. Processor Microarchitectural Design Patterns 2.0. Transactions and Steps
1. Processor Microarchitectural Design Patterns
TimeProgram
=Instructions
Program× Avg Cycles
Instruction× Time
Cycle
• Instructions / program depends on source code, compiler, ISA• Avg cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation
Microarchitecture CPI Cycle Time
Single-Cycle Processor 1 longFSM Processor >1 shortPipelined Processor ≈1 short
1.1. Transactions and Steps
• We can think of each instruction as a transaction• Executing a transaction involves a sequence of steps
add addi mul lw sw jal jr bne
Fetch Instruction 3 3 3 3 3 3 3 3
Decode Instruction 3 3 3 3 3 3 3 3
Read Registers 3 3 3 3 3 3 3
Register Arithmetic 3 3 3 3 3 3
Read Memory 3
Write Memory 3
Write Registers 3 3 3 3 3
Update PC 3 3 3 3 3 3 3 3
3
1. Processor Microarchitectural Design Patterns1.2. Microarchitecture: Control/Datapath Split
1.2. Microarchitecture: Control/Datapath Split
Control Signals Status Signals
Control Unit
Datapath
imemreq_val
imemreq
imemresp
dmemreq_val
dmemreq
dmemresp
Memory
4
2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors
2. TinyRV1 Single-Cycle Processor
TimeProgram
=Instructions
Program× Avg Cycles
Instruction× Time
Cycle
• Instructions / program depends on source code, compiler, ISA• Avg cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation
Microarchitecture CPI Cycle Time
Single-Cycle Processor 1 longFSM Processor >1 shortPipelined Processor ≈1 short
Technology Constraints
• Assume technology wherelogic is not too expensive, sowe do not need to overlyminimize the number ofregisters and combinationallogic
• Assume multi-ported registerfile with a reasonable numberof ports is feasible
• Assume a dual-portedcombinational memory
Control Status
Control Unit
Datapath
<1 cyclecombinational
Memory
regfile
imemreq
imemresp
dmemreq
dmemresp
5
2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors
2.1. High-Level Idea for Single-Cycle Processors
add addi mul lw sw jal jr bne
Fetch Instruction 3 3 3 3 3 3 3 3
Decode Instruction 3 3 3 3 3 3 3 3
Read Registers 3 3 3 3 3 3 3
Register Arithmetic 3 3 3 3 3 3
Read Memory 3
Write Memory 3
Write Registers 3 3 3 3 3
Update PC 3 3 3 3 3 3 3 3
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
add
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePC
jal
Single-Cycle
WriteReg
6
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
2.2. Single-Cycle Processor Datapath
pc
regfile(read)
regfile(write)
0000000 000rs1rs2 rd 0110011
067111214151920242531ADD
add rd, rs1, rs2
R[rd] ← R[rs1] + R[rs2]PC ← PC + 4
pc
regfile(read)
regfile(write)
000rs1 rd 0010011
06711121415192031
imm
ADDI
addi rd, rs1, imm
R[rd] ← R[rs1] + sext(imm)PC ← PC + 4
7
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
Implementing ADD and ADDI Instructions
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
0000001 000rs1rs2 rd 0110011
067111214151920242531MUL
mul rd, rs1, rs2
R[rd] ← R[rs1] × R[rs2]PC ← PC + 4
8
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
010rs1 rd 0000011
06711121415192031
imm
LW
lw rd, imm(rs1)
R[rd] ← M[ R[rs1] + sext(imm) ]PC ← PC + 4
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
010rs1 imm 0100011
06711121415192031
imm rs2
2425SW
sw rs2, imm(rs1)
M[ R[rs1] + sext(imm) ] ← R[rs2]PC ← PC + 4
imm = { inst[31:25], inst[11:7] }
9
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
+pc_sel
jalbr_targ
alu_func
rd 1101111
067111231
imm
JAL
jal rd, imm
R[rd] ← PC + 4PC ← PC + sext(imm)
imm = { inst[31], inst[19:12],inst[20], inst[30:21], 0 }
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
+
pc_sel
jalbr_targ
alu_func
jr_targ
000rs1 1100111
06711121415192031
00000000000000000
JR
jr rs1
PC ← R[rs1]
10
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
+pc_sel
jalbr_targ
alu_func
jr_targ
eq
001rs1 1100011
06711121415192031
rs2
2425
immimm
BNE
bne rs1, rs2, imm
if ( R[rs1] != R[rs2] ) PC ← PC + sext(imm)else PC ← PC + 4
imm = { inst[31], inst[7],inst[30:25], inst[11:8], 0 }
11
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
Adding a New Auto-Incrementing Load Instruction
Draw on the datapath diagram what paths we need to use as well asany new paths we will need to add in order to implement the followingauto-incrementing load instruction.
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
+
pc_sel
jalbr_targ
alu_func
jr_targ
eq
000rs1 rd 0001011
06711121415192031
imm
LW.AI
lw.ai rd, imm(rs1)
R[rd] ← M[ R[rs1] + sext(imm) ]R[rs1] ← R[rs1] + 4PC ← PC + 4
12
2. TinyRV1 Single-Cycle Processor 2.3. Single-Cycle Processor Control Unit
2.3. Single-Cycle Processor Control Unit
imem dmempc imm op1 alu wb rf req req
inst sel type sel func sel wen val val
add pc+4 – rf + alu 1 1 0
addi
mul pc+4 – – – mul 1 1 0
lw pc+4 i imm + mem 1 1 1
sw
jal
jr jr i – – – 0 1 0
bne
Need to factor eq status signal into pc_sel signal for BNE!
2.4. Analyzing Performance
TimeProgram
=Instructions
Program× Cycles
Instruction× Time
Cycles
• Instructions / program depends on source code, compiler, ISA• Cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation
13
2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance
Estimating cycle time
There are many paths through the design that start at a state elementand end at a state element. The “critical path” is the longest path acrossall of these paths. We can usually use a simple first-order static timingestimate to estimate the cycle time (i.e., the clock period and thus alsothe clock frequency).
pc
regfile(read)
regfile(write)
pc_plus4
+4
alu
ir[24:20]
ir[19:15]
imemreq.addr
To control unit
imemresp.data
ir[11:7]
op2_sel
immgenir[31:7]
mu
l
wb_sel
dmemreq.addr
dmemresp.data
dmemreq.data
imm_type
rf_wen
+
pc_sel
jalbr_targ
alu_func
jr_targ
eq
• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
14
2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance
Estimating execution time
Using our first-order equation for processor performance, how long innanoseconds will it take to execute the vector-vector add exampleassuming n is 64?
loop:lw x5, 0(x13)lw x6, 0(x14)add x7, x5, x6sw x7, 0(x12)addi x13, x12, 4addi x14, x14, 4addi x12, x12, 4addi x15, x15, -1bne x15, x0, loopjr x1
Using our first-order equation for processor performance, how long innanoseconds will it take to execute the mystery program assuming n is64 and that we find a match on the last element.
addi x5, x0, 0loop:lw x6, 0(x12)bne x6, x14, fooaddi x10, x5, 0jr x1
foo:addi x12, x12, 4addi x5, x5, 1bne x5, x13, loopaddi x10, x0, -1jr x1
15
3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors
3. TinyRV1 FSM Processor
TimeProgram
=Instructions
Program× Avg Cycles
Instruction× Time
Cycle
• Instructions / program depends on source code, compiler, ISA• Avg cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation
Microarchitecture CPI Cycle Time
Single-Cycle Processor 1 longFSM Processor >1 shortPipelined Processor ≈1 short
Technology Constraints
• Assume legacy technologywhere logic is expensive, sowe want to minimize thenumber of registers andcombinational logic
• Assume an (unrealistic)combinational memory
• Assume multi-ported registerfiles and memories are tooexpensive, these structurescan only have a singleread/write port
Control Status
Control Unit
Datapath
<1 cyclecombinational
Memory
regfile
memreq
memresp
16
3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors
3.1. High-Level Idea for FSM Processors
add addi mul lw sw jal jr bne
Fetch Instruction 3 3 3 3 3 3 3 3
Decode Instruction 3 3 3 3 3 3 3 3
Read Registers 3 3 3 3 3 3 3
Register Arithmetic 3 3 3 3 3 3
Read Memory 3
Write Memory 3
Write Registers 3 3 3 3 3
Update PC 3 3 3 3 3 3 3 3
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
add
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePC
jal
Single-Cycle
WriteReg
jal
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
add
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePCF
SM
WriteReg
3.2. FSM Processor Datapath
Implementing an FSM datapath requires thinking about the requiredFSM states, but we will defer discussion of how to implement thecontrol logic to the next section.
17
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
Implementing Fetch Sequence
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
F0
F1
F2
(pseudo-control-signal syntax)
18
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
Implementing ADD Instruction
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
F0
F1
F2
A0
A1
A2
(pseudo-control-signal syntax)add rd, rs1, rs2
19
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
Full Datapath for TinyRV1 FSM Processor
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
F0
F1
F2
A0
A1
A2
AI0
AI1
AI2
M0
M1
M2
M35
M3
L0
L1
L2
L3
S0
S1
S2
S3
JA0
JA1
JA2
JR0 B0
B1
B2
B3
B4
B5
ADDI Pseudo-Control-SignalFragment
addi rd, rs1, imm
20
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
MUL Instruction
mul rd, rs1, rs2
M0: A← RF[x0]M1: B← RF[rs1]M2: C← RF[rs2]M3: A← A +? B;
B← B << 1; C← C >> 1M4: A← A +? B;
B← B << 1; C← C >> 1...
M35: RF[rd]← A +? B; goto F0
LW Instruction
lw rd, imm(rs1)
L0: A← RF[rs1]L1: B← sext(imm_i)L2: memreq.addr← A + BL3: RF[rd]← RD; goto F0
SW Instruction
sw rs2, imm(rs1)
S0: WD← RF[rs2]S1: A← RF[rs1]S2: B← sext(imm_s)S3: memreq.addr← A + B; goto F0
JAL Instruction
jal rd, imm
JA0: RF[rd]← PCJA1: B← sext(imm_j)JA2: PC← A + B; goto F0
JR Instruction
jr rs1
JR0: PC← RF[rs1]; goto F0
BNE Instruction
bne rs1, rs2, imm
B0: A← RF[rs1]B1: B← RF[rs2]B2: B← sext(imm_b);
if A == B goto F0B3: A← PCB4: A← A − 4B5: PC← A + B; goto F0
21
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
Adding a Complex Instruction
FSM processors simplify adding complex instructions. New instructionsusually do not require datapath modifications, only additional states.
add.mm rd, rs1, rs2
M[ R[rd] ]←M[ R[rs1] ] + M[ R[rs2] ]
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
22
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
Adding a New Auto-Incrementing Load Instruction
Implement the following auto-incrementing load instruction usingpseudo-control-signal syntax. Modify the datapath if necessary.
lw.ai rd, imm(rs1)
R[rd]←M[ R[rs1] + sext(imm_i) ]; R[rs1]← R[rs1] + 4
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
23
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
3.3. FSM Processor Control Unit
F0
F1
F2
A0
A1
A2
AI0
AI1
AI2
M0
M1
M2
M35
M3
L0
L1
L2
L3
S0
S1
S2
S3
JA0
JA1
JA2
JR0 B0
B1
B2
B3
B4
B5
We will study three techniquesfor implementing FSM controlunits:
• Hardwired control units arehigh-performance, butinflexible
• Horizontal µcodingincreases flexibility, requireslarge control store
• Vertical µcoding is anintermediate design point
Hardwired FSM
State
ControlSignalLogic
StateTransition
Logic
Control Signals(22)
Status Signals(1)
24
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
Control signal output table for hardwired control unit
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
F0: memreq.addr← PC; A← PCF1: IR← RDF2: PC← A + 4; goto inst
A0: A← RF[rs1]A1: B← RF[rs2]A2: RF[rd]← A + B; goto F0
Bus Enables Register Enables Mux Func RF MReq
state pc ig alu rf rd pc ir a b c wd b c ig alu sel wen val op
F0 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 1 r
F1 0 0 0 0 1 0 1 0 0 0 0 – – – – – 0 0 –
F2 0 0 1 0 0 1 0 0 0 0 0 – – – +4 – 0 0 –
A0
A1
A2
25
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
Vertically Microcoded FSM
uPC
ControlSignals
NextState
+1
decoderopcodeF0
eq
Next State Encoding
n : increment uPC by oned : dispatch based on opcodef : goto state F0b : goto state F0 if A == B
Control Signals (22)
busen
muxsel
Status Signals(1)
• Use memory array (called the control store) instead of random logicto encode both the control signal logic and the state transition logic
• Enables a more systematic approach to implementing complexmulti-cycle instructions
• Microcoding can produce good performance if accessing the controlstore is much faster than accessing main memory
• Read-only control stores might be replaceable enabling in-fieldupdates, while read-write control stores can simplify diagnosticsand microcode patches
26
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
Control signal store for microcoded control unit
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
B0: A← RF[rs1]B1: B← RF[rs2]B2: B← sext(imm_b); if A == B goto F0
B3: A← PCB4: A← A − 4B5: PC← A + B; goto F0
Bus Enables Register Enables Mux Func RF MReq
state pc ig alu rf rd pc ir a b c wd b c ig alu sel wen val op next
B0 0 0 0 1 0 0 0 1 0 0 0 – – – – rs1 0 0 –
B1 0 0 0 1 0 0 0 0 1 0 0 b – – – rs2 0 0 –
B2 0 1 0 0 0 0 0 0 1 0 0 b – b cmp – 0 0 –
B3 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 0 –
B4 0 0 1 0 0 0 0 1 0 0 0 – – – −4 – 0 0 –
B5 0 0 1 0 0 1 0 0 0 0 0 – – – + – 0 0 –
27
3. TinyRV1 FSM Processor 3.4. Analyzing Performance
3.4. Analyzing Performance
TimeProgram
=Instructions
Program× Cycles
Instruction× Time
Cycles
Estimating cycle time
PC IR
ir_enpc_en
Dat
apat
h B
us
pc_bus_en
A
a_en
+4
alu_bus_en
memresp.data
memreq.addr
RD
rd_bus_en
To control unit
B
b_en
RFrf_wen
rf_bus_en
alu_func
alu func
alu
rf_addr_sel
rs1rs2rd
A + 4+4:
A + B+:
sextimm
immgen_bus_en
>>
b_sel
>>C
c_selc_en
x0
A +? B+?:
memreq.data
WD
wd_en
imm_type
imm gen
immgen
A == Bcmp:
sext(IR[31:20])i: sext({IR[31:25],IR[11:7]})s:sext({IR[31],IR[7],IR[30:25],IR[11:8],0})b:sext({IR[31],IR[19:12],IR[20],IR[30:21],0})j:
alu_bus_en
eq
A − 4−4:
• register read/write = 1τ
• regfile read/write = 10τ
• mem read/write = 20τ
• immgen = 2τ
• mux = 3τ
• alu = 10τ
• 1b shifter = 1τ
• tri-state buf = 1τ
28
3. TinyRV1 FSM Processor 3.4. Analyzing Performance
Estimating execution time
Using our first-order equation for processor performance, how long inunits of τ will it take to execute the vector-vector add exampleassuming n is 64?
loop:lw x5, 0(x13)lw x6, 0(x14)add x7, x5, x6sw x7, 0(x12)addi x13, x12, 4addi x14, x14, 4addi x12, x12, 4addi x15, x15, -1bne x15, x0, loopjr x1
Using our first-order equation for processor performance, how long inunits of τ will it take to execute the mystery program assuming n is 64and that we find a match on the last element.
addi x5, x0, 0loop:lw x6, 0(x12)bne x6, x14, fooaddi x10, x5, 0jr x1
foo:addi x12, x12, 4addi x5, x5, 1bne x5, x13, loopaddi x10, x0, -1jr x1
29
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
4. TinyRV1 Pipelined Processor
TimeProgram
=Instructions
Program× Avg Cycles
Instruction× Time
Cycle
• Instructions / program depends on source code, compiler, ISA• Avg cycles / instruction (CPI) depends on ISA, microarchitecture• Time / cycle depends upon microarchitecture and implementation
Microarchitecture CPI Cycle Time
Single-Cycle Processor 1 longFSM Processor >1 shortPipelined Processor ≈1 short
Technology Constraints
• Assume modern technologywhere logic is cheap and fast(e.g., fast integer ALU)
• Assume multi-ported registerfiles with a reasonablenumber of ports are feasible
• Assume small amount of veryfast memory (caches) backedby large, slower memory
Control Status
Control Unit
Datapath
<1 cyclecombinational
Memory
regfile
imemreq
imemresp
dmemreq
dmemresp
30
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
4.1. High-Level Idea for Pipelined Processors
• Anne, Brian, Cathy, and Dave each have one load of clothes• Washing, drying, folding, and storing each take 30 minutes
7pm 8pm 9pm 10pm 11pm 12am 1am 2am 3am
Anne'sLoad
Ben'sLoad
Cathy'sLoad
Dave'sLoad
Pipelined Laundry with Slow Dryers
Anne'sLoad
7pm 8pm 9pm
Ben'sLoad
Cathy'sLoad
Dave'sLoad
7pm 8pm 9pm 10pm 11pm
Anne'sLoad
Ben'sLoad
Cathy'sLoad
Dave'sLoad
Fixed Time-Slot Laundry
Pipelined Laundry10pm 12am
Pipelining lessons
• Multiple transactions operate simultaneously using different resources• Pipelining does not help the transaction latency• Pipelining does help the transaction throughput• Potential speedup is proportional to the number of pipeline stages• Potential speedup is limited by the slowest pipeline stage• Potential speedup is reduced by time to fill the pipeline
31
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
Applying pipelining to processors
add addi mul lw sw jal jr bne
Fetch Instruction 3 3 3 3 3 3 3 3
Decode Instruction 3 3 3 3 3 3 3 3
Read Registers 3 3 3 3 3 3 3
Register Arithmetic 3 3 3 3 3 3
Read Memory 3
Write Memory 3
Write Registers 3 3 3 3 3
Update PC 3 3 3 3 3 3 3 3
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
add
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePC
jal
Single-Cycle
WriteReg
jal
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
add
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePCF
SM
WriteReg
FetchInst
lw
DecodeInst
RegArith
ReadMem
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
RegArith
WriteReg
ReadReg
UpdatePC
FetchInst
DecodeInst
UpdatePC
add
jal
Pipelined
WriteReg
32
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
4.2. Pipelined Processor Datapath and Control Unit• Incrementally develop an unpipelined datapath• Keep data flowing from left to right• Position control signal table early in the diagram• Divide datapath/control into stages by inserting pipeline registers• Keep the pipeline stages roughly balanced• Forward arrows should avoid “skipping” pipeline registers• Backward arrows will need careful consideration
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgenalu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
imemreq.addr
imemresp.data
dmemreq.addr
dmemreq.data
dmemresp.data
F D X M W
F D X M W
F D X M W
addi x1, x2, 1
addi x3, x4, 1
addi x5, x6, 1
33
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Adding a new auto-incrementing load instruction
Draw on the above datapath diagram what paths we need to use as wellas any new paths we will need to add in order to implement thefollowing auto-incrementing load instruction.
lw.ai rd, imm(rs1)
R[rd]←M[ R[rs1] + sext(imm) ]; R[rs1]← R[rs1] + 4
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgenalu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
34
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Pipeline diagrams
addi x1, x2, 1
addi x3, x4, 1
addi x5, x6, 1
What would be the total execution time if these three instructions wererepeated 10 times?
Hazards occur when instructions interact with each other in pipeline
• RAW Data Hazards: An instruction depends on a data valueproduced by an earlier instruction
• Control Hazards: Whether or not an instruction should be executeddepends on a control decision made by an earlier instruction
• Structural Hazards: An instruction in the pipeline needs a resourcebeing used by another instruction in the pipeline
• WAW and WAR Name Hazards: An instruction in the pipeline iswriting a register that an earlier instruction in the pipeline is eitherwriting or reading
Stalling and squashing instructions
• Stalling: An instruction originates a stall due to a hazard, causing allinstructions earlier in the pipeline to also stall. When the hazard isresolved, the instruction no longer needs to stall and the pipelinestarts flowing again.
• Squashing: An instruction originates a squash due to a hazard, andsquashes all previous instructions in the pipeline (but not itself). Werestart the pipeline to begin executing a new instruction sequence.
35
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Control logic with no stalling and no squashing
Stage ADatapathLogic
Stage BDatapathLogic
Stage CDatapathLogic
Stage AControlLogic
Stage BControlLogic
Stage CControlLogic
always_ff @( posedge clk )if ( reset )
val_B <= 0else
val_B <= next_val_A
next_val_B = val_B
Control logic with stalling and no squashing
Stage AControlLogic
Stage ADatapath
Logic
Stage BControlLogic
Stage CControlLogic
Stage BDatapath
Logic
Stage CDatapath
Logic
control, ostall signals
reg_en_B
val_Bnext_val_B
reg_en_B = !stall_B
always_ff @( posedge clk )if ( reset )
val_B <= 0else if ( reg_en_B )
val_B <= next_val_A
ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )
stall_B = val_B && ( ostall_B || ostall_C || ... )
next_val_B = val_B && !stall_B
ostall_B Originating stall due to hazards detected in B stage.
stall_B Should we actually stall B stage? Factors in ostalls due to hazardsand ostalls from later pipeline stages.
next_val_B Only send transaction to next stage if transaction in B stage is validand we are not stalling B stage.
36
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Control logic with stalling and squashing
Stage AControlLogic
Stage ADatapath
Logic
Stage BControlLogic
Stage CControlLogic
Stage BDatapath
Logic
Stage CDatapath
Logic
control, ostall signals
reg_en_B
val_Bnext_val_B
control, ostall, osquash signals
reg_en_B = !stall_B|| squash_B
always_ff @( posedge clk )if ( reset )
val_B <= 0else if ( reg_en_B )
val_B <= next_val_A
ostall_B = val_B && ( ostall_hazard1_B || ostall_hazard2_B )
stall_B = val_B && ( ostall_B || ostall_C || ... )
osquash_B = val_B && !stall_B && ( osquash_hazard1_B || ... )
squash_B = val_B && ( osquash_C || ... )
next_val_B = val_B && !stall_B && !squash_B
ostall_B Originating stall due to hazards detected in B stage.
stall_B Should we actually stall B stage? Factors in ostalls due to hazardsand ostalls from later pipeline stages.
osquash_B Originating squash due to hazards detected in B stage. If this stageis stalling, do not originate a squash.
squash_B Should we squash B stage? Factors in the originating squashesfrom later pipeline stages. An originating squash from B stagemeans to squash all stages earlier than B, so osquash_B is notfactored into squash_B.
next_val_B Only send transaction to next stage if transaction in B stage is validand we are not stalling or squashing B stage.
37
5. Pipeline Hazards: RAW Data Hazards
5. Pipeline Hazards: RAW Data Hazards
RAW data hazards occur when one instruction depends on a data valueproduced by a preceding instruction still in the pipeline. We usearchitectural dependency arrows to illustrate RAW dependencies inassembly code sequences.
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
Using pipeline diagrams to illustrate RAW hazards
We use microarchitectural dependency arrows to illustrate RAWhazards on pipeline diagrams.
F D X M W
F D X M W
F D X M W
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
38
5. Pipeline Hazards: RAW Data Hazards
Approaches to resolving data hazards
• Expose in Instruction Set Architecture: Expose data hazards in ISAforcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)
• Hardware Scheduling: Hardware dynamically schedulesinstructions to avoid RAW hazards, potentially allowinginstructions to execute out of order
• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished producingdata value; software scheduling can still be used to avoid stalling(i.e., software scheduling for performance)
• Hardware Bypassing/Forwarding: Hardware allows values to besent from an earlier instruction to a later instruction before theearlier instruction has left the pipeline (sometimes called forwarding)
• Hardware Speculation: Hardware guesses that there is no hazardand allows later instructions to potentially read invalid data; detectswhen there is a problem, squashes and then re-executes instructionsthat operated on invalid data
39
5. Pipeline Hazards: RAW Data Hazards 5.1. Expose in Instruction Set Architecture
5.1. Expose in Instruction Set Architecture
Insert nops to delay read of earlierwrite. These nops count as realinstructions increasinginstructions per program.
addi x1, x2, 1nopnopnopaddi x3, x1, 1nopnopnopaddi x4, x3, 1
Insert independent instructions todelay read of earlier write, andonly use nops if there is notenough useful work
addi x1, x2, 1addi x6, x7, 1addi x8, x9, 1nopaddi x3, x1, 1nopnopnopaddi x4, x3, 1
Pipeline diagram showing exposing RAW data hazards in the ISA
addi x1, x2, 1
addi x6, x7, 1
addi x8, x9, 1
nop
addi x3, x1, 1
nop
nop
nop
addi x4, x3, 1
Note: If hazard is exposed in ISA, software scheduling is required forcorrectness! A scheduling mistake can cause undefined behavior.
40
5. Pipeline Hazards: RAW Data Hazards 5.2. Hardware Stalling
5.2. Hardware Stalling
Hardware includes control logic that freezes later instructions (in frontof pipeline) until earlier instruction (in back of pipeline) has finishedproducing data value.
Pipeline diagram showing hardware stalling for RAW data hazards
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
Note: Software scheduling is not required for correctness, but canimprove performance! Programmer or compiler schedules independentinstructions to reduce the number of cycles spent stalling.
Modifications to datapath/control to support hardware stalling
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
imemreq.addr
imemresp.data
dmemreq.addr
dmemreq.data
dmemresp.data
reg_en_D
reg_en_F
CSig TableStall Logic
41
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
Deriving the stall signal
add addi mul lw sw jal jr bne
rs1_en
rs2_en
rf_wen
ostall_waddr_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
ostall_waddr_M_rs1_D =val_D && rs1_en_D && val_M && rf_wen_M
&& (inst_rs1_D == rf_waddr_M) && (rf_waddr_M != 0)
ostall_waddr_W_rs1_D =val_D && rs1_en_D && val_W && rf_wen_W
&& (inst_rs1_D == rf_waddr_W) && (rf_waddr_W != 0)
... similar for ostall signals for rs2 source register ...
ostall_D = val_D&& ( ostall_waddr_X_rs1_D || ostall_waddr_X_rs2_D
|| ostall_waddr_M_rs1_D || ostall_waddr_M_rs2_D|| ostall_waddr_W_rs1_D || ostall_waddr_W_rs2_D )
5.3. Hardware Bypassing/Forwarding
Hardware allows values to be sent from an earlier instruction (in backof pipeline) to a later instruction (in front of pipeline) before the earlierinstruction has left the pipeline. Sometimes called “forwarding”.
42
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
Pipeline diagram showing hardware bypassing for RAW data hazards
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
Adding single bypass path to support limited hardware bypassing
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgenalu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
Deriving the bypass and stall signals
ostall_waddr_X_rs1_D = 0bypass_waddr_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
43
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
Pipeline diagram showing multiple hardware bypass paths
addi x2, x10, 1
addi x2, x11, 1
addi x1, x2, 1
addi x3, x4, 1
addi x5, x3, 1
add x6, x1, x3
sw x5, 0(x1)
jr x6
Adding all bypass path to support full hardware bypassing
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_Xbypass_from_Mbypass_from_W
op2_byp_sel_D
44
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
Handling load-use RAW dependencies
ALU-use latency is only one cycle, but load-use latency is two cycles.
lw x1, 0(x2)
addi x3, x1, 1
lw x1, 0(x2)
addi x3, x1, 1
ostall_load_use_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X == lw)
ostall_load_use_X_rs2_D =val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X == lw)
ostall_D =val_D && ( ostall_load_use_X_rs1_D || ostall_load_use_X_rs2_D )
bypass_waddr_X_rs1_D =val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X != lw)
bypass_waddr_X_rs2_D =val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)&& (op_X != lw)
45
5. Pipeline Hazards: RAW Data Hazards 5.4. RAW Data Hazards Through Memory
Pipeline diagram for simple assembly sequence
Draw a pipeline diagram illustrating how the following assemblysequence would execute on a fully bypassed pipelined TinyRV1processor. Include microarchitectural dependency arrows to illustratehow data is transferred along various bypass paths.
lw x1, 0(x2)
lw x3, 0(x4)
add x5, x1, x3
sw x5, 0(x6)
addi x2, x2, 4
addi x4, x4, 4
addi x6, x6, 4
addi x7, x7, -1
bne x7, x0, loop
5.4. RAW Data Hazards Through Memory
So far we have only studied RAW data hazards through registers, butwe must also carefully consider RAW data hazards through memory.
sw x1, 0(x2)lw x3, 0(x4) # RAW dependency occurs if R[x2] == R[x4]
sw x1, 0(x2)
lw x3, 0(x4)
46
6. Pipeline Hazards: Control Hazards
6. Pipeline Hazards: Control Hazards
Control hazards occur when whether or not an instruction should beexecuted depends on a control decision made by an earlier instructionWe use architectural dependency arrows to illustrate controldependencies in assembly code sequences.
Static Instr Sequence
addi x1, x0, 1jal x0, fooopAopB
foo: addi x2, x3, 1bne x0, x1, baropCopDopE
bar: addi x4, x5, 1
Dynamic Instr Sequence
addi x1, x0, 1jal x0, fooaddi x2, x3, 1bne x0, x1, baraddi x4, x5, 1
Using pipeline diagrams to illustrate control hazards
We use microarchitectural dependency arrows to illustrate controlhazards on pipeline diagrams.
addi x1, x0, 1
jal x0, foo
addi x2, x3, 1
bne x0, x1, bar
addi x4, x5, 1
47
6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture
The jump resolution latency and branch resolution latency are thenumber of cycles we need to delay the fetch of the next instruction inorder to avoid any kind of control hazard. Jump resolution latency istwo cycles, and branch resolution latency is three cycles.
addi x1, x0, 1
jal x0, foo
addi x2, x3, 1
bne x0, x1, bar
addi x4, x5, 1
Approaches to resolving control hazards
• Expose in Instruction Set Architecture: Expose control hazards inISA forcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)
• Software Predication: Programmer or compiler converts controlflow into data flow by using instructions that conditionally executebased on a data value
• Hardware Speculation: Hardware guesses which way the controlflow will go and potentially fetches incorrect instructions; detectswhen there is a problem and re-executes instructions that are alongthe correct control flow
• Software Hints: Programmer or compiler provides hints aboutwhether a conditional branch will be taken or not taken, andhardware can use these hints for more efficient hardware speculation
48
6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture
6.1. Expose in Instruction Set Architecture
Expose branch delay slots as part of the instruction set. Branch delayslots are instructions that follow a jump or branch and are alwaysexecuted regardless of whether a jump or branch is taken or not taken.Compiler tries to insert useful instructions, otherwise inserts nops.
addi x1, x0, 1jal x0, foonopopAopB
foo: addi x2, x3, 1bne x0, x1, barnopnopopCopDopE
bar: addi x4, x5, 1
Assume we modify the TinyRV1instruction set to specify that JAL,and JR instructions have asingle-instruction branch delayslot (i.e., one instruction after aJAL and JR is always executed)and the BNE instruction has atwo-instruction branch delay slot(i.e., two instructions after a BNEare always executed).
Pipeline diagram showing using branch delay slots for control hazards
addi x1, x0, 1
jal x0, foo
nop
addi x2, x3, 1
bne x0, x1, bar
nop
nop
addi x4, x5, 1
49
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
6.2. Hardware Speculation
Hardware guesses which way the control flow will go and potentiallyfetches incorrect instructions; detects when there is a problem andre-executes instructions the instructions that are along the correctcontrol flow. For now, we will only consider a simple branch predictionscheme where the hardware always predicts not taken.
Pipeline diagram when branch is not taken
addi x1, x0, 1
jal x0, foo
opA
addi x2, x3, 1
bne x0, x1, bar
opC
opD
Pipeline diagram when branch is taken
addi x1, x0, 1
jal x0, foo
opA
addi x2, x3, 1
bne x0, x1, bar
opC
opD
addi x4, x5, 1
50
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
Modifications to datapath/control to support hardware speculation
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_Xbypass_from_Mbypass_from_W
op2_byp_sel_D
pc_sel_F
CSig TableStall, Bypass, &Squash Logic
Deriving the squash signals
osquash_j_D = (op_D == jal) || (op_D == jr)osquash_br_X = (op_X == bne) && !eq_X
Our generic stall/squash scheme gives priority to squashes over stalls.
osquash_D = val_D && !stall_D && osquash_j_Dsquash_D = val_D && osquash_X
osquash_X = val_D && !stall_X && osquash_br_Xsquash_X = 0
Important: PC select logic must give priority to older instructions(i.e., prioritize branches over jumps)! Good quiz question?
51
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
Pipeline diagram for simple assembly sequence
Draw a pipeline diagram illustrating how the following assemblysequence would execute on a fully bypassed pipelined TinyRV1processor that uses hardware speculation which always predictsnot-taken. Unlike the “standard” TinyRV1 processor, you should alsoassume that we add a single-instruction branch delay slot to theinstruction set. So this processor will partially expose the controlhazard in the instruction, but also use hardware speculation. Includemicroarchitectural dependency arrows to illustrate both data andcontrol flow.
addi x1, x2, 1bne x0, x3, foo # assume R[rs] != 0addi x4, x5, 1 # instruction is in branch delay slotaddi x6, x7, 1...
foo:add x8, x1, x4addi x9, x1, 1
52
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
6.3. Interrupts and Exceptions
Interrupts and exceptions alter the normal control flow of the program.They are caused by an external or internal event that needs to beprocessed by the system, and these events are usually unexpected orrare from the program’s point of view.
• Asynchronous Interrupts
– Input/output device needs to be serviced– Timer has expired– Power distruption or hardware failure
• Synchronous Exceptions
– Undefined opcode, privileged instruction– Arithmetic overflow, floating-point exception– Misaligned memory access for instruction fetch or data access– Memory protection violation– Virtual memory page faults– System calls (traps) to jump into the operating system kernel
Interrupts and Exception Semantics
• Interrupts are asynchronous with respect to the program, so themicroarchitecture can decide when to service the interrupt
• Exceptions are synchronous with respect to the program, so theymust be handled immediately
53
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
• To handle an interrupt or exception the hardware/software must:
– Stop program at current instruction (I), ensure previous insts finished– Save cause of interrupt or exception in privileged arch state– Save the PC of the instruction I in a special register (EPC)– Switch to privileged mode– Set the PC to the address of either the interrupt or the exception handler– Disable interrupts– Save the user architectural state– Check the type of interrupt or exception
– Handle the interrupt or exception
– Enable interrupts– Switch to user mode– Set the PC to EPC if I should be restarted– Potentially set PC to EPC+4 if we should skip I
Handling a misaligned data address and syscall exceptions
Static Instr Sequence
addi x1, x0, 0x2001lw x2, 0(x1)syscallopBopC...
exception_hander:opD # disable interruptsopE # save user registersopF # check exception typeopG # handle exceptionopH # enable interruptsaddi EPC, EPC, 4eret
Dynamic Instr Sequence
addi x1, x0, 0x2001lw x2, 0(x1) (excep)opDopEopFopGopHaddi EPC, EPC, 4eretsyscall (excep)opDopEopF
54
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
Interrupts and Exceptions in a RISC-V Pipelined Processor
F D X M W
Inst AddressExceptions
IllegalInstruction
ArithmeticOverflow
Data AddressExceptions
• How should we handle a single instruction which generatesmultiple exceptions in different stages as it goes down the pipeline?
– Exceptions in earlier pipeline stages override later exceptions for a giveninstruction
• How should we handle multiple instructions generating exceptionsin different stages at the same or different times?
– We always want the execution to appear as if we have completelyexecuted one instruction before going onto the next instruction
– So we want to process the exception corresponding to the earliestinstruction in program order first
– Hold exception flags in pipeline until commit point
– Commit point is after all exceptions could be generated but before anyarchitectural state has been updated
– To handle an exception at the commit point: update cause and EPC,squash all stages before the commit point, and set PC to exception handler
• How and where to handle external asynchronous interrupts?
– Inject asynchronous interrupts at the commit point
– Asynchronous interrupts will then naturally override exceptions causedby instructions earlier in the pipeline
55
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
Modifications to datapath/control to support exceptions
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_X
bypass_from_M
bypass_from_W
op2_byp_sel_D
pc_sel_F
CSig TableStall, Bypass, &Squash Logic
rf_wen1_W
rf_waddr1_W
xcept_XMxcept_DXxcept_FD
ExceptionHandling Logic
xcepthandler
Deriving the squash signals
osquash_j_D = (op_D == jal) || (op_D == jr)osquash_br_X = (op_X == bne) && !eq_Xosquash_xcept_M = exception_M
Control logic needs to redirect the front end of the pipeline just like for ajump or branch. Again, squashes take priority over stalls, and PC selectlogic must give priority to older instructions (i.e., priortize exceptions,over branches, over jumps)!
56
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
Pipeline diagram of exception handling
addi x1, x0, 0x2001lw x2, 0(x1) # assume causes misaligned address exceptionsyscall # causes a syscall exceptionopBopC...
exception_hander:opD # disable interruptsopE # save user registersopF # check exception typeopG # handle exceptionopH # enable interruptsaddi EPC, EPC, 4eret
57
7. Pipeline Hazards: Structural Hazards
7. Pipeline Hazards: Structural Hazards
Structural hazards occur when an instruction in the pipeline needs aresource being used by another instruction in the pipeline. The TinyRV1processor pipeline is specifically designed to avoid structural hazards.
Let’s introduce a structural hazard by allowing instructions that do notdo any real work in the M stage (i.e., non-memory instructions) toeffectively skip that stage. This would require adding an extra pathwhich “skips over” the pipeline reigster between the X and M stagesand connects directly to the writeback mux at the end of the M stage.For non-memory instructions we set wb_sel_M to choose the value fromthe end of the X stage, while for memory instructions we set wb_sel_Mto choose the value coming back from memory.
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_Xbypass_from_Mbypass_from_W
op2_byp_sel_D
pc_sel_F
CSig TableStall, Bypass, &Squash Logic
58
7. Pipeline Hazards: Structural Hazards 7.1. Expose in Instruction Set Architecture
Using pipeline diagrams to illustrate structural hazards
We use structural dependency arrows to illustrate structural hazards.
addi x1, x2, 1
addi x3, x4, 1
lw x5, 0(x6)
addi x7, x8, 1
Note that the key shared resources that are causing the structuralhazard are the pipeline registers at the end of the M stage. We cannotwrite these pipeline registers with the transaction that is in the X stageand also the transaction that is the M stage at the same time.
Approaches to resolving structural hazards
• Expose in Instruction Set Architecture: Expose structural hazards inISA forcing compiler to explicitly avoid scheduling instructions thatwould create hazards (i.e., software scheduling for correctness)
• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished using theshared resource; software scheduling can still be used to avoidstalling (i.e., software scheduling for performance)
• Hardware Duplication: Add more hardware so that each instructioncan access separate resources at the same time
7.1. Expose in Instruction Set ArchitectureInsert independent instructions or nops to delay non-memoryinstructions if they follow a LW or SW instruction.
59
7. Pipeline Hazards: Structural Hazards 7.2. Hardware Stalling
Pipeline diagram showing exposing structural hazards in the ISA
addi x1, x2, 1
addi x3, x4, 1
lw x5, 0(x6)
nop
addi x7, x8, 1
7.2. Hardware StallingHardware includes control logic that stalls a non-memory instruction ifit follows a LW or SW instruction.
Pipeline diagram showing hardware stalling for structural hazards
addi x1, x2, 1
addi x3, x4, 1
lw x5, 0(x6)
addi x7, x8, 1
Deriving the stall signal
ostall_wport_hazard_D = val_D && !mem_inst_D && val_X && mem_inst_X
where mem_inst is true for a LW or SW instruction and false otherwise.Stall far before the structural hazard actually occurs, because we knowexactly how instructions move down the pipeline. Also possible to usedynamic arbitration in the back-end of the pipeline.
60
7. Pipeline Hazards: Structural Hazards 7.3. Hardware Duplication
7.3. Hardware Duplication
Add more pipeline registers at the end of M stage and a second writeport so that non-memory and memory instructions can writeback to theregister file at the same time.
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
lop1_sel_D
+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
mem_result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
non_mem_result_MW
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_Xbypass_from_Mbypass_from_W
op2_byp_sel_D
pc_sel_F
CSig TableStall, Bypass, &Squash Logic
rf_wen1_W
rf_waddr1_W
Does allowing early writeback help performance in the first place?
addi x1, x2, 1
addi x3, x1, 1
addi x4, x3, 1
addi x5, x4, 1
addi x6, x5, 1
addi x7, x6, 1
61
8. Pipeline Hazards: WAW and WAR Name Hazards
8. Pipeline Hazards: WAW and WAR Name Hazards
WAW dependencies occur when an instruction overwrites a registerthan an earlier instruction has already written. WAR dependenciesoccur when an instruction writes a register than an earlier instructionneeds to read. We use architectural dependency arrows to illustrateWAW and WAR dependencies in assembly code sequences.
mul x1, x2, x3
addi x4, x1, 1
addi x1, x5, 1
WAW name hazards occur when an instruction in the pipeline writes aregister before an earlier instruction (in back of the pipeline) has had achance to write that same register.
WAR name hazards occur when an instruction in the pipeline writes aregister before an earlier instuction (in back of pipeline) has had achance to read that same register.
The TinyRV1 processor pipeline is specifically designed to avoid anyWAW or WAR name hazards. Instructions always write the registerfilein-order in the same stage, and instructions always read registers in thefront of the pipeline and write registers in the back of the pipeline.
Let’s introduce a WAW name hazard by using an iterative variablelatency multiplier, and allowing other instructions to continueexecuting while the multiplier is working.
62
8. Pipeline Hazards: WAW and WAR Name Hazards 8.1. Software Renaming
Using pipeline diagrams to illustrate WAW name hazards
We use microarchitectural dependency arrows to illustrate WAWhazards on pipeline diagrams.
mul x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1
Approaches to resolving structural hazards
• Software Renaming: Programmer or compiler changes the registernames to avoid creating name hazards
• Hardware Renaming: Hardware dynamically changes the registernames to avoid creating name hazards
• Hardware Stalling: Hardware includes control logic that freezeslater instructions until earlier instruction has finished either writingor reading the problematic register name
8.1. Software Renaming
As long as we have enough architectural registers, renaming registers insoftware is easy. WAW and WAR dependencies occur because we havea finite number of architectural registers.
mul x1, x2, x3addi x4, x1, 1addi x7, x5, 1
63
9. Summary of Processor Performance 8.2. Hardware Stalling
8.2. Hardware Stalling
Simplest approach is to add stall logic in the decode stage similar towhat the approach used to resolve other hazards.
mul x1, x2, x3
addi x4, x6, 1
addi x1, x5, 1
Deriving the stall signal
ostall_struct_hazard_D = val_D && (op_D == MUL) && !imul_rdy_D
ostall_waw_hazard_D =val_D && rf_wen_D && val_Z && rf_wen_Z
&& (rf_waddr_D == rf_waddr_Z) && (rf_waddr_Z != 0)
9. Summary of Processor Performance
TimeProgram
=Instructions
Program× Cycles
Instruction× Time
Cycles
Results for vector-vector add example
Microarchitecture Inst CPI Cycle Time Exec Time
Single-Cycle Processor 576 1.0 74 τ 43 kτ
FSM Processor 576 6.7 40 τ 154 kτ
Pipelined Processor 576
64
9. Summary of Processor Performance
Estimating cycle time for pipelined processor
pc_plus4
result_sel_X
ir[31:0]
jr
eq_X
wb_sel_M
pc_F +4
regfile(read)ir[24:20]
ir[19:15]
regfile(write)
rf_wen_W
rf_waddr_W
regfile(write)
Control SignalTable
imemreqaddr
imemrespdata
dmemreqaddr
dmemreqdata
dmemrespdata
imm_type
immgen
alu
mu
l
op1_sel_D+
alu_fn_Xop2_sel_D
pc_sel_F
jbtarg
ir_FD
result_XM
F Stage D Stage X Stage M Stage W Stage
cs_DX cs_XM cs_MW
sd_XM
result_XM
result_MW
pc_F pc_FD
val_DX val_XM val_MWval_FD
val_F
ControlLogic
ControlLogic
ControlLogic
op1_DX
sd_DX
btarg_DX
op2_DX
always pc_plus4
btargjtarg
reg_en_D
reg_en_F
CSig TableStall Logic
CSig TableStall & Bypass
Logic
op1_byp_sel_D
bypass_from_Xbypass_from_Mbypass_from_W
op2_byp_sel_D
pc_sel_F
CSig TableStall, Bypass, &Squash Logic
• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
65
9. Summary of Processor Performance
Estimating execution time
Using our first-order equation for processor performance, how long in τwill it take to execute the vvadd example assuming n is 64?
loop:lw x5, 0(x13)lw x6, 0(x14)add x7, x5, x6sw x7, 0(x12)addi x13, x12, 4addi x14, x14, 4addi x12, x12, 4addi x15, x15, -1bne x15, x0, loopjr x1
lw
lw
add
sw
addi
addi
addi
addi
bne
opA
opB
lw
66
9. Summary of Processor Performance
Using our first-order equation for processor performance, how long in τwill it take to execute the mystery program assuming n is 64 and thatwe find a match on the last element.
addi x5, x0, 0loop:lw x6, 0(x12)bne x6, x14, fooaddi x10, x5, 0jr x1
foo:addi x12, x12, 4addi x5, x5, 1bne x5, x13, loopaddi x10, x0, -1jr x1
67
10. Case Study: Transition from CISC to RISC
10. Case Study: Transition from CISC to RISC
• Microcoding thrived in the 1970’s
– ROMs significantly faster than DRAMs– For complex instruction sets, microcode was cheaper and simpler– New instructions supported without modifying datapath– Fixing bugs in controller is easier– ISA compatibility across models relatively straight-forward
— Maurice Wilkes, 1954
68
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
10.1. Example CISC: IBM 360/M30
M30 M40 M50 M65
Datapath width (bits) 8 16 32 64µinst width (bits) 50 52 85 87µcode size (1K µinsts) 4 4 2.75 2.75µstore technology CCROS TROS BCROS BCROSµstore cycle (ns) 750 625 500 200Memory cycle (ns) 1500 2500 2000 750Rental fee ($K/month) 4 7 15 35
TROS = transformer read-only storage (magnetic storage)BCROS = balanced capacitor read-only storage (capacitive storage)
CCROS = card capacitor read-only storage (metal punch cards, replace in field)
Only the fastest models (75,95) were hardwired
IBM 360/M30 microprogram for register-register logical OR
OR
Fetch first byteof operands
Instruction Fetch
Writeback Result
Prepare for Next Byte
69
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
IBM 360/M30 microprogram for register-register binary ADD
Analyzing Microcoded Machines
• John Cocke and group at IBM
– Working on a simple pipelined processor, 801, and advanced compilers
– Ported experimental PL8 compiler to IBM 370, and only used simpleregister-register and load/store instructions similar to 801
– Code ran faster than other existing compilers that used all 370instructions! (up to 6 MIPS, whereas 2 MIPS considered good before)
• Joel Emer and Douglas Clark at DEC
– Measured VAX-11/780 using external hardware– Found it was actually a 0.5 MIPS machine, not a 1 MIPS machine– 20% of VAX instrs = 60% of µcode, but only 0.2% of the dynamic execution
• VAX 8800, high-end VAX in 1984
– Control store: 16K×147b RAM, Unified Cache: 64K×8b RAM– 4.5×more microstore RAM than cache RAM!
70
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
From CISC to RISC
• Key changes in technology constraints
– Logic, RAM, ROM all implemented with MOS transistors– RAM ≈ same speed as ROM
• Use fast RAM to build fast instruction cache of user-visibleinstructions, not fixed hardware microfragments
– Change contents of fast instruction memory to fit what app needs
• Use simple ISA to enable hardwired pipelined implementation
– Most compiled code only used a few of CISC instructions– Simpler encoding allowed pipelined implementations– Load/Store Reg-Reg ISA as opposed to Mem-Mem ISA
• Further benefit with integration
– Early 1980’s→ fit 32-bit datapath, small caches on single chip– No chip crossing in common case allows faster operation
μPC
ROM for
μInst
Small
Decoder
User PC
RAM for
Instr Cache
"Larger"
Decoder
Vertical μCode
ControllerRISC
Controller
71
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
10.2. Example RISC: MIPS R2K
• MIPS R2K is one of the first popularpipelined RISC processors
• MIPS R2K implements the MIPS Iinstruction set
• MIPS = Microprocessor withoutInterlocked Pipeline Stages
MIPSI
MIPSII
MIPSIII
MIPSIV
MIPS64
MIPS32
MIPS16
• MIPS I used software scheduling to avoid some RAW hazards byincluding a single-instruction load-use delay slot
• MIPS I used software scheduling to avoid some control hazards byincluding a single-instruction branch delay slot
One-Instr Branch Delay Slot
addiu r1, r2, 1j fooaddiu r3, r4, 1 # BDS...
foo:addiu r5, r6, 1bne r7, r8, baraddiu r9, r10, 1 # BDS...
bar:
Present in all MIPS instructionsets; not possible to depricate andstill enable legacy code to executeon new microarchitectures
One-Instr Load-Use Delay Slot
lw r1, 0(r2)lw r3, 0(r4)addiu r2, r2, 4 # LDSaddu r5, r1, r3
Deprecated in MIPS II instructionset; legacy code can still executeon new microarchitectures, butcode using the MIPS II instructionset can rely in hardware stalling
72
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
MIPS R2K Microarchitecture
The pipelined datapath and controlwere located on a single die. Cachecontrol and memory management unitwere also integrated on-die, but theactual tag and data storage for thecache was located off-chip.
Control Unit Datapath
I$ Controller D$ Controller
I$ (4-32KB) D$ (4-32KB)
Used two-phase clocking to enable five pipeline stages to fit into fourclock cycles. This avoided the need for explicit bypassing from the Wstage to the end of the D stage. 1.2 The MIPS Five-Stage Pipeline 5
Instruction 1
Instruction 2
Instruction 3
RDIF ALU WB
RD ALU MEM WB
Time
Instruction sequence
IFfrom
I-cache
MEM from
D-cache
RD from
registerfile
WBto
registerfile
ALU
IF
MEM
FIGURE 1.2 MIPS five-stage pipeline.
fetch data from the cache; a cache miss is a relatively rare event and we can juststop the CPU when it happens (though cleverer CPUs find more useful thingsto do).
The MIPS architecture was planned with separate instruction and datacaches, so it can fetch an instruction and read or write a memory variable simul-taneously.
CISC architectures have caches too, but they’re most often afterthoughts,fitted in as a feature of the memory system. A RISC architecture makes moresense if you regard the caches as very much part of the CPU and tied firmly intothe pipeline.
1.2 The MIPS Five-Stage Pipeline
The MIPS architecture is made for pipelining, and Figure 1.2 is close to theearliest MIPS CPUs and typical of many. So long as the CPU runs from thecache, the execution of every MIPS instruction is divided into five phases, calledpipestages, with each pipestage taking a fixed amount of time. The fixed amountof time is usually a processor clock cycle (though some actions take only halfa clock, so the MIPS five-stage pipeline actually occupies only four clockcycles).
All instructions are rigidly defined so they can follow the same sequenceof pipestages, even where the instruction does nothing at some stage. The netresult is that, so long as it keeps hitting the cache, the CPU starts an instructionevery clock cycle.
Two-phase clocking enabled a single-cycle branch resolution latencysince register read, branch address generation, and branch comparisoncan fit in a single cycle.
1.5 MIPS Compared with CISC Architectures 27
IF
IF RD ALU MEM WB
WBMEMALURDIF
RD MEM WBBranch
instruction
Branchdelay
Branchtarget
Branchaddress
FIGURE 1.3 The pipeline and branch delays.
on a MIPS CPU without using some registers). For a program runningin any system that takes interrupts or traps, the values of these registersmay change at any time, so you’d better not use them.
1.5.4 Programmer-Visible Pipeline Effects
So far, this has all been what you might expect from a simplified CPU. However,making the instruction set pipeline friendly has some stranger effects as well,and to understand them we’re going to draw some pictures.
Delayed branches: The pipeline structure of the MIPS CPU (Figure 1.3)means that when a jump/branch instruction reaches the execute phaseand a new program counter is generated, the instruction after the jumpwill already have been started. Rather than discard this potentially usefulwork, the architecture dictates that the instruction after a branch mustalways be executed before the instruction at the target of the branch. Theinstruction position following any branch is called the branch delay slot.If nothing special was done by the hardware, the decision to branch ornot, together with the branch target address, would emerge at the endof the ALU pipestage—by which time, as Figure 1.3 shows, you’re toolate to present an address for an instruction in even the next-but-onepipeline slot.But branches are important enough to justify special treatment, and youcan see from Figure 1.3 that a special path is provided through the ALU tomake the branch address available half a clock cycle early. Together withthe odd half-clock-cycle shift of the instruction fetch stage, that meansthat the branch target can be fetched in time to become the next but one,
73
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
MIPS R2K VLSI Design
Process: 2 µm, two metal layersClock Frequency: 8–15 MHzSize: 110K transistors, 80 mm2
74
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
Ratio of
MIPS
to
VAX
-- H&P, Appendix J, from Bhandarkar and Clark, 1991
Performance Ratio
Instructions Executed Ratio
CPI Ratio
4.0
3.5
3.0
2.5
2.0
1.5
1.0
0.5
0.0
spice
matri
x
nasa7
fpppp
tom
catv
doduc
espre
sso
eqntott li
2x more instr
6x lower CPI
2-4x higher perf
CISC/RISC Convergence
by Linley Gwennap
Not to be left out in the move to thenext generation of RISC, MIPS Tech-nologies (MTI) unveiled the design ofthe R10000, also known as T5. As thespiritual successor to the R4000, thenew design will be the basis of high-end
MIPS processors for some time, at least until 1997. Byswapping superpipelining for an aggressively out-of-order superscalar design, the R10000 has the potentialto deliver high performance throughout that period.
The new processor uses deep queues decouple theinstruction fetch logic from the execution units. Instruc-tions that are ready to execute can jump ahead of thosewaiting for operands, increasing the utilization of the ex-ecution units. This technique, known as out-of-order ex-ecution, has been used in PowerPC processors for sometime (see 081402.PDF ), but the new MIPS design is themost aggressive implementation yet, allowing more in-structions to be queued than any of its competitors.
Taking advantage of its experience with the 200-MHz R4400, MTI was able to streamline the design andexpects it to run at a high clock rate. Speaking at theMicroprocessor Forum, MTI’s Chris Rowen said that thefirst R10000 processors will reach a speed of 200 MHz,50% faster than the PowerPC 620. At this speed, he ex-pects performance in excess of 300 SPECint92 and 600SPECfp92, challenging Digital’s 21164 for the perfor-mance lead. Due to schedule slips, however, the R10000has not yet taped out; we do not expect volume ship-ments until 4Q95, by which time Digital may enhancethe performance of its processor.
Speculative Execution Beyond BranchesThe front end of the processor is responsible for
maintaining a continuous flow of instructions into thequeues, despite problems caused by branches and cachemisses. As Figure 1 shows, the chip uses a two-way set-associative instruction cache of 32K. Like other highlysuperscalar designs, the R10000 predecodes instructionsas they are loaded into this cache, which holds four extra
bits per instruction. These bits reducethe time needed to determine the ap-propriate queue for each instruction.
The processor fetches four instruc-tions per cycle from the cache and de-codes them. If a branch is discovered, itis immediately predicted; if it is pre-dicted taken, the target address is sentto the instruction cache, redirecting thefetch stream. Because of the one cycleneeded to decode the branch, takenbranches create a “bubble” in the fetchstream; the deep queues, however, gen-erally prevent this bubble from delay-ing the execution pipeline.
The sequential instructions thatare loaded during this extra cycle arenot discarded but are saved in a “re-sume” cache. If the branch is later de-termined to have been mispredicted, thesequential instructions are reloadedfrom the resume cache, reducing themispredicted branch penalty by onecycle. The resume cache has four entriesof four instructions each, allowing spec-ulative execution beyond four branches.
The R10000 design uses the stan-dard two-bit Smith method to predict
M I C R O P R O C E S S O R R E P O R T
MIPS R10000 Uses Decoupled Architecture Vol. 8, No. 14, October 24, 1994 © 1994 MicroDesign Resources
MIPS R10000 Uses Decoupled ArchitectureHigh-Performance Core Will Drive MIPS High-End for Years
1 9 9 4
FORUMMICROPROCESSOR
Figure 1. The R10000 uses deep instruction queues to decouple the instruction fetch logicfrom the five function units.
Instruction Cache32K, two-way associative
PC
Unit
Predecode
Unit
ITLB8 entry
Decode, Map,
DispatchActive
ListMapTable
Main TLB64 entries
ALU1
Data Cache32K, two-way associative
FP
Adder
4 instr
4 instr
4 instr
MemoryQueue
16 entries
IntegerQueue16 entries
FPQueue16 entries
ALU2 FP
Mult÷!
FP÷"
virtualaddr
phys addr
64
DataSRAM
128
512K-16M
Avala
nche B
us (
64 b
it a
ddr/
data
)
L2 C
ache Inte
rface
128
Syste
m Inte
rface
TagSRAM
BHT512 x 2
Resume
Cache
Address
Adder
Integer Registers64 ! 64 bits
FP Registers64 ! 64 bits
MIPS R10K uses sophisticatedout-of-order engine; branch
delay slot not useful
– Gwennap, MPR, 1994
Intel Nehalem frontend breaks x86 CISCinto smaller RISC-like µops; µcode engine
handles rarely used complex instr
– Kanter, Real World Technologies, 2009
75
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
Microprogamming Today
• Microprogramming is far from extinct
• Played a crucial role in microprocessors of the 1980s(DEC VAX, Motorola 68K series, Intel 386/486)
• Microprogramming plays assisting role in many modern processors(AMD Phenom, Intel Nehalem, Intel Atom, IBM Z196)
– 761 Z196 instructions executed with hardwired control– 219 Z196 “complex” instructions always executed with microcode– 24 Z196 instructions conditionally executed with microcode
• Patchable microcode common for post-fabrication bug fixes (Intelprocessors load µcode patches at bootup)
76
top related