chapter 7eng.staff.alexu.edu.eg/~mmorsy/courses/... · chapter 7 digital design and computer...
TRANSCRIPT
Chapter 7 <1>
Digital Design and Computer Architecture, 2nd Edition
Chapter 7
David Money Harris and Sarah L. Harris
Chapter 7 <2>
Chapter 7 :: Topics
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Exceptions
• Advanced Microarchitecture
Chapter 7 <3>
• Microarchitecture: how to implement an architecture in hardware
• Processor:– Datapath: functional blocks
– Control: control signals
Physics
Devices
Analog
Circuits
Digital
Circuits
Logic
Micro-
architecture
Architecture
Operating
Systems
Application
Software
electrons
transistors
diodes
amplifiers
filters
AND gates
NOT gates
adders
memories
datapaths
controllers
instructions
registers
device drivers
programs
Introduction
Chapter 7 <4>
• Multiple implementations for a single architecture:– Single-cycle: Each instruction executes in a
single cycle
– Multicycle: Each instruction is broken into series of shorter steps
– Pipelined: Each instruction broken up into series of steps & multiple instructions execute at once
Microarchitecture
Chapter 7 <5>
• Program execution time
Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
• Definitions:– CPI: Cycles/instruction– clock period: seconds/cycle– IPC: instructions/cycle = IPC
• Challenge is to satisfy constraints of:– Cost– Power– Performance
Processor Performance
Chapter 7 <6>
• Consider subset of MIPS instructions:– R-type instructions: and, or, add, sub, slt
– Memory instructions: lw, sw
– Branch instructions: beq
MIPS Processor
Chapter 7 <7>
• Determines everything about a processor:– PC
– 32 registers
– Memory
Architectural State
Chapter 7 <8>
CLK
A RD
Instruction
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
A RD
Data
Memory
WD
WEPCPC'
CLK
32 3232 32
32
32
3232
32
32
5
5
5
MIPS State Elements
Chapter 7 <9>
• Datapath
• Control
Single-Cycle MIPS Processor
Chapter 7 <10>
STEP 1: Fetch instruction
CLK
A RD
Instruction
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
A RD
Data
Memory
WD
WEPCPC'
Instr
CLK
Single-Cycle Datapath: lw fetch
Chapter 7 <11>
STEP 2: Read source operands from RF
Instr
CLK
A RD
Instruction
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
A RD
Data
Memory
WD
WEPCPC'
25:21
CLK
Single-Cycle Datapath: lw Register Read
Chapter 7 <12>
STEP 3: Sign-extend the immediate
SignImm
CLK
A RD
Instruction
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
A RD
Data
Memory
WD
WEPCPC' Instr
25:21
15:0
CLK
Single-Cycle Datapath: lw Immediate
Chapter 7 <13>
STEP 4: Compute the memory address
SignImm
CLK
A RD
Instruction
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
A RD
Data
Memory
WD
WEPCPC' Instr
25:21
15:0
SrcB
ALUResult
SrcA Zero
CLK
ALUControl2:0
ALU
010
Single-Cycle Datapath: lw address
Chapter 7 <14>
• STEP 5: Read data from memory and write it back to register file
A1
A3
WD3
RD2
RD1WE3
A2
SignImm
CLK
A RD
Instruction
Memory
CLK
Sign Extend
Register
File
A RD
Data
Memory
WD
WEPCPC' Instr
25:21
15:0
SrcB20:16
ALUResult ReadData
SrcA
RegWrite
Zero
CLK
ALUControl2:0
ALU
0101
Single-Cycle Datapath: lw Memory Read
Chapter 7 <15>
STEP 6: Determine address of next instruction
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
A RD
Data
Memory
WD
WEPCPC' Instr
25:21
15:0
SrcB20:16
ALUResult ReadData
SrcA
PCPlus4
Result
RegWrite
Zero
CLK
ALUControl2:0
ALU
0101
Single-Cycle Datapath: lw PC Increment
Chapter 7 <16>
Write data in rt to memory
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
A RD
Data
Memory
WD
WEPCPC' Instr
25:21
20:16
15:0
SrcB20:16
ALUResult ReadData
WriteData
SrcA
PCPlus4
Result
MemWriteRegWrite
Zero
CLK
ALUControl2:0
ALU
10100
Single-Cycle Datapath: sw
Chapter 7 <17>
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCPC' Instr25:21
20:16
15:0
SrcB
20:16
15:11
ALUResult ReadData
WriteData
SrcA
PCPlus4WriteReg
4:0
Result
RegDst MemWrite MemtoRegALUSrcRegWrite
Zero
CLK
ALUControl2:0
ALU
0varies1 001
Single-Cycle Datapath: R-Type
Chapter 7 <18>
• Determine whether values in rs and rt are equal
• Calculate branch target address:
BTA = (sign-extended immediate << 2) + (PC+4)
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
RegDst Branch MemWrite MemtoRegALUSrcRegWrite
Zero
PCSrc
CLK
ALUControl2:0
ALU
01100 x0x 1
Single-Cycle Datapath: beq
Chapter 7 <19>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
Single-Cycle Processor
Chapter 7 <20>
RegDst
Branch
MemWrite
MemtoReg
ALUSrcOpcode5:0
Control
Unit
ALUControl2:0Funct5:0
Main
Decoder
ALUOp1:0
ALU
Decoder
RegWrite
Single-Cycle Control
Chapter 7 <21>
ALU
N N
N
3
A B
Y
F
F2:0 Function
000 A & B
001 A | B
010 A + B
011 not used
100 A & ~B
101 A | ~B
110 A - B
111 SLT
Review: ALU
Chapter 7 <22>
+
2 01
A B
Cout
Y
3
01
F2
F1:0
[N-1] S
NN
N
N
N NNN
N
2Z
ero
Exte
nd
Review: ALU
Chapter 7 <23>
ALUOp1:0 Meaning
00 Add
01 Subtract
10 Look at Funct
11 Not Used
ALUOp1:0 Funct ALUControl2:0
00 X 010 (Add)
X1 X 110 (Subtract)
1X 100000 (add) 010 (Add)
1X 100010 (sub) 110 (Subtract)
1X 100100 (and) 000 (And)
1X 100101 (or) 001 (Or)
1X 101010 (slt) 111 (SLT)
Control Unit: ALU Decoder
Chapter 7 <24>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0
R-type 000000
lw 100011
sw 101011
beq 000100
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
Control Unit Main Decoder
Chapter 7 <25>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0
R-type 000000 1 1 0 0 0 0 10
lw 100011 1 0 1 0 0 0 00
sw 101011 0 X 1 0 1 X 00
beq 000100 0 X 0 1 0 X 01
Control Unit: Main Decoder
Chapter 7 <26>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
0010
01
0
0
1
0
Single-Cycle Datapath: or
Chapter 7 <27>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
No change to datapath
Extended Functionality: addi
Chapter 7 <28>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0
R-type 000000 1 1 0 0 0 0 10
lw 100011 1 0 1 0 0 1 00
sw 101011 0 X 1 0 1 X 00
beq 000100 0 X 0 1 0 X 01
addi 001000
Control Unit: addi
Chapter 7 <29>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0
R-type 000000 1 1 0 0 0 0 10
lw 100011 1 0 1 0 0 1 00
sw 101011 0 X 1 0 1 X 00
beq 000100 0 X 0 1 0 X 01
addi 001000 1 0 1 0 0 0 00
Control Unit: addi
Chapter 7 <30>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1PC'
Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
0
1
25:0 <<2
27:0 31:28
PCJump
Jump
Extended Functionality: j
Chapter 7 <31>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0 Jump
R-type 000000 1 1 0 0 0 0 10 0
lw 100011 1 0 1 0 0 1 00 0
sw 101011 0 X 1 0 1 X 00 0
beq 000100 0 X 0 1 0 X 01 0
j 000100
Control Unit: Main Decoder
Chapter 7 <32>
Instruction Op5:0 RegWrite RegDst AluSrc Branch MemWrite MemtoReg ALUOp1:0 Jump
R-type 000000 1 1 0 0 0 0 10 0
lw 100011 1 0 1 0 0 1 00 0
sw 101011 0 X 1 0 1 X 00 0
beq 000100 0 X 0 1 0 X 01 0
j 000100 0 X X X 0 X XX 1
Control Unit: Main Decoder
Chapter 7 <33>
Program Execution Time
= (#instructions)(cycles/instruction)(seconds/cycle)
= # instructions x CPI x TC
Review: Processor Performance
Chapter 7 <34>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
1
0100
1
0
1
0 0
TC limited by critical path (lw)
Single-Cycle Performance
Chapter 7 <35>
• Single-cycle critical path:Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
• Typically, limiting paths are: – memory, ALU, register file
– Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
Single-Cycle Performance
Chapter 7 <36>
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = ?
Single-Cycle Performance Example
Chapter 7 <37>
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
= [30 + 2(250) + 150 + 25 + 200 + 20] ps
= 925 ps
Single-Cycle Performance Example
Chapter 7 <38>
Program with 100 billion instructions:
Execution Time = # instructions x CPI x TC
= (100 × 109)(1)(925 × 10-12 s)
= 92.5 seconds
Single-Cycle Performance Example
Chapter 7 <39>
• Single-cycle:+ simple
- cycle time limited by longest instruction (lw)
- 2 adders/ALUs & 2 memories
• Multicycle:+ higher clock speed
+ simpler instructions run faster
+ reuse expensive hardware on multiple cycles
- sequencing overhead paid many times
• Same design steps: datapath & control
Multicycle MIPS Processor
Chapter 7 <40>
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
PCPC'
WD
WE
CLK
EN
• Replace Instruction and Data memories with a single unified memory – more realistic
Multicycle State Elements
Chapter 7 <41>
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
PCPC' Instr
CLK
WD
WE
CLK
EN
IRWrite
STEP 1: Fetch instruction
Multicycle Datapath: Instruction Fetch
Chapter 7 <42>
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Register
File
PCPC' Instr25:21
CLK
WD
WE
CLK CLK
A
EN
IRWrite
Multicycle Datapath: lw Register Read
STEP 2a: Read source operands from RF
Chapter 7 <43>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
PCPC' Instr25:21
15:0
CLK
WD
WE
CLK CLK
A
EN
IRWrite
Multicycle Datapath: lw Immediate
STEP 2b: Sign-extend the immediate
Chapter 7 <44>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
PCPC' Instr25:21
15:0
SrcB
ALUResult
SrcA
ALUOut
CLK
ALUControl2:0
ALU
WD
WE
CLK CLK
A CLK
EN
IRWrite
Multicycle Datapath: lw Address
STEP 3: Compute the memory address
Chapter 7 <45>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
PCPC' Instr25:21
15:0
SrcB
ALUResult
SrcA
ALUOut
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A CLK
EN
IRWriteIorD
0
1
Multicycle Datapath: lw Memory Read
STEP 4: Read data from memory
Chapter 7 <46>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
PCPC' Instr25:21
15:0
SrcB20:16
ALUResult
SrcA
ALUOut
RegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A CLK
EN
IRWriteIorD
0
1
Multicycle Datapath: lw Write Register
STEP 5: Write data back to register file
Chapter 7 <47>
PCWrite
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1PCPC' Instr25:21
15:0
SrcB
20:16
ALUResult
SrcA
ALUOut
ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A
00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorD
0
1
Multicycle Datapath: Increment PC
STEP 6: Increment PC
Chapter 7 <48>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
ALUResult
SrcA
ALUOut
MemWrite ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A
00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorDPCWrite
B
Write data in rt to memory
Multicycle Datapath: sw
Chapter 7 <49>
0
1
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
ALUResult
SrcA
ALUOut
RegDstMemWrite MemtoReg ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorDPCWrite
• Read from rs and rt
• Write ALUResult to register file
• Write to rd (instead of rt)
Multicycle Datapath: R-Type
Chapter 7 <50>
SignImm
b
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorD PCWrite
PCEn
• rs == rt?
• BTA = (sign-extended immediate << 2) + (PC+4)
Multicycle Datapath: beq
Chapter 7 <51>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
Multicycle Processor
Chapter 7 <52>
ALUSrcA
PCSrc
Branch
ALUSrcB1:0
Opcode5:0
Control
Unit
ALUControl2:0
Funct5:0
Main
Controller
(FSM)
ALUOp1:0
ALU
Decoder
RegWrite
PCWrite
IorD
MemWrite
IRWrite
RegDst
MemtoReg
Register
Enables
Multiplexer
Selects
Multicycle Control
Chapter 7 <53>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
0
1 1
0
X
X
00
01
0100
1
0
Reset
S0: Fetch
Main Controller FSM: Fetch
Chapter 7 <54>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
0
1 1
0
X
X
00
01
0100
1
0
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
Reset
S0: Fetch
Main Controller FSM: Fetch
Chapter 7 <55>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
Reset
S0: Fetch S1: Decode
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Reg
Dst
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
X
0 0
0
X
X
0X
XX
XXXX
0
0
Main Controller FSM: Decode
Chapter 7 <56>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
Reset
S0: Fetch
S2: MemAdr
S1: Decode
Op = LW
or
Op = SW
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
X
0 0
0
X
X
01
10
010X
0
0
Main Controller FSM: Address
Chapter 7 <57>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
Reset
S0: Fetch
S2: MemAdr
S1: Decode
Op = LW
or
Op = SW
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
X
0 0
0
X
X
01
10
010X
0
0
Main Controller FSM: Address
Chapter 7 <58>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemRead
Op = LW
or
Op = SW
Op = LW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Main Controller FSM: lw
Chapter 7 <59>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1IorD = 1
MemWrite
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
Op = LW
or
Op = SW
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Main Controller FSM: sw
Chapter 7 <60>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
Op = LW
or
Op = SW
Op = R-type
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Main Controller FSM: R-Type
Chapter 7 <61>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Main Controller FSM: beq
Chapter 7 <62>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Multicycle Controller FSM
Chapter 7 <63>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
Op = ADDI
S9: ADDI
Execute
S10: ADDI
Writeback
Extended Functionality: addi
Chapter 7 <64>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 0
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
RegDst = 0
MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDI
Execute
S10: ADDI
Writeback
Main Controller FSM: addi
Chapter 7 <65>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorD PCWrite
PCEn
00
01
10
<<2
25:0 (jump)
31:28
27:0
PCJump
Extended Functionality: j
Chapter 7 <66>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 00
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
RegDst = 0
MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDI
Execute
S10: ADDI
Writeback
Op = J
S11: Jump
Main Controller FSM: j
Chapter 7 <67>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 00
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 0
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 1
RegWrite
S4: Mem
Writeback
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
RegDst = 0
MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDI
Execute
S10: ADDI
Writeback
PCSrc = 10
PCWrite
Op = J
S11: Jump
Main Controller FSM: j
Chapter 7 <68>
• Instructions take different number of cycles:
– 3 cycles: beq, j
– 4 cycles: R-Type, sw, addi
– 5 cycles: lw
• CPI is weighted average
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
Average CPI = (0.11 + 0.2)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
Multicycle Processor Performance
Chapter 7 <69>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
Re
gD
st
Branch
MemWrite
Mem
toR
eg
ALUSrcA
RegWriteOp
Funct
Control
Unit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
Multicycle critical path:
Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
Multicycle Processor Performance
Chapter 7 <70>
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = ?
Multicycle Performance Example
Chapter 7 <71>
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup
= tpcq_PC + tmux + tmem + tsetup
= [30 + 25 + 250 + 20] ps
= 325 ps
Multicycle Performance Example
Chapter 7 <72>
• For a program with 100 billion instructions executing on a multicycle MIPS processor
– CPI = 4.12
– Tc = 325 ps
Execution Time =
Chapter 7 :: Topics
Chapter 7 <73>
• For a program with 100 billion instructions executing on a multicycle MIPS processor
– CPI = 4.12
– Tc = 325 ps
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(4.12)(325 × 10-12)
= 133.9 seconds
• This is slower than the single-cycle processor (92.5 seconds). Why?
Chapter 7 <74>
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(4.12)(325 × 10-12)
= 133.9 seconds
This is slower than the single-cycle processor (92.5 seconds). Why?– Not all steps same length
– Sequencing overhead for each step (tpcq + tsetup= 50 ps)
Multicycle Performance Example
Chapter 7 <75>
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1PC' Instr
25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
Branch
MemWrite
MemtoReg
ALUSrc
RegWrite
Op
Funct
Control
Unit
Zero
PCSrc
CLK
ALUControl2:0
ALU
0
1
25:0 <<2
27:0 31:28
PCJump
Jump
Review: Single-Cycle Processor
Chapter 7 <76>
ImmExt
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
Zero
CLK
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
00
01
10
<<2
25:0 (Addr)
31:28
27:0
PCJump
5:0
31:26
Branch
MemWrite
ALUSrcA
RegWriteOp
Funct
Control
Unit
PCSrc
CLK
ALUControl2:0
ALUSrcB1:0IRWrite
IorD
PCWrite
PCEn
Reg
Dst
Mem
toR
eg
Review: Multicycle Processor
Chapter 7 <77>
• Temporal parallelism
• Divide single-cycle processor into 5 stages:– Fetch
– Decode
– Execute
– Memory
– Writeback
• Add pipeline registers between stages
Pipelined MIPS Processor
Chapter 7 <78>
Time (ps)Instr
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read / Write
Write
Reg1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000
Instr
1
2
3
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read / Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Single-Cycle
Pipelined
Single-Cycle vs. Pipelined
Chapter 7 <79>
Time (cycles)
lw $s2, 40($0) RF 40
$0
RF$s2
+ DM
RF $t2
$t1
RF$s3
+ DM
RF $s5
$s1
RF$s4
- DM
RF $t6
$t5
RF$s5
& DM
RF 20
$s1
RF$s6
+ DM
RF $t4
$t3
RF$s7
| DM
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
1 2 3 4 5 6 7 8 9 10
add
IM
IM
IM
IM
IM
IMlw
sub
and
sw
or
Pipelined Processor Abstraction
Chapter 7 <80>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
Single-Cycle & Pipelined Datapath
Chapter 7 <81>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
WriteRegW4:0
ALU
WriteRegE4:0
CLK
CLK
CLK
Fetch Decode Execute Memory Writeback
WriteReg must arrive at same time as Result
Corrected Pipelined Datapath
Chapter 7 <82>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
ZeroM
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
BranchE BranchM
RegDstE
ALUSrcE
WriteRegE4:0
• Same control unit as single-cycle processor• Control delayed to proper pipeline stage
Pipelined Processor Control
Chapter 7 <83>
• When an instruction depends on result from instruction that hasn’t completed
• Types:
– Data hazard: register value not yet written back to register file
– Control hazard: next instruction not decided yet (caused by branches)
Pipeline Hazards
Chapter 7 <84>
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Data Hazard
Chapter 7 <85>
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
Handling Data Hazards
Chapter 7 <86>
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
|DM
RF $s5
$s0
RF$t2
-DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
nop
nop
RF RFDMnopIM
RF RFDMnopIM
9 10
• Insert enough nops for result to be ready
• Or move independent useful instructions forward
Compile-Time Hazard Elimination
Chapter 7 <87>
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Data Forwarding
Chapter 7 <88>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
Data Forwarding
Chapter 7 <89>
• Forward to Execute stage from either:– Memory stage or
– Writeback stage
• Forwarding logic for ForwardAE:
if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM)
then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW)
then ForwardAE = 01
else ForwardAE = 00
Forwarding logic for ForwardBE same, but replace rsE with rtE
Data Forwarding
Chapter 7 <90>
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
Stalling
Chapter 7 <91>
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
9
RF $s1
$s0
IMor
Stall
Stalling
Chapter 7 <92>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Stalling Hardware
Chapter 7 <93>
lwstall =
((rsD==rtE) OR (rtD==rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Stalling Logic
Chapter 7 <94>
• beq:
– branch not determined until 4th stage of pipeline
– Instructions after branch fetched before branch occurs
– These instructions must be flushed if branch happens
• Branch misprediction penalty– number of instruction flushed when branch is taken
– May be reduced by determining branch earlier
Control Hazards
Chapter 7 <95>
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Control Hazards: Original Pipeline
Chapter 7 <96>
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1
RF- DM
RF $s1
$s0
RF& DM
RF $s0
$s4
RF| DM
RF $s5
$s0
RF- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
20
24
28
2C
30
...
...
9
Flush
these
instructions
64 slt $t3, $s2, $s3 RF $s3
$s2
RF$t3s
lt
DMIMslt
Control Hazards
Chapter 7 <97>
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
=
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdE
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Introduced another data hazard in Decode stage
Early Branch Resolution
Chapter 7 <98>
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1
RF- DM
RF $s1
$s0
RF& DMand $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
andIM
IMlw
20
24
28
2C
30
...
...
9
Flush
this
instruction
64 slt $t3, $s2, $s3 RF $s3
$s2
RF$t3s
lt
DMIMslt
Early Branch Resolution
Chapter 7 <99>
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
Forw
ard
AD
Forw
ard
BD
20:16RtE
RsD
RdD
RtD
RegW
rite
E
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Bra
nchD
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Handling Data & Control Hazards
Chapter 7 <100>
• Forwarding logic:ForwardAD = (rsD !=0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD !=0) AND (rtD == WriteRegM) AND RegWriteM
• Stalling logic:branchstall = BranchD AND RegWriteE AND
(WriteRegE == rsD OR WriteRegE == rtD)
OR
BranchD AND MemtoRegM AND
(WriteRegM == rsD OR WriteRegM == rtD)
StallF = StallD = FlushE = lwstall OR branchstall
Control Forwarding & Stalling Logic
Chapter 7 <101>
• Guess whether branch will be taken
– Backward branches are usually taken (loops)
– Consider history to improve guess
• Good prediction reduces fraction of branches requiring a flush
Branch Prediction
Chapter 7 <102>
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
• Suppose:
– 40% of loads used by next instruction
– 25% of branches mispredicted
– All jumps flush next instruction
• What is the average CPI?
Pipelined Performance Example
Chapter 7 <103>
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 11% branches
– 2% jumps
– 52% R-type
• Suppose:
– 40% of loads used by next instruction
– 25% of branches mispredicted
– All jumps flush next instruction
• What is the average CPI?
– Load/Branch CPI = 1 when no stalling, 2 when stalling
– CPIlw = 1(0.6) + 2(0.4) = 1.4
– CPIbeq = 1(0.75) + 2(0.25) = 1.25
Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1)
= 1.15
Pipelined Performance Example
Chapter 7 <104>
• Pipelined processor critical path:
Tc = max {
tpcq + tmem + tsetup
2(tRFread + tmux + teq + tAND + tmux + tsetup )
tpcq + tmux + tmux + tALU + tsetup
tpcq + tmemwrite + tsetup
2(tpcq + tmux + tRFwrite) }
Pipelined Performance
Chapter 7 <105>
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Equality comparator teq 40
AND gate tAND 15
Memory write Tmemwrite 220
Register file write tRFwrite 100 ps
Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )
= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps
Pipelined Performance Example
Chapter 7 <106>
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(1.15)(550 × 10-12)
= 63 seconds
Pipelined Performance Example
Chapter 7 <107>
Processor
Execution
Time
(seconds)
Speedup
(single-cycle as baseline)
Single-cycle 92.5 1
Multicycle 133 0.70
Pipelined 63 1.47
Processor Performance Comparison
Chapter 7 <108>
• Unscheduled function call to exception handler
• Caused by:
– Hardware, also called an interrupt, e.g. keyboard
– Software, also called traps, e.g. undefined instruction
• When exception occurs, the processor:
– Records cause of exception (Cause register)
– Jumps to exception handler (0x80000180)
– Returns to program (EPC register)
Review: Exceptions
Chapter 7 <109>
Example Exception
Chapter 7 <110>
• Not part of register file– Cause
• Records cause of exception
• Coprocessor 0 register 13
– EPC (Exception PC)
• Records PC where exception occurred
• Coprocessor 0 register 14
• Move from Coprocessor 0– mfc0 $t0, Cause
– Moves contents of Cause into $t0
00000 $t0 (8) Cause (13) 00000000000
mfc0
31:26 25:21 20:16 15:11 10:0
010000
Exception Registers
Chapter 7 <111>
Exception Cause
Hardware Interrupt 0x00000000
System Call 0x00000020
Breakpoint / Divide by 0 0x00000024
Undefined Instruction 0x00000028
Arithmetic Overflow 0x00000030
Extend multicycle MIPS processor to handle last two types of exceptions
Exception Causes
Chapter 7 <112>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorD PCWrite
PCEn
<<2
25:0 (jump)
31:28
27:0
PCJump
00
01
10
11
0x8000 0180
Overflow
CLK
EN
EPCWrite
CLK
EN
CauseWrite
0
1
IntCause
0x30
0x28EPC
Cause
Exception Hardware: EPC & Cause
Chapter 7 <113>
IorD = 0
AluSrcA = 0
ALUSrcB = 01
ALUOp = 00
PCSrc = 00
IRWrite
PCWrite
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
IorD = 1
RegDst = 1
MemtoReg = 00
RegWrite
IorD = 1
MemWrite
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALU
Writeback
S8: Branch
Op = LW
or
Op = SW
Op = R-type
Op = BEQ
Op = LW
Op = SW
RegDst = 0
MemtoReg = 01
RegWrite
S4: Mem
Writeback
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
RegDst = 0
MemtoReg = 00
RegWrite
Op = ADDI
S9: ADDI
Execute
S10: ADDI
Writeback
PCSrc = 10
PCWrite
Op = J
S11: Jump
Overflow Overflow
S13:
Overflow
PCSrc = 11
PCWrite
IntCause = 0
CauseWrite
EPCWrite
Op = others
PCSrc = 11
PCWrite
IntCause = 1
CauseWrite
EPCWrite
S12: Undefined
RegDst = 0
Memtoreg = 10
RegWrite
Op = mfc0
S14: MFC0
Control FSM with Exceptions
Chapter 7 <114>
SignImm
CLK
ARD
Instr / Data
Memory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg1:0
ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0001
Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
ALUSrcB1:0
IRWriteIorD PCWrite
PCEn
<<2
25:0 (jump)
31:28
27:0
PCJump
00
01
10
11
0x8000 0180
CLK
EN
EPCWrite
CLK
EN
CauseWrite
0
1
IntCause
0x30
0x28EPC
Cause
Overflow
...
01101
01110
...15:11
10
C0
Exception Hardware: mfc0
Chapter 7 <115>
• Deep Pipelining
• Branch Prediction
• Superscalar Processors
• Out of Order Processors
• Register Renaming
• SIMD
• Multithreading
• Multiprocessors
Advanced Microarchitecture
Chapter 7 <116>
• 10-20 stages typical
• Number of stages limited by:
– Pipeline hazards
– Sequencing overhead
– Power
– Cost
Deep Pipelining
Chapter 7 <117>
• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:
– Check direction of branch (forward or backward)– If backward, predict taken– Else, predict not taken
• Dynamic branch prediction:– Keep history of last (several hundred) branches in
branch target buffer, record:• Branch destination• Whether branch was taken
Branch Prediction
Chapter 7 <118>
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
Branch Prediction Example
Chapter 7 <119>
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
1-Bit Branch Predictor
Chapter 7 <120>
Only mispredicts last branch of loop
strongly
taken
predict
taken
weakly
taken
predict
taken
weakly
not taken
predict
not taken
strongly
not taken
predict
not takentaken taken taken
takentakentaken
taken
taken
2-Bit Branch Predictor
Chapter 7 <121>
• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
CLK CLK CLK CLK
ARD A1
A2RD1A3
WD3WD6
A4A5A6
RD4
RD2RD5
Instruction
Memory
Register
File Data
Memory
ALU
s
PC
CLK
A1A2
WD1WD2
RD1RD2
Superscalar
Chapter 7 <122>
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 2
or $t3, $s5, $s6
sw $s7, 80($t3)
Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lw
add
lw $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or $t4, $s1, $s5
sw $s5, 80($s0)
$t1$s2
$s1
+
RF$s3
$s1
RF
$t2-
DMIM
sub
and $t3$s4
$s3
&
RF$s5
$s1
RF
$t4|
DMIM
or
sw80
$s0
+ $s5
Superscalar Example
Chapter 7 <123>
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/5 = 1.17
or $t3, $s5, $s6
sw $s7, 80($t3)
Stall
Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
sw $s7, 80($t3)
RF$s1
$t0add
RF$s1
$t0
RF
$t1+
DM
RF$t0
$s4
RF
$t2&
DMIM
and
IMor
and
sub
|$s6
$s5$t3
RF80
$t3
RF
+
DM
sw
IM
$s7
9
$s3
$s2
$s3
$s2
-$t0
oror $t3, $s5, $s6
IM
Superscalar with Dependencies
Chapter 7 <124>
• Looks ahead across multiple instructions
• Issues as many instructions as possible at once
• Issues instructions out of order (as long as no dependencies)
• Dependencies:
– RAW (read after write): one instruction writes, later instruction reads a register
– WAR (write after read): one instruction reads, later instruction writes a register
– WAW (write after write): one instruction writes, later instruction writes a register
Out of Order Processor
Chapter 7 <125>
• Instruction level parallelism (ILP): number of instruction that can be issued simultaneously (average < 3)
• Scoreboard: table that keeps track of:
– Instructions waiting to issue
–Available functional units
–Dependencies
Out of Order Processor
Chapter 7 <126>
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/4 = 1.5
or $t3, $s5, $s6
sw $s7, 80($t3)Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
sw $s7, 80($t3)
or|$s6
$s5$t3
RF80
$t3
RF
+
DM
sw $s7
or $t3, $s5, $s6
IM
RF$s1
$t0
RF
$t1+
DMIM
add
sub-$s3
$s2$t0
two cycle latency
between load and
use of $t0
RAW
WAR
RAW
RF$t0
$s4
RF
&
DM
and
IM
$t2
RAW
Out of Order Processor Example
Chapter 7 <127>
Time (cycles)
1 2 3 4 5 6 7
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $r0, $s2, $s3
and $t2, $s4, $r0
sw $s7, 80($t3)
sub-$s3
$s2$r0
RF$r0
$s4
RF
&
DM
and
$s7
or $t3, $s5, $s6
IM
RF$s1
$t0
RF
$t1+
DMIM
add
sw+80
$t3
RAW
$s6
$s5
|or
2-cycle RAW
RAW
$t2
$t3
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3 Ideal IPC: 2
and $t2, $s4, $t0 Actual IPC: 6/3 = 2
or $t3, $s5, $s6
sw $s7, 80($t3)
Register Renaming
Chapter 7 <128>
• Single Instruction Multiple Data (SIMD)– Single instruction acts on multiple pieces of data at once
– Common application: graphics
– Perform short arithmetic operations (also called packed arithmetic)
• For example, add four 8-bit elements
padd8 $s2, $s0, $s1
a0
0781516232432 Bit position
$s0a1
a2
a3
b0
$s1b1
b2
b3
a0 + b
0$s2a
1 + b
1a
2 + b
2a
3 + b
3
+
SIMD
Chapter 7 <129>
• Multithreading
– Wordprocessor: thread for typing, spell checking, printing
• Multiprocessors
– Multiple processors (cores) on a single chip
Advanced Architecture Techniques
Chapter 7 <130>
• Process: program running on a computer
– Multiple processes can run at once: e.g., surfing Web, playing music, writing a paper
• Thread: part of a program
– Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing
Threading: Definitions
Chapter 7 <131>
• One thread runs at once
• When one thread stalls (for example, waiting for memory):– Architectural state of that thread stored
– Architectural state of waiting thread loaded into processor and it runs
– Called context switching
• Appears to user like all threads running simultaneously
Threads in Conventional Processor
Chapter 7 <132>
• Multiple copies of architectural state
• Multiple threads active at once:– When one thread stalls, another runs immediately
– If one thread can’t keep all execution units busy, another thread can use them
• Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput
Intel calls this “hyperthreading”
Multithreading
Chapter 7 <133>
• Multiple processors (cores) with a method of communication between them
• Types:– Homogeneous: multiple cores with shared memory
– Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
– Clusters: each core has own memory system
Multiprocessors
Chapter 7 <134>
• Patterson & Hennessy’s: Computer Architecture: A Quantitative Approach
• Conferences:– www.cs.wisc.edu/~arch/www/
– ISCA (International Symposium on Computer Architecture)
– HPCA (International Symposium on High Performance Computer Architecture)
Other Resources