![Page 1: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/1.jpg)
CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 1
CS104 Computer Organization and Design
Datapaths
![Page 2: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/2.jpg)
Admin
• Homework • Homework 4 out tonight • Due Monday March 26th • Download/check your submissions
• Reading: • Chapter 4 • (Maybe review 1.4)
• Midterm 2 • March 28
![Page 4: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/4.jpg)
What did we do last week?
• Who can remind us what we did last week? • Ski • Go to the beach • Sleep in • Read a book • …
• Ok, but seriously?
![Page 5: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/5.jpg)
When last I saw you all..
• Last time I was here (Feb 27/29) • Learned basics of logic design
• Gates (And, Or, Nor, …) • Put gates together to make
• Muxes • Adders • Latches • Flip-flops • …
![Page 6: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/6.jpg)
While I was at HPCA..
• Prof. Lebeck started teaching you all about datapaths • Putting logic together to execute instructions • Started on single-cycle datapath
• We’ll review/continue with single cycle • Then jump into more things!
![Page 7: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/7.jpg)
Datapath for MIPS ISA
• Consider only the following instructions:
  add $1,$2,$3
  addi $1,$3,2
  lw $1,4($3)
  sw $1,4($3)
  beq $1,$2,PC_relative_target
  j absolute_target
• Why only these? • Most other instructions are the same from the datapath viewpoint • The ones that aren’t are left for you to figure out
![Page 8: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/8.jpg)
Start With Fetch
• PC and instruction memory • A +4 incrementer computes default next instruction PC
[Figure: PC, Insn Mem, and +4 incrementer]
![Page 9: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/9.jpg)
First Instruction: add
• Add register file and ALU
[Figure: datapath with PC, Insn Mem, +4 incrementer, and Register File (ports s1, s2, d)]
![Page 10: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/10.jpg)
Second Instruction: addi
• Destination register can now be either Rd or Rt • Add sign extension unit and mux into second ALU input
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (ports s1, s2, d), and sign extender (SX)]
![Page 11: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/11.jpg)
Third Instruction: lw
• Add data memory, address is ALU output • Add register write data mux to select memory output or ALU output
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (ports s1, s2, d), sign extender (SX), and Data Mem (ports a, d)]
![Page 12: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/12.jpg)
Fourth Instruction: sw
• Add path from second input register to data memory data input
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (ports s1, s2, d), sign extender (SX), and Data Mem (ports a, d)]
![Page 13: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/13.jpg)
Fifth Instruction: beq
• Add left shift unit and adder to compute PC-relative branch target • Add PC input mux to select PC+4 or branch target
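[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), a << 2 shifter, and a signal labeled z]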
![Page 14: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/14.jpg)
Sixth Instruction: j
• Add shifter to compute left shift of 26-bit immediate • Add additional PC input mux for jump target
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), and two << 2 shifters]
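The two next-PC computations from the beq and j slides can be sketched in software. This is a minimal sketch, not the hardware itself: the function names are illustrative, and the jump target fills its upper 4 bits from PC+4 as in the standard MIPS definition, a detail the slide text does not spell out.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits as a two's-complement value (the SX unit)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc: int, imm16: int) -> int:
    """beq: (PC + 4) + (sign-extended immediate << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc: int, imm26: int) -> int:
    """j: upper 4 bits of PC + 4 joined with the 26-bit immediate << 2."""
    return ((pc + 4) & 0xF0000000) | ((imm26 & 0x3FFFFFF) << 2)


def next_pc(pc: int, br_taken: bool, jp: bool, imm16: int, imm26: int) -> int:
    """The PC input muxes: jump target, branch target, or default PC + 4."""
    if jp:
        return jump_target(pc, imm26)
    if br_taken:
        return branch_target(pc, imm16)
    return (pc + 4) & 0xFFFFFFFF
```

A negative immediate branches backwards: `branch_target(0x1000, 0xFFFF)` lands back on `0x1000`, since the offset of -1 word cancels the +4.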
![Page 15: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/15.jpg)
“Continuous Read” Datapath Timing
• Works because writes (PC, RegFile, DMem) are independent • And because no read logically follows any write
[Figure: single-cycle datapath with timing labels: Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]
![Page 16: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/16.jpg)
What Is Control?
• 8 signals control the flow of data through this datapath • MUX selectors, or register/memory write enable signals • A real datapath has 300-500 control signals
[Figure: single-cycle datapath annotated with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
![Page 17: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/17.jpg)
Example: Control for add
[Figure: single-cycle datapath with control signals set for add: BR=0, JP=0, Rwd=0, DMwe=0, ALUop=0, ALUinB=0, Rdst=0, Rwe=1]
![Page 18: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/18.jpg)
Example: Control for sw
• Difference between sw and add is 5 signals • 3 if you don’t count the X (don’t care) signals
[Figure: single-cycle datapath with control signals set for sw: Rwe=0, ALUinB=1, DMwe=1, JP=0, ALUop=0, BR=0, Rwd=X, Rdst=X]
![Page 19: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/19.jpg)
Example: Control for beq
• Difference between sw and beq is only 4 signals
[Figure: single-cycle datapath with control signals set for beq: Rwe=0, ALUinB=0, DMwe=0, JP=0, ALUop=1, BR=1, Rwd=X, Rdst=X]
![Page 20: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/20.jpg)
You all figure LW
• How would these control signals be set for LW?
[Figure: single-cycle datapath with control signals left blank: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
![Page 21: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/21.jpg)
Example: Control for LW
[Figure: single-cycle datapath with control signals set for lw: BR=0, JP=0, Rwd=1, DMwe=0, ALUop=0, ALUinB=1, Rdst=1, Rwe=1]
![Page 22: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/22.jpg)
How Is Control Implemented?
[Figure: single-cycle datapath with a box labeled "Control?" driving Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
![Page 23: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/23.jpg)
Implementing Control
• Each insn has a unique set of control signals • Most are a function of the opcode • Some may be encoded in the instruction itself
• E.g., the ALUop signal is some portion of the MIPS Func field
+ Simplifies controller implementation
– Requires careful ISA design
![Page 24: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/24.jpg)
Control Implementation: ROM
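• ROM (read only memory): think rows of bits • Bits in data words are control signals • Lines indexed by opcode • Example: ROM control for 6-insn MIPS datapath • X is “don’t care”

| opcode | BR | JP | ALUinB | ALUop | DMwe | Rwe | Rdst | Rwd |
|--------|----|----|--------|-------|------|-----|------|-----|
| add    | 0  | 0  | 0      | 0     | 0    | 1   | 0    | 0   |
| addi   | 0  | 0  | 1      | 0     | 0    | 1   | 1    | 0   |
| lw     | 0  | 0  | 1      | 0     | 0    | 1   | 1    | 1   |
| sw     | 0  | 0  | 1      | 0     | 1    | 0   | X    | X   |
| beq    | 1  | 0  | 0      | 1     | 0    | 0   | X    | X   |
| j      | 0  | 1  | 0      | 0     | 0    | 0   | X    | X   |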
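The ROM above can be modeled as a lookup table keyed by opcode. This is a minimal sketch: the names `CONTROL_ROM`, `control_for`, and `differing_signals` are illustrative, and `None` stands in for the X (don't care) entries.

```python
SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")

# One row per opcode, copied from the ROM table; None means "don't care" (X).
CONTROL_ROM = {
    "add":  (0, 0, 0, 0, 0, 1, 0, 0),
    "addi": (0, 0, 1, 0, 0, 1, 1, 0),
    "lw":   (0, 0, 1, 0, 0, 1, 1, 1),
    "sw":   (0, 0, 1, 0, 1, 0, None, None),
    "beq":  (1, 0, 0, 1, 0, 0, None, None),
    "j":    (0, 1, 0, 0, 0, 0, None, None),
}


def control_for(opcode: str) -> dict:
    """Read one ROM row and name each bit."""
    return dict(zip(SIGNALS, CONTROL_ROM[opcode]))


def differing_signals(op_a: str, op_b: str, count_dont_cares: bool = True) -> list:
    """List the signals whose values differ between two instructions."""
    a, b = control_for(op_a), control_for(op_b)
    diffs = []
    for name in SIGNALS:
        if a[name] == b[name]:
            continue
        if not count_dont_cares and (a[name] is None or b[name] is None):
            continue
        diffs.append(name)
    return diffs
```

With this table, `differing_signals("sw", "add")` reports 5 signals (3 when don't cares are excluded) and `differing_signals("sw", "beq")` reports 4, matching the counts on the control example slides.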
![Page 25: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/25.jpg)
Control Implementation: Random Logic
• Real machines have 100+ insns and 300+ control signals • 30,000+ control bits (~4KB) – Not huge, but hard to make faster than datapath (important!)
• Alternative: random logic (random = ‘non-repeating’) • Exploits the observation: many signals have few 1s or few 0s • Example: random logic control for 6-insn MIPS datapath
[Figure: random-logic control. The opcode is decoded into one line per insn (add, addi, lw, sw, beq, j); those lines drive BR, JP, ALUinB, DMwe, Rwd, Rdst, ALUop, Rwe]
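Random logic can be sketched as one small Boolean expression per control signal, each an OR of the decoded opcode lines. The wiring below is an assumption derived from the ROM table rather than read off the original figure, and the don't-care entries are resolved to 0.

```python
def random_logic_control(opcode: str) -> dict:
    """Compute each control signal as an OR of one-hot opcode lines."""
    # One-hot decode: exactly one line is True for a known opcode.
    is_op = {name: opcode == name for name in ("add", "addi", "lw", "sw", "beq", "j")}
    if not any(is_op.values()):
        raise ValueError(f"unsupported opcode: {opcode}")
    return {
        "BR": int(is_op["beq"]),
        "JP": int(is_op["j"]),
        "ALUinB": int(is_op["addi"] or is_op["lw"] or is_op["sw"]),
        "ALUop": int(is_op["beq"]),
        "DMwe": int(is_op["sw"]),
        "Rwe": int(is_op["add"] or is_op["addi"] or is_op["lw"]),
        "Rdst": int(is_op["addi"] or is_op["lw"]),  # X for sw/beq/j, chosen as 0
        "Rwd": int(is_op["lw"]),                    # X for sw/beq/j, chosen as 0
    }
```

Most signals need only one or two opcode lines, which is the "few 1s" observation the slide relies on.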
![Page 26: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/26.jpg)
Datapath and Control Timing
[Figure: single-cycle datapath with a "Control ROM/random logic" block and timing labels: Read IMem, Read Registers (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]
![Page 27: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/27.jpg)
Single-Cycle Datapath Performance
• Goes against make common case fast (MCCF) principle + Low Cycles Per Instruction (CPI): 1 – Long clock period: to accommodate slowest insn
[Figure: single-cycle datapath with Control ROM/random logic]
![Page 28: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/28.jpg)
Interlude: Performance
• Previous slide alludes to something new: Performance • Don’t just want it to work… • But want it to go fast!
• Three components to performance: Number of instructions x Cycles per instruction (CPI) x Clock period (1 / Clock frequency)
$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}$$
![Page 29: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/29.jpg)
Interlude: Performance
• Of the three components, the number of instructions is the compiler’s job
• Insns/Program: determined by compiler + ISA • Generally assume a fixed program when doing micro-architecture
![Page 30: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/30.jpg)
Micro-architectural factors
• Micro-architecture: • The details of how the ISA is implemented • Affects CPI and Clock frequency
• Often will look at a fixed program, and consider MIPS • Million Instructions Per Second • MIPS = IPC * Frequency (in MHz) • IPC = Instructions Per Cycle (1 / CPI) • Gives “Bigger is better” number
$$\underbrace{\frac{\text{Instructions}}{\text{Cycle}}}_{\text{IPC}} \times \underbrace{\frac{\text{Cycles}}{\text{Second}}}_{\text{Frequency}} = \underbrace{\frac{\text{Instructions}}{\text{Second}}}_{\text{Throughput}}$$
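The throughput formula is easy to check numerically. The sketch below uses illustrative helper names, and the 50 ns clock period is only an example value.

```python
def frequency_mhz(clock_period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / clock_period_ns


def mips_rating(ipc: float, freq_mhz: float) -> float:
    """Millions of instructions per second = IPC * frequency in MHz."""
    return ipc * freq_mhz


# Example: a design with CPI = 1 (IPC = 1) and a 50 ns clock period.
ipc = 1 / 1.0
rating = mips_rating(ipc, frequency_mhz(50))  # 20 MHz clock, so 20 MIPS
```

Here MIPS is the throughput metric, not the MIPS ISA used in the datapath slides.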
![Page 31: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/31.jpg)
“Best” IPC
• For now, best we can do: IPC = 1 (CPI = 1) • Do 1 instruction every cycle
• Later: • Real processors can do multiple instructions at once! • Potentially: IPC > 1! • Best possible IPC depends on design
![Page 32: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/32.jpg)
Performance vs ….
• 1990s: Performance at all cost • Actually more “clock frequency” at all cost…
• Now: Care about other things • Energy (electric bill, battery life) • Power (cooling, also affects energy) • Area (chip cost) • Reliability (tolerance of transient faults: e.g., charged-particle strikes) • …
• Important metric these days “Performance / Watt” • Throughput divided by power consumption • Why?
![Page 33: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/33.jpg)
Performance Modeling and Analysis
• Speaking of performance • Making a processor takes time (years) and money (millions) • Want to know it will perform well before you finish
• If it’s wrong, doing it all over is painful… • Performance can be simulated in software
• Estimate what IPC will be • Guide design
• This is my other job by the way…
![Page 35: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/35.jpg)
Alternative: Multi-Cycle Datapath
• Multi-cycle datapath: attacks high clock period • Cut datapath into multiple stages (5 here), isolate using FFs • FSM control “walks” insns thru stages (by staging control signals) + Insns can bypass stages and exit early
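[Figure: multi-cycle datapath: PC, Insn Mem, Register File (ports s1, s2, d), SX, Data Mem (ports a, d), +4, << 2, with intermediate registers IR, A, B, O, D; control points are tagged with stage labels s3, s4, s5]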
![Page 36: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/36.jpg)
Finite State Machine (FSM)
• FSM = States + Transitions • Next state: function of current state + inputs • Outputs: function of current state + inputs
• Canonical Example: Combination Lock • Must enter 3 8 4 to unlock
• P.S. Useful in software too
![Page 37: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/37.jpg)
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• Initial State: no valid piece of combo seen
[Figure: FSM with a single state, Start]
![Page 38: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/38.jpg)
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• Input of 3: transition to new state • Any other input: stay in same state
[Figure: Start goes to state 1 on input 3; Start loops back to itself on 0-2, 4-9]
![Page 39: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/39.jpg)
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• State 1: • Input = 8? Goto state 2
[Figure: adds state 2. Start goes to state 1 on 3 and loops on 0-2, 4-9. From state 1: 8 goes to state 2, 3 stays in state 1, and 0-2, 4-7, 9 return to Start]
![Page 40: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/40.jpg)
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• State 2: • Input = 4? Goto state 3
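[Figure: adds state 3. From state 2: 4 goes to state 3, 3 goes to state 1, and 0-2, 5-9 return to Start. Earlier transitions unchanged: Start goes to state 1 on 3; state 1 goes to state 2 on 8, stays on 3, and returns to Start on 0-2, 4-7, 9]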
![Page 41: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/41.jpg)
Finite State Machines: Example
• Combination Lock Example: • Need to enter 3 8 4 to unlock
• State 3: Unlock!
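[Figure: complete combination-lock FSM. Start: 3 goes to state 1; 0-2, 4-9 stay. State 1: 8 goes to state 2; 3 stays; 0-2, 4-7, 9 return to Start. State 2: 4 goes to state 3; 3 goes to state 1; 0-2, 5-9 return to Start. State 3: unlocked]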
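Since FSMs are useful in software too, the combination lock can be sketched as a transition function. This is a minimal sketch: the state names are illustrative, and sending state 3 back to Start on the next input is an assumption that agrees with the ROM entry 0x3_ shown on a later slide.

```python
START, SAW_3, SAW_38, UNLOCKED = 0, 1, 2, 3


def next_state(state: int, digit: int) -> int:
    """Next-state function for the 3 8 4 combination lock."""
    if state != UNLOCKED and digit == 3:
        return SAW_3        # a 3 always begins (or restarts) the combination
    if state == SAW_3 and digit == 8:
        return SAW_38
    if state == SAW_38 and digit == 4:
        return UNLOCKED
    return START            # any other input discards the progress so far


def unlocks(digits) -> bool:
    """Return True if the digit sequence ever reaches the unlocked state."""
    state = START
    for digit in digits:
        state = next_state(state, digit)
        if state == UNLOCKED:
            return True
    return False
```

The sequence 3 8 3 8 4 still unlocks, because the second 3 moves from state 2 back to state 1 rather than all the way to Start.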
![Page 42: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/42.jpg)
FSM in Hardware
• Flip-flop(s) to hold the state • Combinational logic to determine next state/output • (Assumes FF enable on input_valid)
![Page 43: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/43.jpg)
FSM Hardware Example
![Page 44: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/44.jpg)
FSM Hardware Example
![Page 45: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/45.jpg)
FSM Hardware Example
![Page 46: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/46.jpg)
FSM Hardware Example
![Page 47: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/47.jpg)
FSM Hardware Example
![Page 48: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/48.jpg)
FSM Hardware Example
![Page 49: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/49.jpg)
FSM Hardware Example
![Page 50: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/50.jpg)
FSM Implementation: ROM
• Just saw: FSM implemented with sum-of-products • Remind us what that is?
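• Can also be implemented with a ROM • K-bit input, N-bit state, M-bit output • ROM has 2^(N+K) entries, indexed by the N state bits and K input bits
[Figure: a 2^(N+K)-entry ROM addressed by the K input bits and the N state bits held in a register; it produces the M output bits and the N next-state bits that feed back into the register]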
![Page 51: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/51.jpg)
FSM ROM Implementation Example
• Combination Lock (3 8 4) Example • 4-bit input • 2-bit state • 64-entry ROM (indexed with S1 S0 I3 I2 I1 I0)
• Each entry needs 3 bits (S1 S0 U) • 2 for next state • 1 for unlock signal
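• Example entries in ROM (index = S1 S0 I3 I2 I1 I0, data = S1 S0 U)
  • 0x00 = 000 (state 0, input 0: stay in state 0)
  • 0x03 = 010 (state 0, input 3: go to state 1)
  • 0x18 = 100 (state 1, input 8: go to state 2)
  • 0x13 = 010 (state 1, input 3: stay in state 1)
  • 0x3_ = 001 (state 3, any input: unlock asserted, next state 0)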
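The full 64-entry ROM can be generated from the state diagram and checked against the example entries. This is a sketch under two assumptions: the function name is illustrative, and input codes 10 through 15 (which are not digits) are treated like any other wrong input.

```python
def build_lock_rom() -> list:
    """Build the 64-entry ROM: index = S1 S0 I3 I2 I1 I0, data = S1 S0 U."""
    rom = [0] * 64
    for state in range(4):
        for digit in range(16):
            if state == 3:
                nxt, unlock = 0, 1      # unlocked: assert U, return to Start
            elif digit == 3:
                nxt, unlock = 1, 0      # a 3 always leads to state 1
            elif state == 1 and digit == 8:
                nxt, unlock = 2, 0
            elif state == 2 and digit == 4:
                nxt, unlock = 3, 0
            else:
                nxt, unlock = 0, 0      # wrong input: back to Start
            rom[(state << 4) | digit] = (nxt << 1) | unlock
    return rom


def step(rom: list, state: int, digit: int) -> tuple:
    """One FSM step: look up the ROM and split the word into (next_state, unlock)."""
    word = rom[(state << 4) | digit]
    return word >> 1, word & 1
```

Stepping through 3, 8, 4 from state 0 reaches state 3, and the lookup made from state 3 returns the unlock bit.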
![Page 52: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/52.jpg)
Multi-cycle Datapath FSM
• First state: Get a New Instruction • Output signals to fetch (e.g., read enable IMEM) • Next State: Always Decode
[Figure: FSM so far: Next Insn goes to Decode Insn]
![Page 53: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/53.jpg)
Multi-cycle Datapath FSM
• Second State: Decode • Output signals to decode instruction (RdEn RegFile) • Go to Next Insn if NOP • Otherwise Execute
[Figure: FSM so far: Next Insn goes to Decode Insn; Decode Insn goes back to Next Insn on NOP, otherwise to Execute Insn]
![Page 54: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/54.jpg)
Multi-cycle Datapath FSM
• Execute State • Execute Insn (varies by insn type) • Next State: Also depends on insn type
• Branches: Next Insn
[Figure: FSM so far: adds an edge labeled Branch from Execute Insn back to Next Insn]
![Page 55: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/55.jpg)
Multi-cycle Datapath FSM
• Execute State • Execute Insn (varies by insn type) • Next State: Also depends on insn type
• ALU op: write register
[Figure: FSM so far: adds a Writeback state, reached from Execute Insn on an edge labeled ALU]
![Page 56: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/56.jpg)
Multi-cycle Datapath FSM
• Execute State • Execute Insn (varies by insn type) • Next State: Also depends on insn type
• Load: Read Memory
[Figure: FSM so far: adds a Read DMEM state, reached from Execute Insn on an edge labeled Load]
![Page 57: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/57.jpg)
Multi-cycle Datapath FSM
• Execute State • Execute Insn (varies by insn type) • Next State: Also depends on insn type
• Store: Write Memory
[Figure: FSM so far: adds a Write DMEM state, reached from Execute Insn on an edge labeled Store]
![Page 58: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/58.jpg)
Multi-cycle Datapath FSM
• Read DMEM State • Control signals enable DMEM Read • Next state is writeback
[Figure: multi-cycle control FSM with states Next Insn, Decode Insn, Execute Insn, Read DMEM, Write DMEM, Writeback; Read DMEM goes to Writeback]
![Page 59: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/59.jpg)
Multi-cycle Datapath FSM
• Writeback state • Control signals enable regfile write • Next state: Next Insn
[Figure: multi-cycle control FSM with states Next Insn, Decode Insn, Execute Insn, Read DMEM, Write DMEM, Writeback; Writeback goes to Next Insn]
![Page 60: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/60.jpg)
Multi-cycle Datapath FSM
• Write DMEM state • Control signals enable memory write • Next state: Next Insn
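[Figure: complete multi-cycle control FSM. Next Insn goes to Decode Insn. Decode Insn goes to Next Insn on NOP, otherwise to Execute Insn. Execute Insn goes to Next Insn (Branch), Writeback (ALU), Read DMEM (Load), or Write DMEM (Store). Read DMEM goes to Writeback. Writeback and Write DMEM go to Next Insn]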
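The control FSM built up over these slides can be walked in software to see how many states (and therefore cycles) each kind of instruction visits. This is a minimal sketch: the dictionary and function names are illustrative, and one state is assumed to take one cycle.

```python
# Where Execute Insn goes next, by instruction type (from the FSM slides).
AFTER_EXECUTE = {
    "branch": "Next Insn",
    "alu": "Writeback",
    "load": "Read DMEM",
    "store": "Write DMEM",
}


def states_visited(kind: str) -> list:
    """Return the FSM states one instruction passes through, in order."""
    path = ["Next Insn", "Decode Insn"]
    if kind == "nop":
        return path                     # Decode goes straight back to Next Insn
    path.append("Execute Insn")
    state = AFTER_EXECUTE[kind]
    while state != "Next Insn":
        path.append(state)
        # Read DMEM is followed by Writeback; every other state ends the insn.
        state = "Writeback" if state == "Read DMEM" else "Next Insn"
    return path


def cycles_for(kind: str) -> int:
    """Cycles per instruction, assuming one cycle per FSM state."""
    return len(states_visited(kind))
```

Under that assumption a branch takes 3 cycles, ALU operations and stores take 4, and a load takes 5.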
![Page 64: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/64.jpg)
Multi-Cycle Datapath Example: Add
• Example: Add • Cycle 1: Read IMEM • Cycle 2: Decode + Read RF • Cycle 3: ALU • Cycle 4: Writeback + Increment PC
[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D, highlighting the path used by add]
![Page 65: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/65.jpg)
Multi-Cycle Datapath Performance
• Opposite performance split of single-cycle datapath + Short clock period – High CPI
[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]
![Page 69: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/69.jpg)
Multi-cycle Data-path CPI
• CPI depends on instructions • Branches / Jumps: 3 cycles • ALU: 4 cycles • Stores: 4 cycles • Loads: 5 cycles
• Overall CPI is weighted average
• Example: • 20% loads, 15% stores, 20% branches, 45% ALU
CPI= 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
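The weighted average can be reproduced in a few lines. This is a minimal sketch with illustrative names; the cycle counts and instruction mix are the ones on the slide.

```python
# Cycles per instruction class on the multi-cycle datapath.
CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}


def average_cpi(mix: dict) -> float:
    """Weighted-average CPI for an instruction mix given as fractions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())


example_mix = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}
cpi = average_cpi(example_mix)  # approximately 4.0, as on the slide
```

Shifting the mix toward branches lowers the CPI and shifting it toward loads raises it, since those are the 3-cycle and 5-cycle classes.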
![Page 70: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/70.jpg)
Multi-cycle Datapath Performance
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50 ns/insn
• Multi-cycle • Clock period = 10ns • CPI = (0.2*3+0.2*5+0.6*4) = 4 • Performance = 40 ns/insn
• But wait…
![Page 71: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/71.jpg)
Multi-Cycle Datapath Performance
• Did not just cut up existing logic into 5 pieces • Also added logic (flip flops)
• So clock period not 1/5 of single cycle, but slightly longer
[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]
![Page 72: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/72.jpg)
Multi-cycle Datapath Performance
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50 ns/insn
• Multi-cycle • Clock period = 12ns • CPI = (0.2*3+0.2*5+0.6*4) = 4 • Performance = 48 ns/insn
• Better, but not as exciting… • Can we do better still? • Have our cake (low CPI) and eat it too (high clock frequency)?
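The comparison is just clock period times CPI. The sketch below reruns the slide's numbers, including the idealized 10 ns clock that ignores flip-flop overhead; the helper name is illustrative.

```python
def ns_per_insn(clock_period_ns: float, cpi: float) -> float:
    """Average time per instruction = clock period * cycles per instruction."""
    return clock_period_ns * cpi


# Mix used on the slide: 20% branches (3 cycles), 20% loads (5), 60% other (4).
MULTI_CYCLE_CPI = 0.2 * 3 + 0.2 * 5 + 0.6 * 4

single = ns_per_insn(50, 1)                      # about 50 ns/insn
multi_ideal = ns_per_insn(10, MULTI_CYCLE_CPI)   # about 40 ns/insn
multi_real = ns_per_insn(12, MULTI_CYCLE_CPI)    # about 48 ns/insn
speedup = single / multi_real                    # roughly 1.04x
```

With the more realistic 12 ns clock, the multi-cycle design is only about 4% faster than single-cycle, which is why the slide asks whether we can do better still.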
![Page 73: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/73.jpg)
Clock Period and CPI
• Single-cycle datapath + Low CPI: 1 – Long clock period: to accommodate slowest insn
• Multi-cycle datapath + Short clock period – High CPI
• Can we have both low CPI and short clock period? – No good way to make a single insn go faster + Insn latency doesn’t matter anyway … insn throughput matters • Key: exploit inter-insn parallelism
[Figure: timing. Single-cycle: insn0.fetch, dec, exec | insn1.fetch, dec, exec. Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec]
![Page 74: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/74.jpg)
Pipelining
• Pipelining: important performance technique • Improves insn throughput rather than insn latency • Exploits parallelism at insn-stage level to do so • Begin with multi-cycle design
• When insn advances from stage 1 to 2, next insn enters stage 1
• Individual insns take same number of stages + But insns enter and leave at a much faster rate • Physically breaks “atomic” VN loop ... but must maintain illusion
• Automotive assembly line analogy
[Figure: timing. Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec. Pipelined: insn0.fetch | insn0.dec + insn1.fetch | insn0.exec + insn1.dec | insn1.exec]
![Page 75: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/75.jpg)
5 Stage Multi-Cycle Datapath
[Figure: multi-cycle datapath with PC, Insn Mem, Register File, sign extension (SX), Data Mem, and the IR, A, B, O, D latches]
![Page 76: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/76.jpg)
5 Stage Pipelined Datapath
• Temporary values (PC,IR,A,B,O,D) re-latched every stage • Why? 5 insns may be in pipeline at once, they share a single PC? • Notice, PC not latched after ALU stage (why not?)
[Figure: 5-stage pipelined datapath; the latches between stages hold PC, IR; then PC, A, B, IR; then O, B, IR; then O, D, IR]
![Page 77: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/77.jpg)
Pipeline Terminology
• Stages: Fetch, Decode, eXecute, Memory, Writeback • Latches (pipeline registers): PC, F/D, D/X, X/M, M/W
[Figure: 5-stage pipelined datapath with the pipeline latches labeled PC, F/D, D/X, X/M, M/W]
![Page 78: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/78.jpg)
Some More Terminology
• Scalar pipeline: one insn per stage per cycle • Alternative: “superscalar” (next unit)
• In-order pipeline: insns enter execute stage in VN order • Alternative: “out-of-order” (not covered in CSE 371)
• Pipeline depth: number of pipeline stages • Nothing magical about five • Trend has been to deeper pipelines
![Page 79: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/79.jpg)
Pipeline Example: Cycle 1
• 3 instructions
[Figure: pipelined datapath, cycle 1. F: add $3,$2,$1]
![Page 80: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/80.jpg)
Pipeline Example: Cycle 2
[Figure: pipelined datapath, cycle 2. F: lw $4,0($5); D: add $3,$2,$1]
![Page 81: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/81.jpg)
Pipeline Example: Cycle 3
[Figure: pipelined datapath, cycle 3. F: sw $6,4($7); D: lw $4,0($5); X: add $3,$2,$1]
![Page 82: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/82.jpg)
Pipeline Example: Cycle 4
• 3 instructions
[Figure: pipelined datapath, cycle 4. D: sw $6,4($7); X: lw $4,0($5); M: add $3,$2,$1]
![Page 83: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/83.jpg)
Pipeline Example: Cycle 5
[Figure: pipelined datapath, cycle 5. X: sw $6,4($7); M: lw $4,0($5); W: add $3,$2,$1]
![Page 84: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/84.jpg)
Pipeline Example: Cycle 6
[Figure: pipelined datapath, cycle 6. M: sw $6,4($7); W: lw $4,0($5)]
![Page 85: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/85.jpg)
Pipeline Example: Cycle 7
[Figure: pipelined datapath, cycle 7. W: sw $6,4($7)]
![Page 86: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/86.jpg)
Pipeline Diagram
• Pipeline diagram: shorthand for what we just saw • Across: cycles • Down: insns • Convention: X means lw $4,0($5) finishes execute stage and writes into X/M latch at end of cycle 4

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add $3,$2,$1 | F | D | X | M | W | | | | |
| lw $4,0($5) | | F | D | X | M | W | | | |
| sw $6,4($7) | | | F | D | X | M | W | | |
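The diagram above can be generated mechanically for a stall-free pipeline: insn i (counting from 0) occupies stage s in cycle i + s + 1. The following is a minimal sketch under that assumption; the function names `pipeline_diagram` and `render` are illustrative and not part of the slides.

```python
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_diagram(insns):
    """Return (insn, cells) rows; cells[c] is the stage occupied in cycle c + 1.

    Assumes an ideal pipeline with no stalls or flushes.
    """
    total_cycles = len(insns) + len(STAGES) - 1
    rows = []
    for i, insn in enumerate(insns):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # insn i enters stage s in cycle i + s + 1
        rows.append((insn, cells))
    return rows

def render(rows):
    """Format the rows as aligned text, using '.' for idle cycles."""
    width = max(len(insn) for insn, _ in rows)
    return "\n".join(
        insn.ljust(width) + " " + " ".join(c or "." for c in cells)
        for insn, cells in rows
    )

program = ["add $3,$2,$1", "lw $4,0($5)", "sw $6,4($7)"]
diagram = pipeline_diagram(program)
print(render(diagram))
```

With N insns and 5 stages, the last insn leaves W in cycle N + 4, which is why three insns finish in 7 cycles rather than 15.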
![Page 87: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/87.jpg)
What About Pipelined Control?
• Should it be like single-cycle control? • But individual insn signals must be staged
• Should it be like multi-cycle control? • But all stages are simultaneously active
• How many different controllers are we going to need? • One for each insn in pipeline?
• Solution: use simple single-cycle control, but pipeline it • Single controller
![Page 88: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/88.jpg)
Pipelined Control
[Figure: pipelined datapath with a single CTRL block in decode; control signals xC, mC, wC are latched in D/X, mC and wC in X/M, and wC in M/W]
![Page 89: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/89.jpg)
Pipeline Performance Calculation
• Single-cycle • Clock period = 50ns, CPI = 1 • Performance = 50ns/insn
• Multi-cycle • Branch: 20% (3 cycles), load: 20% (5 cycles), other: 60% (4 cycles) • Clock period = 12ns, CPI = (0.2*3+0.2*5+0.6*4) = 4
• Remember: latching overhead makes it 12, not 10 • Performance = 48ns/insn
• Pipelined • Clock period = 12ns • CPI = 1.5 (on average insn completes every 1.5 cycles) • Performance = 18ns/insn
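The three results on this slide all come from the same product: time per insn = clock period × CPI. A minimal sketch of that arithmetic, using only the numbers given on the slide (the helper names are illustrative):

```python
def average_cpi(mix):
    """mix: list of (fraction_of_insns, cycles) pairs; returns the weighted CPI."""
    return sum(fraction * cycles for fraction, cycles in mix)

def ns_per_insn(clock_period_ns, cpi):
    """Average time per instruction in nanoseconds."""
    return clock_period_ns * cpi

# Multi-cycle mix from the slide: branch 20% (3), load 20% (5), other 60% (4)
multi_cycle_cpi = average_cpi([(0.2, 3), (0.2, 5), (0.6, 4)])

single = ns_per_insn(50, 1)
multi = ns_per_insn(12, multi_cycle_cpi)
pipelined = ns_per_insn(12, 1.5)

for name, value in [("single-cycle", single), ("multi-cycle", multi), ("pipelined", pipelined)]:
    print(f"{name}: {value:.1f} ns/insn")
```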
![Page 90: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/90.jpg)
Q1: Why Is Pipeline Clock Period …
• … > delay thru datapath / number of pipeline stages?
• Latches (FFs) add delay • Pipeline stages have different delays, clock period is max delay
• Both factors have implications for ideal number of pipeline stages
![Page 91: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/91.jpg)
Q2: Why Is Pipeline CPI…
• … > 1? • CPI for scalar in-order pipeline is 1 + stall penalties • Stalls used to resolve hazards
• Hazard: condition that jeopardizes VN illusion • Stall: artificial pipeline delay introduced to restore VN illusion
• Calculating pipeline CPI • Frequency of stall * stall cycles • Penalties add (stalls generally don’t overlap in in-order pipelines) • 1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …
• Correctness/performance/MCCF • Long penalties OK if they happen rarely, e.g., 1 + 0.01 * 10 = 1.1 • Stalls also have implications for ideal number of pipeline stages
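The CPI formula above is a plain sum, which a short sketch makes concrete. The function name `pipeline_cpi` is illustrative; the first call reproduces the slide's 1 + 0.01 * 10 example, and the second uses the interlock frequencies that appear later in this deck.

```python
def pipeline_cpi(stalls):
    """CPI of a scalar in-order pipeline.

    stalls: iterable of (frequency, stall_cycles) pairs, assumed not to overlap.
    """
    return 1 + sum(freq * cycles for freq, cycles in stalls)

rare_long_stall = pipeline_cpi([(0.01, 10)])              # long but rare penalty
interlock_stalls = pipeline_cpi([(0.20, 1), (0.05, 2)])   # frequent short penalties

print(f"{rare_long_stall:.2f} {interlock_stalls:.2f}")
```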
![Page 92: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/92.jpg)
Dependences and Hazards
• Dependence: relationship between two insns • Data: two insns use same storage location • Control: one insn affects whether another executes at all • Not a bad thing, programs would be boring without them • Enforced by making older insn go before younger one
• Happens naturally in single-/multi-cycle designs • But not in a pipeline
• Hazard: dependence & possibility of wrong insn order • Effects of wrong insn order cannot be externally visible
• Stall: enforce order by keeping younger insn in same stage • Hazards are a bad thing: stalls reduce performance
![Page 93: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/93.jpg)
Why Does Every Insn Take 5 Cycles?
• Could/should we allow add to skip M and go to W? No – It wouldn’t help: peak fetch still only 1 insn per cycle – Structural hazards: imagine add follows lw
[Figure: pipelined datapath; add $3,$2,$1 follows lw $4,0($5)]
![Page 94: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/94.jpg)
Structural Hazards
• Structural hazards • Two insns trying to use same circuit at same time
• E.g., structural hazard on regfile write port
• To fix structural hazards: proper ISA/pipeline design • Each insn uses every structure exactly once • For at most one cycle • Always at same stage relative to F
![Page 95: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/95.jpg)
Data Hazards
• Let’s forget about branches and the control for a while • The three insn sequence we saw earlier executed fine…
• But it wasn’t a real program • Real programs have data dependences
• They pass values via registers and memory
[Figure: pipelined datapath (Register File through Data Mem; latches F/D, D/X, X/M, M/W) holding add $3,$2,$1, lw $4,0($5), sw $6,0($7)]
![Page 96: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/96.jpg)
Data Hazards
• Would this “program” execute correctly on this pipeline? • Which insns would execute with correct inputs? • add is writing its result into $3 in current cycle – lw read $3 2 cycles ago → got wrong value – addi read $3 1 cycle ago → got wrong value • sw is reading $3 this cycle → OK (regfile timing: write first half)
[Figure: pipelined datapath. D: sw $3,0($7); X: addi $6,$3,1; M: lw $4,0($3); W: add $3,$2,$1]
![Page 97: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/97.jpg)
Memory Data Hazards
• What about data hazards through memory? No • lw following sw to same address in next cycle, gets right value • Why? DMem read/write take place in same stage
• Data hazards through registers? Yes (previous slide) • Occur because register write is 3 stages after register read • Can only read a register value 3 cycles after writing it
[Figure: pipelined datapath; sw $5,0($1) followed by lw $4,0($1)]
![Page 98: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/98.jpg)
Fixing Register Data Hazards
• Can only read register value 3 cycles after writing it
• One way to enforce this: make sure programs don’t do it • Compiler puts two independent insns between write/read insn pair
• If they aren’t there already • Independent means: “do not interfere with register in question”
• Do not write it: otherwise meaning of program changes • Do not read it: otherwise create new data hazard
• Code scheduling: compiler moves around existing insns to do this • If none can be found, must use nops
• This is called software interlocks • MIPS: Microprocessor w/out Interlocking Pipeline Stages
![Page 99: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/99.jpg)
Software Interlock Example
add $3,$2,$1
lw $4,0($3)
sw $7,0($3)
add $6,$2,$8
addi $3,$5,4
• Can any of last three insns be scheduled between first two • sw $7,0($3)? No, creates hazard with add $3,$2,$1 • add $6,$2,$8? OK • addi $3,$5,4? No, lw would read $3 from it • Still need one more insn, use nop
add $3,$2,$1
add $6,$2,$8
nop
lw $4,0($3)
sw $7,0($3)
addi $3,$5,4
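The nop-insertion half of this job is mechanical once the insns are in their final order. Below is a minimal sketch, assuming the 5-stage rule from the previous slide (a reader must be at least 3 insns after the writer). The tuple representation `(text, dest_reg, src_regs)` and the name `insert_nops` are illustrative assumptions; the sketch does not reorder insns, it only pads.

```python
NOP = ("nop", None, ())
MIN_DISTANCE = 3  # reader must sit at least 3 insns after the writer (5-stage pipeline)

def insert_nops(program):
    """program: list of (text, dest_reg, src_regs). Returns the padded insn texts."""
    out = []
    for insn in program:
        _, _, srcs = insn
        needed = 0
        # Only the last MIN_DISTANCE - 1 emitted insns can be too close.
        for back in range(1, MIN_DISTANCE):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in srcs:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append(insn)
    return [text for text, _, _ in out]

scheduled = [
    ("add $3,$2,$1", "$3", ("$2", "$1")),
    ("add $6,$2,$8", "$6", ("$2", "$8")),
    ("lw $4,0($3)", "$4", ("$3",)),
    ("sw $7,0($3)", None, ("$7", "$3")),
    ("addi $3,$5,4", "$3", ("$5",)),
]
padded = insert_nops(scheduled)
print("\n".join(padded))
```

Run on the compiler-scheduled order, the sketch adds the single nop shown on the slide; run on the original order, it would need two.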
![Page 100: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/100.jpg)
Software Interlock Performance
• Same deal • Branch: 20%, load: 20%, store: 10%, other: 50%
• Software interlocks • 20% of insns require insertion of 1 nop • 5% of insns require insertion of 2 nops
• CPI is still 1 technically • But now there are more insns • #insns = 1 + 0.20*1 + 0.05*2 = 1.3 – 30% more insns (30% slowdown) due to data hazards
![Page 101: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/101.jpg)
Hardware Interlocks
• Problem with software interlocks? Not compatible • Where does 3 in “read register 3 cycles after writing” come from?
• From structure (depth) of pipeline • What if next MIPS version uses a 7 stage pipeline?
• Programs compiled assuming 5 stage pipeline will break
• A better (more compatible) way: hardware interlocks • Processor detects data hazards and fixes them • Two aspects to this
• Detecting hazards • Fixing hazards
![Page 102: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/102.jpg)
Detecting Data Hazards
• Compare F/D insn input register names with output register names of older insns in pipeline Hazard =
(F/D.IR.RS1 == D/X.IR.RD) || (F/D.IR.RS2 == D/X.IR.RD) || (F/D.IR.RS1 == X/M.IR.RD) || (F/D.IR.RS2 == X/M.IR.RD)
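The hazard equation translates directly into a software model. This is a sketch only: the `IR` dataclass and the use of `None` for an unused register field are assumptions made for illustration, since the slide's equation does not say how insns without a source or destination register are encoded.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IR:
    """Register fields of the insn held in a pipeline latch (None = field unused)."""
    rs1: Optional[int] = None
    rs2: Optional[int] = None
    rd: Optional[int] = None

def data_hazard(fd: IR, dx: IR, xm: IR) -> bool:
    """True if the insn in F/D reads a register that D/X or X/M will still write."""
    sources = {r for r in (fd.rs1, fd.rs2) if r is not None}
    pending = {r for r in (dx.rd, xm.rd) if r is not None}
    return bool(sources & pending)

lw = IR(rs1=3, rd=4)          # lw $4,0($3)
add = IR(rs1=2, rs2=1, rd=3)  # add $3,$2,$1
nop = IR()

print(data_hazard(fd=lw, dx=add, xm=nop))  # add one stage ahead of lw
print(data_hazard(fd=lw, dx=nop, xm=add))  # add two stages ahead of lw
print(data_hazard(fd=lw, dx=nop, xm=nop))  # add has reached M/W
```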
[Figure: pipelined datapath with hazard-detection logic comparing F/D.IR against D/X.IR and X/M.IR]
![Page 103: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/103.jpg)
Fixing Data Hazards
• Prevent F/D insn from reading (advancing) this cycle • Write nop into D/X.IR (effectively, insert nop in hardware) • Also reset (clear) the datapath control signals • Disable F/D latch and PC write enables (why?)
• Re-evaluate situation next cycle
[Figure: pipelined datapath; the hazard signal writes a nop into D/X and disables the PC and F/D write enables]
![Page 104: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/104.jpg)
Aside: Insert NOP/Reset Register
• Earlier: registers support separate clock, write enable • Useful for writes into register file • Also useful for implementing stalls
• Registers should also support synchronous reset (clear) • Useful for implementing stalls • Implement as additional hardwired 0 input to FF data mux • Resetting pipeline registers equivalent to inserting a NOP
• If NOP is all zeros • If zero means “don’t write” for all write-enable control signals • Design ISA/control signals to make sure this is the case
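A behavioral sketch of such a register, and of how a stall uses it, may help. It assumes that reset takes priority over write enable and that the value 0 encodes a nop; the class name and the opaque insn tokens `LW`, `ADD`, `SW` are hypothetical, not real encodings.

```python
class PipelineRegister:
    """Clocked register with write enable and synchronous reset."""

    def __init__(self, value=0):
        self.value = value  # 0 is assumed to encode a nop

    def tick(self, data, write_enable=True, reset=False):
        """Model one rising clock edge and return the stored value."""
        if reset:               # assumed priority: reset wins over write enable
            self.value = 0
        elif write_enable:
            self.value = data
        # otherwise hold the old value
        return self.value

# Hypothetical opaque tokens standing in for insn encodings.
LW, ADD, SW = 0x11, 0x22, 0x33

fd = PipelineRegister(LW)   # lw waiting in decode
dx = PipelineRegister(ADD)  # add in execute

# Stall cycle: F/D holds lw (write enable off), D/X is cleared to a nop.
old_fd = fd.value
fd.tick(SW, write_enable=False)
dx.tick(old_fd, reset=True)
stalled_state = (fd.value, dx.value)

# Next cycle, no hazard: lw advances into D/X and sw enters F/D.
old_fd = fd.value
fd.tick(SW)
dx.tick(old_fd)
advanced_state = (fd.value, dx.value)
```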
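[Figure: flip-flop (D, Q) with write enable WE, and the same flip-flop with a data mux selected by RST and WE that adds a hardwired 0 input]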
![Page 105: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/105.jpg)
Hardware Interlock Example: cycle 1
(F/D.IR.RS1 == D/X.IR.RD) || (F/D.IR.RS2 == D/X.IR.RD) || (F/D.IR.RS1 == X/M.IR.RD) || (F/D.IR.RS2 == X/M.IR.RD) = 1
[Figure: F/D holds lw $4,0($3), D/X holds add $3,$2,$1; hazard is asserted and a nop is written into D/X]
![Page 106: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/106.jpg)
Hardware Interlock Example: cycle 2
(F/D.IR.RS1 == D/X.IR.RD) || (F/D.IR.RS2 == D/X.IR.RD) || (F/D.IR.RS1 == X/M.IR.RD) || (F/D.IR.RS2 == X/M.IR.RD) = 1
[Figure: F/D still holds lw $4,0($3), X/M holds add $3,$2,$1; hazard is asserted and another nop is written into D/X]
![Page 107: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/107.jpg)
Hardware Interlock Example: cycle 3
(F/D.IR.RS1 == D/X.IR.RD) || (F/D.IR.RS2 == D/X.IR.RD) || (F/D.IR.RS1 == X/M.IR.RD) || (F/D.IR.RS2 == X/M.IR.RD) = 0
[Figure: F/D holds lw $4,0($3), M/W holds add $3,$2,$1; no hazard, so lw advances]
![Page 108: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/108.jpg)
Pipeline Control Terminology
• Hardware interlock maneuver is called stall or bubble
• Mechanism is called stall logic • Part of more general pipeline control mechanism
• Controls advancement of insns through pipeline
• Distinguish from pipelined datapath control • Controls datapath at each stage • Pipeline control controls advancement of datapath control
![Page 109: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/109.jpg)
Pipeline Diagram with Data Hazards
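• Data hazard stall indicated with d* • Stall propagates to younger insns

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add $3,$2,$1 | F | D | X | M | W | | | | |
| lw $4,0($3) | | F | d* | d* | D | X | M | W | |
| sw $6,4($7) | | | | | F | D | X | M | W |

• This is not good (why?)

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add $3,$2,$1 | F | D | X | M | W | | | | |
| lw $4,0($3) | | F | d* | d* | D | X | M | W | |
| sw $6,4($7) | | | F | D | X | M | W | | |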
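The stall timing can be derived with a few lines of bookkeeping. The sketch below assumes no bypassing, a register file that writes in the first half of W (so a reader may decode in the writer's W cycle), and the drawing convention that a stalled younger insn shows F in the cycle just before it enters D. The function name and tuple format are illustrative.

```python
def stall_diagram(program):
    """program: list of (text, dest_reg, src_regs).

    Returns a list of (text, cells) where cells maps cycle number -> stage label.
    """
    ready = {}        # register -> first cycle in which a reader may decode
    prev_decode = 1   # so that the first insn decodes in cycle 2
    rows = []
    for text, dest, srcs in program:
        enter = prev_decode + 1  # D is free once the older insn has left it
        decode = max([enter] + [ready.get(r, 0) for r in srcs])
        cells = {enter - 1: "F"}
        for cycle in range(enter, decode):
            cells[cycle] = "d*"  # waiting in decode for a source register
        for offset, stage in enumerate("DXMW"):
            cells[decode + offset] = stage
        if dest is not None:
            ready[dest] = decode + 3  # the writer's W cycle
        prev_decode = decode
        rows.append((text, cells))
    return rows

rows = stall_diagram([
    ("add $3,$2,$1", "$3", ("$2", "$1")),
    ("lw $4,0($3)", "$4", ("$3",)),
    ("sw $6,4($7)", None, ("$6", "$7")),
])
for text, cells in rows:
    print(text, [cells.get(c, ".") for c in range(1, 10)])
```

The independent sw is delayed even though it has no hazard of its own, because an in-order pipeline cannot let it pass the stalled lw.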
![Page 110: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/110.jpg)
Hardware Interlock Performance
• Same deal • Branch: 20%, load: 20%, store: 10%, other: 50%
• Hardware interlocks: same as software interlocks • 20% of insns require 1 cycle stall (i.e., insertion of 1 nop) • 5% of insns require 2 cycle stall (i.e., insertion of 2 nops)
• CPI = 1 + 0.20*1 + 0.05*2 = 1.3 • So, either CPI stays at 1 and #insns increases 30% (software) • Or, #insns stays at 1 (relative) and CPI increases 30% (hardware) • Same difference
• Anyway, we can do better
![Page 111: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/111.jpg)
Observe
• Technically, this situation is broken • lw $4,0($3) has already read $3 from regfile • add $3,$2,$1 hasn’t yet written $3 to regfile
• But fundamentally, everything is OK • lw $4,0($3) hasn’t actually used $3 yet • add $3,$2,$1 has already computed $3
[Figure: pipelined datapath with lw $4,0($3) in X and add $3,$2,$1 in M]
![Page 112: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/112.jpg)
Bypassing
• Bypassing • Reading a value from an intermediate (µarchitectural) source • Not waiting until it is available from primary source • Here, we are bypassing the register file • Also called forwarding
[Figure: bypass path from X/M.O to ALU input A; lw $4,0($3) in X, add $3,$2,$1 in M]
![Page 113: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/113.jpg)
WX Bypassing
• What about this combination? • Add another bypass path and MUX input • First one was an MX bypass • This one is a WX bypass
[Figure: bypass path from M/W to ALU input A; lw $4,0($3) in X, add $3,$2,$1 in W]
![Page 114: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/114.jpg)
ALUinB Bypassing
• Can also bypass to ALU input B
[Figure: bypass paths to ALU input B; add $3,$2,$1 followed by add $4,$2,$3]
![Page 115: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/115.jpg)
WM Bypassing?
• Does WM bypassing make sense? • Not to the address input (why not?) • But to the store data input, yes
[Figure: bypass path from M/W to the Data Mem data input; lw $3,0($2) followed by sw $3,0($4)]
![Page 116: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/116.jpg)
Bypass Logic
• Each MUX has its own, here it is for MUX ALUinA
(D/X.IR.RS1 == X/M.IR.RD) => 0
(D/X.IR.RS1 == M/W.IR.RD) => 1
Else => 2
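The select logic for the ALUinA mux is a priority check, with the younger producer (X/M) tested first so the most recent value wins. A minimal sketch follows; the function name and the plain-integer register arguments are illustrative assumptions.

```python
def alu_in_a_select(dx_rs1, xm_rd, mw_rd):
    """Return the ALUinA mux select, following the slide's three cases."""
    if dx_rs1 == xm_rd:
        return 0  # MX bypass: take the value from the X/M latch
    if dx_rs1 == mw_rd:
        return 1  # WX bypass: take the value from the M/W latch
    return 2      # no bypass: use the value read from the register file

# add $3,$2,$1 immediately followed by lw $4,0($3): lw's RS1 matches X/M.RD
mx_case = alu_in_a_select(dx_rs1=3, xm_rd=3, mw_rd=9)
# one unrelated insn in between: the producer is now in M/W
wx_case = alu_in_a_select(dx_rs1=3, xm_rd=9, mw_rd=3)
# no older insn in flight writes register 3
no_bypass = alu_in_a_select(dx_rs1=3, xm_rd=8, mw_rd=9)
```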
[Figure: pipelined datapath with bypass logic driving the ALU input MUXes]
![Page 117: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/117.jpg)
Bypass and Stall Logic
• Two separate things • Stall logic controls pipeline registers • Bypass logic controls MUXs
• But complementary • For a given data hazard: if can’t bypass, must stall
• Slide #43 shows full bypassing: all bypasses possible • Is stall logic still necessary?
![Page 118: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/118.jpg)
Yes, Load Output to ALU Input
Stall = (D/X.IR.OP == LOAD) && ((F/D.IR.RS1 == D/X.IR.RD) || ((F/D.IR.RS2 == D/X.IR.RD) && (F/D.IR.OP != STORE)))
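The load-use stall condition can be written as a small predicate. In this sketch the opcode strings and the convention that a store's RS1 is its address base and RS2 is its data register are assumptions made for illustration.

```python
def load_use_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """True if the insn in F/D must wait one cycle for a load in D/X."""
    return dx_op == "LOAD" and (
        fd_rs1 == dx_rd
        # A store's data operand can be bypassed later (WM), so it does not stall.
        or (fd_rs2 == dx_rd and fd_op != "STORE")
    )

# lw $3,0($2) followed by add $4,$2,$3: add needs $3 at the ALU, so stall
add_after_load = load_use_stall("LOAD", 3, "ADD", 2, 3)
# lw $3,0($2) followed by sw $3,0($4): $3 is only the store data, so no stall
store_data_after_load = load_use_stall("LOAD", 3, "STORE", 4, 3)
# lw $3,0($2) followed by sw $5,0($3): $3 is the address base, so stall
store_addr_after_load = load_use_stall("LOAD", 3, "STORE", 3, 5)
```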
[Figure: pipelined datapath with stall logic; lw $3,0($2) followed by add $4,$2,$3, with a nop inserted between them]
![Page 119: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/119.jpg)
Pipeline Diagram With Bypassing
• Use compiler scheduling to reduce load-use stall frequency • Like software interlocks, but for performance not correctness
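| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add $3,$2,$1 | F | D | X | M | W | | | | |
| lw $4,0($3) | | F | D | X | M | W | | | |
| addi $6,$4,1 | | | F | d* | D | X | M | W | |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add $3,$2,$1 | F | D | X | M | W | | | | |
| lw $4,0($3) | | F | D | X | M | W | | | |
| sub $8,$3,$1 | | | F | D | X | M | W | | |
| addi $6,$4,1 | | | | F | D | X | M | W | |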
![Page 120: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/120.jpg)
Control Hazards
• Control hazards • Must fetch post branch insns before branch outcome is known • Default: assume “not-taken” (at fetch, can’t tell it’s a branch)
[Figure: front of the pipelined datapath (PC, Insn Mem, Register File, latches F/D, D/X, X/M) with the branch target computed in X]
![Page 121: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/121.jpg)
Branch Recovery
• Branch recovery: what to do when branch is actually taken • Insns that will be written into F/D and D/X are wrong • Flush them, i.e., replace them with nops + They haven’t written permanent state yet (regfile, DMem)
[Figure: same datapath; on a taken branch, nops are written into F/D and D/X]
![Page 122: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/122.jpg)
Branch Recovery Pipeline Diagram
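• Convention: don’t fill in flushed insns • Taken branch penalty is 2 cycles

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| addi $3,$0,1 | F | D | X | M | W | | | | |
| bnez $3,targ | | F | D | X | M | W | | | |
| sw $6,4($7) | | | F | D | | | | | |
| targ: addi $8,$7,1 | | | | F | | | | | |
| targ: addi $8,$7,1 | | | | | F | D | X | M | W |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| addi $3,$0,1 | F | D | X | M | W | | | | |
| bnez $3,targ | | F | D | X | M | W | | | |
| targ: addi $8,$7,1 | | | | | F | D | X | M | W |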
![Page 123: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/123.jpg)
Branch Performance
• Back of the envelope calculation • Branch: 20%, load: 20%, store: 10%, other: 50% • 75% of branches are taken
• CPI = 1 + 0.20*0.75*2 = 1.3 – Branches cause 30% slowdown • How do we reduce this penalty?
![Page 124: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/124.jpg)
Fast Branch
• Fast branch: can decide at D, not X • Test must be comparison to zero or equality, no time for ALU + New taken branch penalty is 1 – Additional insns (slt) for more complex tests, must bypass to D too • 25% of branches have complex tests that require extra insn • CPI = 1 + 0.20*0.75*1(branch) + 0.20*0.25*1(extra insn) = 1.2
[Figure: pipelined datapath with the branch test (<> 0) and target computation moved into D]
![Page 125: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/125.jpg)
Speculative Execution
• Speculation: “risky transactions on chance of profit”
• Speculative execution • Execute before all parameters known with certainty • Correct speculation
+ Avoid stall, improve performance • Incorrect speculation (mis-speculation)
– Must abort/flush/squash incorrect insns – Must undo incorrect changes (recover pre-speculation state)
• The “game”: [%correct * gain] – [(1–%correct) * penalty]
• Control speculation: speculation aimed at control hazards • Unknown parameter: are these the correct insns to execute next?
![Page 126: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/126.jpg)
Control Speculation Mechanics
• Guess branch target, start fetching at guessed position • Doing nothing is implicitly guessing target is PC+4 • Can actively guess other targets: dynamic branch prediction
• Execute branch to verify (check) guess • Correct speculation? keep going • Mis-speculation? Flush mis-speculated insns
• Hopefully haven’t modified permanent state (Regfile, DMem) + Happens naturally in in-order 5-stage pipeline
• “Game” for in-order 5 stage pipeline • %correct = ? • Gain = 2 cycles + Penalty = 0 cycles → mis-speculation no worse than stalling
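The speculation "game" is an expected-value calculation. A minimal sketch, using the slide's gain of 2 cycles and penalty of 0 cycles; the 25% figure for %correct is an assumption borrowed from the earlier statement that 75% of branches are taken, so guessing not-taken is right 25% of the time.

```python
def expected_speculation_gain(pct_correct, gain, penalty):
    """Expected cycles saved per branch: [%correct * gain] - [(1 - %correct) * penalty]."""
    return pct_correct * gain - (1 - pct_correct) * penalty

# Implicit not-taken guess on the in-order 5-stage pipeline
not_taken_guess = expected_speculation_gain(pct_correct=0.25, gain=2, penalty=0)
print(not_taken_guess)
```

Because the penalty is 0, the result can never be negative here: any correct guess is pure profit relative to stalling on every branch.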
![Page 127: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/127.jpg)
Dynamic Branch Prediction
• Dynamic branch prediction: guess outcome • Start fetching from guessed address • Flush on mis-prediction (notice new recovery circuit)
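[Figure: pipelined datapath with a branch predictor (BP) at fetch; the predicted target (TG) is carried through F/D and D/X and compared (<>) with the actual target in X; nops flush F/D and D/X on a mis-prediction]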
![Page 128: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/128.jpg)
Branch Prediction: Short Summary
• Key principle of micro-architecture: • Programs do the same thing over and over (why?)
• Exploit for performance: • Learn what a program did before • Guess that it will do the same thing again
• Details of branch prediction: later (~1 month) • For now, just know it can be done and is important to performance
![Page 129: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/129.jpg)
Branch Prediction Performance
• Dynamic branch prediction • Simple predictor: branches predicted with 75% accuracy • CPI = 1 + 0.20*0.25*2 = 1.1 • More advanced predictor: 95% accuracy • CPI = 1 + 0.20*0.05*2 = 1.02
• Branch mis-predictions still a big problem though • Pipelines are long: typical mis-prediction penalty is 10+ cycles • Pipelines have full bypassing: compiler schedules the rest • Pipelines are superscalar (later)
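To see why mis-predictions remain a problem on long pipelines, the CPI formula can be evaluated for both the 2-cycle penalty of the 5-stage design and a 10-cycle penalty. This is a sketch: the function name is illustrative, and the 10-cycle figure is simply the lower bound of the "10+ cycles" mentioned above.

```python
def branch_cpi(branch_freq, accuracy, penalty):
    """CPI when only mis-predicted branches cost extra cycles."""
    return 1 + branch_freq * (1 - accuracy) * penalty

results = {}
for accuracy in (0.75, 0.95):
    for penalty in (2, 10):
        results[(accuracy, penalty)] = branch_cpi(0.20, accuracy, penalty)
        print(f"accuracy={accuracy:.2f} penalty={penalty:2d} "
              f"CPI={results[(accuracy, penalty)]:.2f}")
```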
![Page 130: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/130.jpg)
Pipelining And Exceptions
• Pipelining makes exceptions nasty • 5 insns in pipeline at once • Exception happens, how do you know which insn caused it?
• Exceptions propagate along pipeline in latches • Two exceptions happen, how do you know which one to take first?
• One belonging to oldest insn • When handling exception, have to flush younger insns
• Piggy-back on branch mis-prediction machinery to do this • What about multi-cycle operations?
• Just FYI
![Page 131: CS104 Computer Organization and Design](https://reader030.vdocuments.net/reader030/viewer/2022012506/61818a546c9b4f3cfb6b2c03/html5/thumbnails/131.jpg)
Pipeline Depth
• No magic about 5 stages, trend had been to deeper pipelines • 486: 5 stages (50+ gate delays / clock) • Pentium: 7 stages • Pentium II/III: 12 stages • Pentium 4: 22 stages (~10 gate delays / clock) “super-pipelining” • Core1/2: 14 stages
• Increasing pipeline depth + Increases clock frequency (reduces period) – But decreases IPC (increases CPI) • Branch mis-prediction penalty becomes longer • Non-bypassed data hazard stalls become longer • At some point, CPI losses offset clock gains, question is when?
• 1GHz Pentium 4 was slower than 800 MHz Pentium III • What was the point? People buy frequency, not frequency * IPC
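Since performance is frequency * IPC, the clock speeds alone fix the IPC ratio at which two designs break even. The sketch below uses only the two clock rates quoted above; no IPC values are taken from the slides, and the function names are illustrative.

```python
def insns_per_second(freq_mhz, ipc):
    """Throughput in millions of insns per second: frequency * IPC."""
    return freq_mhz * ipc

def break_even_ipc_ratio(fast_clock_mhz, slow_clock_mhz):
    """IPC ratio (fast design / slow design) at which both have equal throughput."""
    return slow_clock_mhz / fast_clock_mhz

ratio = break_even_ipc_ratio(fast_clock_mhz=1000, slow_clock_mhz=800)
print(ratio)
```

For the 1 GHz part to be slower than the 800 MHz part, its IPC must have fallen below this fraction of the older design's IPC.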