Implementing a MIPS Processor
Readings: 4.1-4.11
Goals for this Class
• Understand how CPUs run programs
  • How do we express the computation to the CPU?
  • How does the CPU execute it?
  • How does the CPU support other system components (e.g., the OS)?
  • What techniques and technologies are involved and how do they work?
• Understand why CPU performance (and other metrics) varies
  • How does CPU design impact performance?
  • What trade-offs are involved in designing a CPU?
  • How can we meaningfully measure and compare computer systems?
• Understand why program performance varies
  • How do program characteristics affect performance?
  • How can we improve a program’s performance by considering the CPU running it?
  • How do other system components impact program performance?
Goals
• Understand how the 5-stage MIPS pipeline works
  • See examples of how architecture impacts ISA design
  • Understand how the pipeline affects performance
• Understand hazards and how to avoid them
  • Structural hazards
  • Data hazards
  • Control hazards
Processor Design in Two Acts
Act I: A single-cycle CPU
Foreshadowing
• Act I: A Single-cycle Processor
  • Simplest design; not how many real machines work (maybe some deeply embedded processors)
  • Figure out the basic parts; what it takes to execute instructions
• Act II: A Pipelined Processor
  • This is how many real machines work
  • Exploit parallelism by executing multiple instructions at once.
Target ISA
• We will focus on part of MIPS
  • Enough to run into the interesting issues
  • Memory operations
  • A few arithmetic/logical operations (generalizing is straightforward)
  • BEQ and J
• This corresponds pretty directly to what you’ll be implementing in 141L.
Basic Steps for Execution
• Fetch an instruction from the instruction store
• Decode it
  • What does this instruction do?
• Gather inputs
  • From the register file
  • From memory
• Perform the operation
• Write back the outputs
  • To register file or memory
• Determine the next instruction to execute
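The execution steps can be sketched as a small interpreter loop. This is only an illustration: the tuple-based instruction format, the `run` function, and the list/dict used for registers and memory are hypothetical stand-ins, not real MIPS encodings or the course's implementation.

```python
def run(program, regs, mem, max_steps=100):
    """Execute a toy program; each instruction is a (op, a, b, c) tuple."""
    pc = 0
    for _ in range(max_steps):
        if pc // 4 >= len(program):
            break
        inst = program[pc // 4]          # fetch from the instruction store
        op, a, b, c = inst               # decode: what does this instruction do?
        next_pc = pc + 4                 # default: fall through to the next one
        if op == "add":                  # add rd, rs, rt
            regs[a] = regs[b] + regs[c]  # gather inputs, operate, write back
        elif op == "lw":                 # lw rt, imm(rs)
            regs[a] = mem[regs[b] + c]
        elif op == "sw":                 # sw rt, imm(rs)
            mem[regs[b] + c] = regs[a]
        elif op == "beq":                # beq rs, rt, offset (in instructions)
            if regs[a] == regs[b]:
                next_pc = pc + 4 + c * 4
        else:
            raise ValueError(f"unknown op {op!r}")
        pc = next_pc                     # determine the next instruction
    return regs, mem


regs = [0] * 8
mem = {0: 5, 4: 7}
program = [
    ("lw", 1, 0, 0),   # r1 = mem[r0 + 0]
    ("lw", 2, 0, 4),   # r2 = mem[r0 + 4]
    ("add", 3, 1, 2),  # r3 = r1 + r2
    ("sw", 3, 0, 8),   # mem[r0 + 8] = r3
]
run(program, regs, mem)
```

Every step happens inside one loop iteration here, which mirrors the single-cycle design: one instruction is completely finished before the next one starts.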
The Processor Design Algorithm
• Once you have an ISA…
• Design/Draw the datapath
  • Identify and instantiate the hardware for your architectural state
  • For each instruction
    • Simulate the instruction
    • Add and connect the datapath elements it requires
    • Is it workable? If not, fix it.
• Design the control
  • For each instruction
    • Simulate the instruction
    • What control lines do you need?
    • How will you compute their value?
    • Modify control accordingly
    • Is it workable? If not, fix it.
• You’ve already done much of this in 141L.
• Arithmetic; R-Type
  • Inst = Mem[PC]
  • REG[rd] = REG[rs] op REG[rt]
  • PC = PC + 4

bits     31:26  25:21  20:16  15:11  10:6   5:0
name     op     rs     rt     rd     shamt  funct
# bits   6      5      5      5      5      6
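The R-type field layout maps directly onto shifts and masks. The sketch below is an illustration only; the function name `decode_r_type` is made up, and the example word uses the standard MIPS encoding of `add` (op = 0, funct = 0x20).

```python
def decode_r_type(inst):
    """Split a 32-bit instruction word into the six R-type fields."""
    return {
        "op":    (inst >> 26) & 0x3F,  # bits 31:26, 6 bits
        "rs":    (inst >> 21) & 0x1F,  # bits 25:21, 5 bits
        "rt":    (inst >> 16) & 0x1F,  # bits 20:16, 5 bits
        "rd":    (inst >> 11) & 0x1F,  # bits 15:11, 5 bits
        "shamt": (inst >> 6) & 0x1F,   # bits 10:6,  5 bits
        "funct": inst & 0x3F,          # bits 5:0,   6 bits
    }


# add $3, $1, $2: rs = 1, rt = 2, rd = 3
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_r_type(word)
```

In hardware no computation is needed for this step: each field is just a fixed slice of the instruction wires, which is part of why the fixed-width format makes decode cheap.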
• ADDI; I-Type
  • PC = PC + 4
  • REG[rt] = REG[rs] + SignExtImm

bits     31:26  25:21  20:16  15:0
name     op     rs     rt     imm
# bits   6      5      5      16
• Load Word
  • PC = PC + 4
  • REG[rt] = MEM[signextendImm + REG[rs]]

bits     31:26  25:21  20:16  15:0
name     op     rs     rt     immediate
# bits   6      5      5      16
• Store Word
  • PC = PC + 4
  • MEM[signextendImm + REG[rs]] = REG[rt]

bits     31:26  25:21  20:16  15:0
name     op     rs     rt     immediate
# bits   6      5      5      16
• Branch-equal; I-Type
  • PC = (REG[rs] == REG[rt]) ? PC + 4 + SignExtImmediate * 4 : PC + 4;

bits     31:26  25:21  20:16  15:0
name     op     rs     rt     displacement
# bits   6      5      5      16
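The I-type semantics all depend on sign-extending the 16-bit immediate. The following sketch shows that step plus the load/store address and branch-target calculations; the helper names and the 32-bit wrap-around mask are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, as the hardware would


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(rs_value, imm):
    """Address used by load word and store word: signextendImm + REG[rs]."""
    return (rs_value + sign_extend16(imm)) & MASK32


def next_pc_beq(pc, rs_value, rt_value, imm):
    """PC update for branch-equal: taken target or fall-through."""
    if rs_value == rt_value:
        return (pc + 4 + sign_extend16(imm) * 4) & MASK32
    return (pc + 4) & MASK32
```

For example, `next_pc_beq(0x100, 5, 5, 0xFFFF)` branches back by one instruction relative to PC + 4, so it lands on 0x100 again, while unequal register values simply give 0x104.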
A Single-cycle Processor
• Performance refresher
  • ET = IC * CPI * CT
• Single cycle ⇒ CPI == 1; that sounds great
• Unfortunately, single cycle ⇒ CT is large
  • Even RISC instructions take quite a bit of effort to execute
  • This is a lot to do in one cycle
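A quick calculation shows why CPI == 1 alone does not make a design fast. In this sketch the 15 ns cycle time is the figure these slides quote for the single-cycle design; the instruction count and the 3 ns comparison point are made-up values for illustration.

```python
def execution_time(ic, cpi, ct_ns):
    """ET = IC * CPI * CT, returned in nanoseconds."""
    return ic * cpi * ct_ns


# Same program, same CPI of 1, different cycle times.
single_cycle = execution_time(ic=1_000_000, cpi=1, ct_ns=15)  # 15 ns cycle
faster_clock = execution_time(ic=1_000_000, cpi=1, ct_ns=3)   # hypothetical 3 ns cycle
```

With IC and CPI held fixed, execution time scales directly with CT, so the long single-cycle clock period is the term worth attacking.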
Our Hardware is Mostly Idle
Cycle time = 15 ns
Slowest module (ALU) is ~6 ns
Processor Design in Two Acts
Act II: A pipelined CPU
Pipelining Review
Pipelining
What’s the latency for one unit of work?
What’s the throughput?
[Figure: one 10 ns block of logic between two latches, drawn for cycles 1, 2, and 3, which end at 10 ns, 20 ns, and 30 ns.]
Pipelining
• Break up the logic with latches into “pipeline stages”
• Each stage can act on different data
• Latches hold the inputs to their stage
• Every clock cycle data transfers from one pipe stage to the next
[Figure: a 10 ns block of logic between two latches, redrawn as five 2 ns logic stages separated by latches.]
Pipelining
What’s the latency for one unit of work?
What’s the throughput?
[Figure: five pipelined logic stages shown for cycles 1 to 6, which end at 2 ns, 4 ns, 6 ns, 8 ns, 10 ns, and 12 ns.]
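The two questions can be worked out numerically for the 10 ns logic block and its five 2 ns stages. This sketch assumes an idealized pipeline in which the latches add no delay and a new unit of work enters every cycle; the function name is made up.

```python
def latency_and_throughput(stage_delay_ns, num_stages):
    """Return (latency in ns, throughput in units of work per ns)."""
    latency_ns = stage_delay_ns * num_stages  # one unit visits every stage
    throughput_per_ns = 1 / stage_delay_ns    # one unit completes per cycle
    return latency_ns, throughput_per_ns


unpipelined = latency_and_throughput(10, 1)  # one 10 ns stage
pipelined = latency_and_throughput(2, 5)     # five 2 ns stages
```

Under these assumptions the latency stays at 10 ns in both cases, while throughput rises from one result every 10 ns to one result every 2 ns. Pipelining improves throughput, not the latency of a single unit of work.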
Critical path review
• The critical path is the longest possible delay between two registers in a design.
• The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path.
• Lengthening or shortening non-critical paths does not change performance.
• Ideally, all paths are about the same length.
Pipelining and Logic
• Hopefully, critical path reduced by 1/3
Limitations of Pipelining
• You cannot pipeline forever
  • Some logic cannot be pipelined arbitrarily -- memories
  • Some logic is inconvenient to pipeline.
    • How do you insert a register in the middle of a multiplier?
• Registers have a cost
  • They cost area -- choose “narrow points” in the logic
  • They cost energy -- latches don’t do any useful work
  • They cost time
    • Extra logic delay
    • Set-up and hold times.
• Pipelining may not affect the critical path as you expect
Pipelining Overhead
• Logic Delay (LD) -- How long does the logic take (i.e., the useful part)?
• Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready?
  • Relatively short -- 0.036 ns on our FPGAs
• Register delay (RD) -- Delay through the internals of the register.
  • Longer -- 1.5 ns for our FPGAs.
  • Much, much shorter for raw CMOS.
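Pipelining Overhead
[Figure: timing diagram of Clock, Register in, and Register out. Register in carries the new data for the setup time before the clock edge; Register out changes from old data to new data one register delay after the edge.]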
Pipelining Overhead
• Logic Delay (LD) -- How long does the logic take (i.e., the useful part)?
• Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready?
• Register delay (RD) -- Delay through the internals of the register.
• CTbase -- cycle time before pipelining
  • CTbase = LD + ST + RD
• CTpipe -- cycle time after pipelining into N stages
  • CTpipe = ST + RD + LD/N
  • Total time (latency through all N stages) = N * CTpipe = N*ST + N*RD + LD
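Plugging numbers into these formulas shows how register overhead eats into the ideal speedup. The ST and RD values below are the FPGA figures quoted on the earlier overhead slide; the 12 ns logic delay and the function names are assumptions chosen only for illustration.

```python
def ct_base(ld, st, rd):
    """Cycle time before pipelining: CTbase = LD + ST + RD."""
    return ld + st + rd


def ct_pipe(ld, st, rd, n):
    """Cycle time after pipelining into n stages: CTpipe = ST + RD + LD/N."""
    return st + rd + ld / n


def total_time(ld, st, rd, n):
    """Latency through all n stages: N*ST + N*RD + LD."""
    return n * ct_pipe(ld, st, rd, n)


ST, RD = 0.036, 1.5  # ns, FPGA figures from the slides
LD = 12.0            # ns, hypothetical logic delay

for n in (1, 2, 5, 10):
    speedup = ct_base(LD, ST, RD) / ct_pipe(LD, ST, RD, n)
    print(f"N={n:2d}  CTpipe={ct_pipe(LD, ST, RD, n):.3f} ns  "
          f"latency={total_time(LD, ST, RD, n):.3f} ns  speedup={speedup:.2f}x")
```

With these numbers a five-stage split gives a cycle time of 3.936 ns, which is roughly a 3.4x improvement rather than 5x, and the latency of one unit of work grows with every stage added because each stage pays ST + RD again.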
Pipelining Difficulties
• You can’t always pipeline how you would like
[Figure: two copies of a chain of logic blocks in the order Fast Logic, Slow Logic, Slow Logic, Fast Logic.]
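One way to see the difficulty: the clock must accommodate the slowest stage, so uneven stages waste time in the fast ones. The stage delays below are invented for illustration, and the model ignores register overhead unless `overhead_ns` is supplied.

```python
def cycle_time(stage_delays_ns, overhead_ns=0.0):
    """The clock period is set by the slowest stage plus any register overhead."""
    return max(stage_delays_ns) + overhead_ns


balanced = [3.0, 3.0, 3.0, 3.0]  # 12 ns of logic, evenly split
uneven = [1.0, 5.0, 5.0, 1.0]    # same 12 ns: fast, slow, slow, fast

ct_balanced = cycle_time(balanced)
ct_uneven = cycle_time(uneven)
```

Both splits contain the same total logic, but the uneven one runs at a 5 ns cycle instead of 3 ns because the register boundaries could not be placed where they would balance the stages.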
How to pipeline a processor
• Break each instruction into pieces -- remember the basic algorithm for execution
  • Fetch
  • Decode
  • Collect arguments
  • Execute
  • Write back results
  • Compute next PC
• The “classic 5-stage MIPS pipeline”
  • Fetch -- read the instruction
  • Decode -- decode and read from the register file
  • Execute -- perform arithmetic ops and address calculations
  • Memory -- access data memory
  • Write back -- store results in the register file
Pipelining a processor
[Figure: the stages Fetch, Decode/reg read, EX, Mem, and Reg Writeback drawn three ways, labeled “Reality”, “Easier to draw”, and “Pipelined”.]
Pipelined Datapath
[Figure: datapath with PC, an adder for PC + 4, Instruction Memory (Read Address), Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), Sign Extend (16 to 32 bits), Shift left 2, a second adder, ALU, and Data Memory (Address, Write Data, Read Data).]
[Figure: the same datapath with the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB added.]
[Figure series: the same datapath over successive cycles as the instruction stream “add …, lw …, Sub …, Sub …, Add …, Add …” moves through the stages.]
Pipelined Datapath
[Figure: the pipelined datapath with its pipeline registers, carrying the following annotation.]
This is a bug: rd must flow through the pipeline with the instruction. This signal needs to come from the WB stage.
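The effect of this bug can be sketched with a small timing model. It assumes one instruction enters the pipeline per cycle with no stalls, so instruction i is in decode at cycle i + 1 and in write back at cycle i + 4; the function name and the list of destination registers are hypothetical.

```python
def writeback_targets(dest_regs, rd_from_wb_stage):
    """Return (instruction index, register number written) for each instruction.

    dest_regs[i] is the destination register of instruction i. When
    rd_from_wb_stage is False, the write address is taken from whatever
    instruction is in decode during the write-back cycle (the bug).
    """
    n = len(dest_regs)
    writes = []
    for i in range(n):
        wb_cycle = i + 4                 # IF=i, ID=i+1, EX=i+2, MEM=i+3, WB=i+4
        if rd_from_wb_stage:
            reg = dest_regs[i]           # rd travelled with the instruction
        else:
            in_decode = wb_cycle - 1     # index of the instruction now in decode
            reg = dest_regs[in_decode] if in_decode < n else None
        writes.append((i, reg))
    return writes


dests = [1, 2, 3, 4, 5, 6]
fixed = writeback_targets(dests, rd_from_wb_stage=True)
buggy = writeback_targets(dests, rd_from_wb_stage=False)
```

In this model the buggy wiring makes instruction 0 write its result into register 4, the destination of the instruction three slots behind it, which is why the register number has to be carried through the Dec/Exec, Exec/Mem, and Mem/WB registers.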
Pipelined Control
• Control lives in the decode stage; signals flow down the pipe with the instructions they correspond to
[Figure: the pipelined datapath with a Control block in decode that produces ex ctrl signals, mem ctrl signals, and write back ctrl signals, which travel along the pipeline registers.]
Impact of Pipelining
• ET = IC * CPI * CT
• Break the processor into P pipe stages
  • CTnew = CT/P
  • CPInew = CPIold
    • CPI is an average: cycles/instruction
    • The latency of one instruction is P cycles
    • The average CPI = 1
  • ICnew = ICold
• Total speedup should be P times (5x for the 5-stage pipeline)!
  • Except for the overhead of the pipeline registers
  • And the realities of logic design...
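The claim that each instruction takes P cycles while the average CPI stays at 1 can be checked by counting cycles. This sketch assumes one instruction issues every cycle and nothing ever stalls; the function names are made up.

```python
def pipelined_cycles(ic, stages):
    """Cycles to finish ic instructions: fill the pipe once, then one per cycle."""
    return stages + (ic - 1) if ic > 0 else 0


def average_cpi(ic, stages):
    """Average cycles per instruction over the whole run."""
    return pipelined_cycles(ic, stages) / ic


one_instruction = pipelined_cycles(1, 5)  # latency of a single instruction
long_run = pipelined_cycles(1000, 5)      # cycles for 1000 instructions
cpi = average_cpi(1000, 5)
```

A single instruction needs all 5 cycles, but 1000 instructions finish in 1004 cycles, so the average CPI is 1.004 and approaches 1 as the run gets longer. Combined with CTnew = CT/P, that is where the ideal P-times speedup comes from.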
[Figure series: module delays, annotated one at a time.]
Imem         2.77 ns
Ctrl         0.797 ns
ArgBMux      1.124 ns
ALU          5.94 ns
Dmem         2.098 ns
WriteRegMux  2.628 ns
RegFile      1.424 ns
![Page 71: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/71.jpg)
[Figure: cycle time broken down by component: PC, IMem, RegFile Read, Control, Mux, ALU, DMem read, DMem write, Mux, +, RegFile Write]
![Page 77: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/77.jpg)
• What we want: 5 balanced pipeline stages
  • Each stage would take 15.8 ns / 5 = 3.16 ns
  • Clock speed: 316 MHz
  • Speedup: 5x
• What’s easy: 5 unbalanced pipeline stages
  • Longest stage: 6.14 ns <- this is the critical path
  • Shortest stage: 0.29 ns
  • Clock speed: 163 MHz
  • Speedup: 2.6x
• Unpipelined: cycle time = 15.8 ns, clock speed = 63 MHz
• Pipelining inefficiencies reduce our improvement in CT from 5x to 2.6x
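The clock speeds and speedups quoted here follow directly from the cycle times. A minimal sketch of the arithmetic, assuming 15.8 ns is the unpipelined cycle time and 6.14 ns the slowest stage (the constant and function names are illustrative, not from the lecture):

```python
# Numbers are taken from the slide; the names are made up for this sketch.
SINGLE_CYCLE_NS = 15.8     # unpipelined cycle time
STAGES = 5
LONGEST_STAGE_NS = 6.14    # slowest stage of the unbalanced pipeline


def clock_mhz(cycle_time_ns: float) -> float:
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / cycle_time_ns


# Ideal case: the logic splits into equal-length stages.
balanced_ct = SINGLE_CYCLE_NS / STAGES
balanced_speedup = SINGLE_CYCLE_NS / balanced_ct

# Easy case: the clock must be slow enough for the longest stage.
unbalanced_speedup = SINGLE_CYCLE_NS / LONGEST_STAGE_NS

print(f"unpipelined: {clock_mhz(SINGLE_CYCLE_NS):.0f} MHz")
print(f"balanced:    {balanced_ct:.2f} ns, {clock_mhz(balanced_ct):.0f} MHz, {balanced_speedup:.1f}x")
print(f"unbalanced:  {clock_mhz(LONGEST_STAGE_NS):.0f} MHz, {unbalanced_speedup:.1f}x")
```

The cycle time of a pipeline is set by its slowest stage, which is why the unbalanced design reaches only about half of the ideal speedup.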
![Page 78: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/78.jpg)
A Structural Hazard
• Both the decode and write back stages have to access the register file.
• There is only one register file. A structural hazard!!
• Solution: Write early, read late
  • Writes occur at the clock edge and complete long before the end of the cycle
  • This leaves enough time for the outputs to settle for the reads.
• Hazard avoided!

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back]
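A toy model of "write early, read late" may help. This sketch assumes each cycle is split into a write phase followed by a read phase; the `RegisterFile` class and its `cycle` method are hypothetical and only illustrate the ordering, not the real hardware timing:

```python
class RegisterFile:
    """Toy register file: the write lands first, the reads happen afterwards."""

    def __init__(self, size: int = 32) -> None:
        self.regs = [0] * size

    def cycle(self, write=None, reads=()):
        """Run one clock cycle.

        write: optional (register_index, value) coming from write back.
        reads: register indices requested by decode in the same cycle.
        """
        if write is not None:              # early in the cycle: write back commits
            index, value = write
            if index != 0:                 # MIPS register 0 is hard-wired to zero
                self.regs[index] = value
        return [self.regs[r] for r in reads]   # late in the cycle: decode reads


rf = RegisterFile()
# Write back stores 42 into $s0 (register 16) while decode reads $s0 and $t0 (register 8).
values = rf.cycle(write=(16, 42), reads=(16, 8))
print(values)
```

Because the write is applied before the reads, decode sees the value written in the same cycle and the single register file can serve both stages.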
![Page 79: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/79.jpg)
Quiz 4
1. For my CPU, CPI = 1 and CT = 4 ns.
   a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline?
   b. What will the new CT, CPI, and IC be in the ideal case?
   c. Give 2 reasons why achieving the speedup in (a) will be very difficult.
2. Instructions in a VLIW instruction word are executed
   a. Sequentially
   b. In parallel
   c. Conditionally
   d. Optionally
3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like.
4. List 3 characteristics of MIPS that make it a RISC ISA.
5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide.
![Page 80: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/80.jpg)
HW 1 Score Distribution
[Figure: histogram of # of students (axis 0 to 20) by grade: F, D-, D+, C, B-, B+, A]
![Page 81: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/81.jpg)
HW 2 Score Distribution
[Figure: histogram of # of students (axis 0 to 18) by grade: F, D-, D+, C, B-, B+, A]
![Page 82: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/82.jpg)
Projected Grade Distribution
[Figure: histogram of # of students (axis 0 to 18) by grade: F, D-, D+, C, B-, B+, A]
![Page 83: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/83.jpg)
What You Like
• More “real world” stuff.
• More class interactions.
• Cool “factoids” side notes, etc.
• Example in the slides.
• Jokes
![Page 84: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/84.jpg)
What You Would Like to See Improved
• More time for the quizzes.
• Late-breaking changes to the homeworks.
• Too much discussion of latency and metrics.
• It’d be nice to have the slides before lecture.
• Too much math.
• Not enough time to ask questions
• Going too fast.
• MIPS is not the most relevant ISA.
• The subject of the class.
• More precise instructions for HW and quizzes.
• The pace should be faster
• The pace should be slower
• The software (SPIM, Quartus, and Modelsim) are hard to use
![Page 85: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/85.jpg)
Pipelining is Tricky
• Simple pipelining is easy
  • If the data flows in one direction only
  • If the stages are independent
  • In fact the tool can do this automatically via “retiming” (If you are curious, experiment with this in Quartus).
• Not so, for processors.
  • Branch instructions affect the next PC -- backward flow
  • Instructions need values computed by previous instructions -- not independent
![Page 86: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/86.jpg)
Not just tricky, Hazardous!
• Hazards are situations where pipelining does not work as elegantly as we would like
• Three kinds
  • Structural hazards -- we have run out of a hardware resource.
  • Data hazards -- an input is not available on the cycle it is needed.
  • Control hazards -- the next instruction is not known.
• Dealing with hazards increases complexity or decreases performance (or both)
• Dealing efficiently with hazards is much of what makes processor design hard.
  • That, and the Quartus tools ;-)
![Page 87: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/87.jpg)
Hazards: Key Points
• Hazards cause imperfect pipelining
  • They prevent us from achieving CPI = 1
  • They are generally caused by “counter flow” data dependences in the pipeline
• Three kinds
  • Structural -- contention for hardware resources
  • Data -- a data value is not available when/where it is needed.
  • Control -- the next instruction to execute is not known.
• Two ways to deal with hazards
  • Removal -- add hardware and/or complexity to work around the hazard so it does not occur
  • Stall -- Hold up the execution of new instructions. Let the older instructions finish, so the hazard will clear.
![Page 88: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/88.jpg)
Data Dependences
• A data dependence occurs whenever one instruction needs a value produced by another.
  • Register values
  • Also memory accesses (more on this later)
add $s0, $t0, $t1
sub $t2, $s0, $t3
add $t3, $s0, $t4
add $t3, $t2, $t4
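The register dependences in this sequence can be found mechanically. A small sketch, assuming every instruction has the three-register form `op dest, src1, src2` (the `parse` and `raw_dependences` helpers are made up for illustration and ignore memory dependences):

```python
import re

PROGRAM = [
    "add $s0, $t0, $t1",
    "sub $t2, $s0, $t3",
    "add $t3, $s0, $t4",
    "add $t3, $t2, $t4",
]


def parse(inst: str):
    """Split a three-register instruction into (op, dest, sources)."""
    op, rest = inst.split(None, 1)
    dest, *sources = re.findall(r"\$\w+", rest)
    return op, dest, sources


def raw_dependences(program):
    """Return (producer_index, consumer_index, register) triples."""
    deps = []
    last_writer = {}                 # register -> index of its most recent writer
    for i, inst in enumerate(program):
        _, dest, sources = parse(inst)
        for reg in sources:
            if reg in last_writer:   # value was produced by an earlier instruction
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


for producer, consumer, reg in raw_dependences(PROGRAM):
    print(f"{PROGRAM[consumer]!r} needs {reg} from {PROGRAM[producer]!r}")
```

For this program the sketch reports three dependences: both `sub` and the second `add` need `$s0` from the first `add`, and the last `add` needs `$t2` from `sub`.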
![Page 94: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/94.jpg)
Dependences in the pipeline
• In our simple pipeline, these instructions cause a data hazard
  • sub reads $s0 in Decode (Cyc 3), but add does not write it until Writeback (Cyc 5)

Time                Cyc 1     Cyc 2     Cyc 3     Cyc 4     Cyc 5     Cyc 6
add $s0, $t0, $t1   Fetch     Decode    EX        Mem       Writeback
sub $t2, $s0, $t3             Fetch     Decode    EX        Mem       Writeback
![Page 97: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/97.jpg)
How can we fix it?
• Ideas?
![Page 99: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/99.jpg)
Solution 1: Make the compiler deal with it.
• Expose hazards to the big A architecture
  • A result is available N instructions after the instruction that generates it.
  • In the meantime, the register file has the old value.
  • This is called “a register delay slot”
• What is N? Can it change?
  • N = 2, for our design

add $s0, $t0, $t1   Fetch     Decode    EX        Mem       Writeback
delay slot                    Fetch     Decode    EX        Mem       Writeback
delay slot                              Fetch     Decode    EX        Mem       Writeback
sub $t2, $s0, $t3                                 Fetch     Decode    EX        Mem       Writeback
![Page 100: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/100.jpg)
Compiling for delay slots
• The compiler must fill the delay slots
  • Ideally, with useful instructions, but nops will work too.

Original code:
add $s0, $t0, $t1
sub $t2, $s0, $t3
add $t3, $s0, $t4
and $t7, $t5, $t4

After filling the delay slots:
add $s0, $t0, $t1
and $t7, $t5, $t4
nop
sub $t2, $s0, $t3
add $t3, $s0, $t4
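The simplest possible "compiler" for delay slots just pads with nops and never reorders. A sketch under the assumptions that N = 2 and that every instruction is either `nop` or of the form `op dest, src1, src2` (the `regs` and `pad_with_nops` helpers are hypothetical):

```python
import re

DELAY_SLOTS = 2   # N = 2 for the lecture's five-stage design


def regs(inst: str):
    """Return (dest, sources) for a three-register instruction; nop has neither."""
    found = re.findall(r"\$\w+", inst)
    return (found[0], found[1:]) if found else (None, [])


def pad_with_nops(program, delay_slots=DELAY_SLOTS):
    """Insert nops so that no instruction reads a register within
    `delay_slots` instructions of the one that writes it."""
    out = []
    for inst in program:
        _, sources = regs(inst)
        needed = 0
        # Walk back over the most recently emitted instructions.
        recent = reversed(out[-delay_slots:]) if delay_slots else []
        for distance, earlier in enumerate(recent, start=1):
            dest, _ = regs(earlier)
            if dest in sources:
                needed = max(needed, delay_slots - distance + 1)
        out.extend(["nop"] * needed)
        out.append(inst)
    return out


original = [
    "add $s0, $t0, $t1",
    "sub $t2, $s0, $t3",
    "add $t3, $s0, $t4",
    "and $t7, $t5, $t4",
]
for line in pad_with_nops(original):
    print(line)
```

Padding alone costs two nops here. The schedule on the slide does better by moving the independent `and` into the first slot, so only one nop is wasted; running `pad_with_nops` on that schedule leaves it unchanged.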
![Page 103: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/103.jpg)
Solution 2: Stall
• When you need a value that is not ready, “stall”
  • Suspend the execution of the instruction that needs the value, and those that follow.
  • This introduces a pipeline “bubble.”
• A bubble is a lack of work to do; it propagates through the pipeline like nop instructions

                    Cyc 1     Cyc 2     Cyc 3     Cyc 4     Cyc 5     Cyc 6     Cyc 7     Cyc 8     Cyc 9     Cyc 10
add $s0, $t0, $t1   Fetch     Decode    EX        Mem       Writeback
NOP (bubble)                                      EX        Mem       Writeback
NOP (bubble)                                                EX        Mem       Writeback
sub $t2, $s0, $t3             Fetch     Decode    Decode    Decode    EX        Mem       Writeback
add $t3, $s0, $t4                       Fetch     Fetch     Fetch     Decode    EX        Mem       Writeback

• Both of these instructions (sub and the second add) are stalled
• One instruction or nop completes each cycle
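The stall diagram can be reproduced with a few lines of bookkeeping. This is a sketch under simplifying assumptions: no forwarding, three-register instructions only, and a register file that can be read in the same cycle in which it is written. The `simulate` function is illustrative, not part of the lecture:

```python
import re


def simulate(program):
    """Return one {cycle: stage} dict per instruction for a 5-stage pipeline
    that holds an instruction in Decode until its sources are written back."""
    writeback_cycle = {}    # register -> cycle in which its value is written back
    prev_decode_start = 1   # makes the first instruction fetch in cycle 1
    prev_ex = 2             # makes the first instruction decode in cycle 2
    timelines = []
    for inst in program:
        dest, *sources = re.findall(r"\$\w+", inst)
        fetch_start = prev_decode_start   # fetched once the previous one is in Decode
        decode_start = prev_ex            # Decode frees up when the previous one enters EX
        # Last cycle spent in Decode: all producers must have reached Writeback.
        ready = max([decode_start] + [writeback_cycle.get(r, 0) for r in sources])
        ex = ready + 1
        timeline = {}
        for cycle in range(fetch_start, decode_start):
            timeline[cycle] = "Fetch"
        for cycle in range(decode_start, ex):
            timeline[cycle] = "Decode"
        timeline[ex] = "EX"
        timeline[ex + 1] = "Mem"
        timeline[ex + 2] = "Writeback"
        timelines.append(timeline)
        writeback_cycle[dest] = ex + 2
        prev_decode_start, prev_ex = decode_start, ex
    return timelines


program = ["add $s0, $t0, $t1", "sub $t2, $s0, $t3", "add $t3, $s0, $t4"]
for inst, timeline in zip(program, simulate(program)):
    print(inst, [f"{c}:{s}" for c, s in sorted(timeline.items())])
```

In this model `sub` spends cycles 3 to 5 in Decode and the second `add` spends cycles 3 to 5 in Fetch, matching the two-cycle bubble in the diagram.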
![Page 105: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/105.jpg)
Stalling the pipeline
• Freeze all pipeline stages before the stage where the hazard occurred.
  • Disable the PC update
  • Disable the pipeline registers
• This is equivalent to inserting nops into the pipeline when a hazard exists
  • Insert nop control bits at the stalled stage (decode in our example)
• How is this solution still potentially “better” than relying on the compiler?
  • The compiler can still act like there are delay slots to avoid stalls.
  • Implementation details are not exposed in the ISA
![Page 106: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/106.jpg)
Calculating CPI for Stalls
• In this case, the bubble lasts for 2 cycles.
• As a result, in cycles 6 and 7, no instruction completes.
• What happens to CPI?
  • In the absence of stalls, CPI is one, since one instruction completes per cycle
  • If an instruction stalls for N cycles, its CPI goes up by N

add $s0, $t0, $t1
sub $t2, $s0, $t3
add $t3, $s0, $t4
and $t3, $t2, $t4

[Figure: pipeline diagram of these instructions over Cyc 1 to Cyc 10; sub repeats Decode and add $t3, $s0, $t4 repeats Fetch while the stall lasts]
![Page 107: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/107.jpg)
Midterm and Midterm Review
• Midterm in 1 week!
• Midterm review in 2 days!
  • I’ll just take questions from you.
  • Come prepared with questions.
• Today
  • Go over last 2 quizzes
  • More about hazards
![Page 108: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/108.jpg)
Quiz 4
1. For my CPU, CPI = 1 and CT = 4 ns.
   a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline?
      • 8x
   b. What will the new CT, CPI, and IC be in the ideal case?
      • CT = CT/8 = 0.5 ns, CPI = 1, IC = IC (unchanged)
   c. Give 2 reasons why achieving the speedup in (a) will be very difficult.
      • Setup time; difficult-to-pipeline logic; register logic delay.
2. Instructions in a VLIW instruction word are executed
   a. Sequentially
   b. In parallel
   c. Conditionally
   d. Optionally
3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like.
   • x86, ARM, MIPS
4. List 3 characteristics of MIPS that make it a RISC ISA.
   • Fixed inst size; few inst formats; orthogonality; general purpose registers; designed for fast implementations; few addressing modes.
5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide.
   • Reduced code size
   • User-friendliness for hand-written assembly
![Page 109: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/109.jpg)
Hardware for Stalling
• Turn off the enables on the earlier pipeline stages
  • The earlier stages will keep processing the same instruction over and over.
  • No new instructions get fetched.
• Insert control and data values corresponding to a nop into the “downstream” pipeline register.
  • This will create the bubble.
  • The nops will flow downstream, doing nothing.
• When the stall is over, re-enable the pipeline registers
  • The instructions in the “upstream” stages will start moving again.
  • New instructions will start entering the pipeline again.

[Figure: pipeline stages Fetch, Decode/reg read, EX, Mem, Writeback with stall control signals stall_pc and stall_fetch_decode on the enb (enable) inputs, and inject_nop_decode_execute selecting 0s]
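The effect of these control signals on one clock edge can be sketched in a few lines. This is a behavioral model only; it assumes a single `stall` flag drives stall_pc, stall_fetch_decode, and inject_nop_decode_execute together, and the `clock_edge` function is hypothetical:

```python
NOP = "nop"


def clock_edge(pc, fetch_decode, decode_execute, fetched_inst, stall):
    """Advance the front of the pipeline by one clock edge.

    pc:             current program counter
    fetch_decode:   instruction held in the Fetch/Decode pipeline register
    decode_execute: instruction held in the Decode/Execute pipeline register
    fetched_inst:   instruction that Fetch has just read at `pc`
    Returns the new (pc, fetch_decode, decode_execute).
    """
    if stall:
        # Enables off: PC and Fetch/Decode keep their values.
        # A nop is injected downstream, which creates the bubble.
        return pc, fetch_decode, NOP
    # Normal operation: everything moves one stage forward.
    return pc + 4, fetched_inst, fetch_decode


# sub is in Decode and needs $s0, which add (now in EX) has not written back yet.
state = (8, "sub $t2, $s0, $t3", "add $s0, $t0, $t1")
state = clock_edge(*state, fetched_inst="add $t3, $s0, $t4", stall=True)
print(state)
state = clock_edge(*state, fetched_inst="add $t3, $s0, $t4", stall=False)
print(state)
```

While `stall` is true the same instruction stays in Decode and the same address keeps being fetched; once it clears, both start moving again.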
![Page 110: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/110.jpg)
The Impact of Stalling On Performance
• ET = I * CPI * CT
• I and CT are constant
• What is the impact of stalling on CPI?
  • What do we need to know to figure it out?
![Page 112: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/112.jpg)
The Impact of Stalling On Performance
• ET = I * CPI * CT
• I and CT are constant
• What is the impact of stalling on CPI?
  • Fraction of instructions that stall: 30%
  • Baseline CPI = 1
  • Stall CPI = 1 + 2 = 3
• New CPI = 0.3*3 + 0.7*1 = 1.6
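The new CPI is a weighted average over the instructions that stall and those that do not. A short sketch of that calculation (the `average_cpi` function and its parameter names are illustrative):

```python
def average_cpi(stall_fraction: float, stall_cycles: int, base_cpi: float = 1.0) -> float:
    """Average CPI when `stall_fraction` of instructions each stall for `stall_cycles`."""
    stalled_cpi = base_cpi + stall_cycles
    return stall_fraction * stalled_cpi + (1.0 - stall_fraction) * base_cpi


new_cpi = average_cpi(stall_fraction=0.30, stall_cycles=2)
# With I and CT constant, ET = I * CPI * CT scales directly with CPI.
slowdown = new_cpi / 1.0
print(f"new CPI = {new_cpi:.1f}, execution time grows by {slowdown:.1f}x")
```

Since I and CT do not change, the 1.6 CPI means the program takes 1.6 times as long as it would with no stalls.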
![Page 113: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/113.jpg)
Solution 3: Bypassing/Forwarding
• Data values are computed in EX and Mem but only “publicized” in write back
• The data exists! We should use it.

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back, annotated with where inputs are needed, where results are known, and where results are "published" to registers]
![Page 114: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/114.jpg)
Bypassing or Forwarding
• Take the values, wherever they are

Cycles
add $s0, $t0, $t1   Fetch     Decode    EX        Mem       Write back
sub $t2, $s0, $t3             Fetch     Decode    EX        Mem       Write back
![Page 115: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/115.jpg)
Forwarding Paths

Cycles
add $s0, $t0, $t1   Fetch     Decode    EX        Mem       Write back
sub $t2, $s0, $t3             Fetch     Decode    EX        Mem       Write back
sub $t2, $s0, $t3                       Fetch     Decode    EX        Mem       Write back
sub $t2, $s0, $t3                                 Fetch     Decode    EX        Mem       Write back
![Page 116: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/116.jpg)
Forwarding in Hardware
[Figure: pipelined MIPS datapath with PC, Instruction Memory, Register File, Sign Extend, Shift left 2, adders, ALU, and Data Memory, separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline registers]
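The selection logic in front of each ALU input can be sketched as a priority check. This follows a common textbook formulation rather than anything spelled out on the slides; the `PipeReg` type, the `forward_select` function, and the returned labels are all assumptions made for illustration:

```python
from typing import NamedTuple, Optional


class PipeReg(NamedTuple):
    """The fields of a pipeline register that matter for forwarding."""
    reg_write: bool        # does the instruction held here write a register?
    dest: Optional[str]    # its destination register, e.g. "$s0"


def forward_select(src: str, exec_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Choose where an ALU input should come from for source register `src`."""
    if src == "$zero":
        return "regfile"               # $zero always reads as 0, never forwarded
    if exec_mem.reg_write and exec_mem.dest == src:
        return "exec_mem"              # newest value: the previous instruction
    if mem_wb.reg_write and mem_wb.dest == src:
        return "mem_wb"                # value from two instructions back
    return "regfile"                   # no hazard: use the value read in Decode


# sub $t2, $s0, $t3 is in EX while add $s0, $t0, $t1 sits in Exec/Mem.
add_in_exec_mem = PipeReg(reg_write=True, dest="$s0")
empty = PipeReg(reg_write=False, dest=None)
print(forward_select("$s0", add_in_exec_mem, empty))
print(forward_select("$t3", add_in_exec_mem, empty))
```

Checking Exec/Mem before Mem/WB matters when both hold a write to the same register: the younger result is the one the program expects.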
![Page 117: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/117.jpg)
Hardware Cost of Forwarding
• In our pipeline, adding forwarding required relatively little hardware.
• For deeper pipelines it gets much more expensive
  • Roughly: number of ALUs * number of pipe stages you need to forward over
  • Some modern processors have multiple ALUs (4-5)
  • And deeper pipelines (4-5 stages to forward across)
• Not all forwarding paths need to be supported.
  • If a path does not exist, the processor will need to stall.
![Page 120: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/120.jpg)
Forwarding for Loads
• Values can also come from the Mem stage
• In this case, forwarding to the next instruction is not possible (and is not necessary for later instructions)
• Choices
  • Always stall!
  • Stall when needed!
  • Expose this in the ISA.
[Pipeline diagram: ld $s0, 0($t0) moves through Fetch, Decode, EX, Mem, Write back; sub $t2, $s0, $t3 follows one cycle behind and needs $s0 in EX while the load is still in Mem.]
Time travel presents significant implementation challenges
Which does MIPS do?
![Page 121: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/121.jpg)
Pros and Cons
• Punt to the compiler
  • This is what MIPS does and is the source of the load-delay slot
  • Future versions must emulate a single load-delay slot.
  • The compiler fills the slot if possible, or drops in a nop.
• Always stall.
  • The compiler is oblivious, but performance will suffer
  • 10-15% of instructions are loads, and the CPI for loads will be 2
• Forward when possible, stall otherwise
  • Here the compiler can order instructions to avoid the stall.
  • If the compiler can’t fix it, the hardware will.
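The cost of the "always stall" option can be estimated with a quick calculation. The sketch below assumes every non-load instruction has a CPI of 1 and that nothing else stalls; the function name and defaults are illustrative, not part of the original slides.

```python
def cpi_always_stall(load_fraction, base_cpi=1.0, load_cpi=2.0):
    """Average CPI when every load costs load_cpi cycles
    and every other instruction costs base_cpi cycles."""
    return load_fraction * load_cpi + (1.0 - load_fraction) * base_cpi


# The slide's range: 10-15% of instructions are loads.
for frac in (0.10, 0.15):
    print(f"loads = {frac:.0%}: CPI = {cpi_always_stall(frac):.2f}")
```

Under these assumptions the average CPI rises from 1.0 to roughly 1.10-1.15, which is the performance cost the "always stall" bullet refers to.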
![Page 124: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/124.jpg)
Stalling for Load
[Pipeline diagram: Load $s1, 0($s1) followed by the dependent Addi $t1, $s1, 4, shown with a one-cycle stall.]
• To “stall” we insert a noop in place of the instruction and freeze the earlier stages of the pipeline
• All stages of the pipeline earlier than the stall stand still.
• Only four stages are occupied. What’s in Mem?
![Page 125: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/125.jpg)
Inserting Noops
• To “stall” we insert a noop in place of the instruction and freeze the earlier stages of the pipeline
[Pipeline diagram: Load $s1, 0($s1) followed by Addi $t1, $s1, 4, with a noop inserted between them.]
• The noop is in Mem
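As a rough illustration of when the noop gets inserted, the sketch below counts load-use stalls in a short program. It assumes full forwarding, so the only stall is one noop when an instruction reads a register loaded by the instruction directly before it. The `Inst` record and function names are hypothetical helpers, not anything from the slides.

```python
from collections import namedtuple

# Hypothetical instruction record: text, destination register,
# source registers, and a flag marking loads.
Inst = namedtuple("Inst", "text dest srcs is_load")


def count_load_use_stalls(program):
    """Count noops needed when an instruction uses the result of the
    load immediately before it (forwarding covers every other case)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.is_load and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """Cycles to run the program: fill the pipeline, then one
    instruction per cycle, plus one cycle per inserted noop."""
    if not program:
        return 0
    return len(program) + (stages - 1) + count_load_use_stalls(program)


program = [
    Inst("Load $s1, 0($s1)", "$s1", ("$s1",), True),
    Inst("Addi $t1, $s1, 4", "$t1", ("$s1",), False),
]
print(count_load_use_stalls(program))  # 1
print(total_cycles(program))           # 2 + 4 + 1 = 7
```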
![Page 126: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/126.jpg)
Quiz 1 Score Distribution
[Histogram: # of students (y-axis) by letter grade from F to A (x-axis)]
![Page 127: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/127.jpg)
Quiz 2 Score Distribution
[Histogram: # of students (y-axis) by letter grade from F to A (x-axis)]
![Page 128: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/128.jpg)
Quiz 3 Score Distribution
[Histogram: # of students (y-axis) by letter grade from F to A (x-axis)]
![Page 129: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/129.jpg)
Quiz 4
[Histogram: # of students (y-axis) by letter grade from F to A (x-axis)]
![Page 130: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/130.jpg)
Key Points: Control Hazards
• Control hazards occur when we don’t know what the next instruction is
• Caused by branches and jumps.
• Strategies for dealing with them
  • Stall
  • Guess!
    • Leads to speculation
    • Flushing the pipeline
    • Strategies for making better guesses
• Understand the difference between stall and flush
![Page 133: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/133.jpg)
Computing the PC Normally
• Non-branch instruction
  • PC = PC + 4
• When is PC ready?
[Pipeline diagram: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 instructions moving through Fetch, Decode, EX, Mem, Writeback.]
![Page 134: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/134.jpg)
Fixing the Ubiquitous Control Hazard
• We need to know if an instruction is a branch in the fetch stage!
• How can we accomplish this?
[Pipeline stages: Fetch, Decode, EX, Mem, Write back]
• Solution 1: Partially decode the instruction in fetch. You just need to know if it’s a branch, a jump, or something else.
• Solution 2: We’ll discuss later.
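Solution 1 can be sketched as a small classifier over the instruction word. The sketch below uses the standard MIPS32 field layout (opcode in the top 6 bits, funct in the low 6 bits) and covers only a handful of common control instructions; it is an illustration, not an exhaustive pre-decoder. The function name `predecode` is made up for this example.

```python
# Standard MIPS32 encodings (a partial list, for illustration only).
BRANCH_OPCODES = {0x04, 0x05, 0x06, 0x07}  # beq, bne, blez, bgtz
JUMP_OPCODES = {0x02, 0x03}                # j, jal
JUMP_REG_FUNCTS = {0x08, 0x09}             # jr, jalr (R-type, opcode 0)


def predecode(inst):
    """Classify a 32-bit instruction word as 'branch', 'jump', or 'other'."""
    opcode = (inst >> 26) & 0x3F
    funct = inst & 0x3F
    if opcode in BRANCH_OPCODES:
        return "branch"
    if opcode in JUMP_OPCODES or (opcode == 0 and funct in JUMP_REG_FUNCTS):
        return "jump"
    return "other"


# bne $t2, $s0, <offset 4>: opcode 0x05, rs = 10 ($t2), rt = 16 ($s0)
bne_word = (0x05 << 26) | (10 << 21) | (16 << 16) | 0x0004
# jr $s4: opcode 0, rs = 20 ($s4), funct 0x08
jr_word = (20 << 21) | 0x08
# add $s0, $t0, $t1: opcode 0, rs = 8, rt = 9, rd = 16, funct 0x20
add_word = (8 << 21) | (9 << 16) | (16 << 11) | 0x20

for word in (bne_word, jr_word, add_word):
    print(f"{word:#010x} -> {predecode(word)}")
```

Only a few bits of the instruction are inspected, which is why this check is cheap enough to be done in fetch.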
![Page 138: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/138.jpg)
Computing the PC Normally
• Pre-decode in the fetch unit.
  • PC = PC + 4
• The PC is ready for the next fetch cycle.
[Pipeline diagram: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 instructions moving through Fetch, Decode, EX, Mem, Writeback.]
![Page 139: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/139.jpg)
Computing the PC for Branches
• Branch instructions
  • bne $s1, $s2, offset
  • if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }
• When is the value ready?
[Pipeline diagram: bne $t2, $s0, somewhere shown with sll $s4, $t6, $t5; and $s4, $t0, $t1; add $s0, $t0, $t1.]
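The branch rule on this slide can be written as a small function. This sketch follows the slide's simplified form (PC + offset); real MIPS computes the target as PC + 4 + (offset << 2), which is omitted here to stay close to the slide. The function and parameter names are illustrative.

```python
def next_pc(pc, is_bne, rs_value, rt_value, offset):
    """Next PC under the slide's simplified bne rule.

    Both register values must be read before the comparison can be
    made, so the result is not available in the fetch stage."""
    if is_bne and rs_value != rt_value:
        return pc + offset   # branch taken
    return pc + 4            # fall through


print(hex(next_pc(0x100, True, 1, 2, 0x40)))   # taken: 0x140
print(hex(next_pc(0x100, True, 3, 3, 0x40)))   # not taken: 0x104
print(hex(next_pc(0x100, False, 1, 2, 0x40)))  # not a branch: 0x104
```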
![Page 141: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/141.jpg)
Computing the PC for Jumps
• Jump instructions
  • jr $s1 -- jump register
  • PC = $s1
• When is the value ready?
[Pipeline diagram: jr $s4 shown with sll $s4, $t6, $t5 and add $s0, $t0, $t1.]
![Page 143: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/143.jpg)
Dealing with Branches: Option 0 -- stall
• What does this do to our CPI?
[Pipeline diagram: bne $t2, $s0, somewhere shown with sll $s4, $t6, $t5; and $s4, $t0, $t1; add $s0, $t0, $t1. Fetch stalls after the branch and NOPs move through the pipeline in place of the missing instructions.]
![Page 144: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/144.jpg)
Option 1: The compiler
• Use “branch delay” slots.
• The next N instructions after a branch are always executed
• How big is N?
  • For jumps?
  • For branches?
• Good
  • Simple hardware
• Bad
  • N cannot change.
![Page 145: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/145.jpg)
Delay slots.
[Pipeline diagram: a taken bne $t2, $s0, somewhere, the branch delay, and the target somewhere: sub $t2, $s0, $t3. Other instructions shown: add $t2, $s4, $t1 and add $s0, $t0, $t1.]
![Page 146: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/146.jpg)
But MIPS Only Has One Delay Slot!
• The second branch delay slot is expensive!
  • Filling one slot is hard. Filling two is even more so.
• Solution!: Resolve branches in decode.
[Datapath diagram: the pipelined MIPS datapath (PC, Instruction Memory, Register File, sign extend, ALU, Data Memory, Control) with a comparator (CMP) that produces the jump/branch signals.]
![Page 147: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/147.jpg)
For the rest of this slide deck, we will assume that MIPS has no branch delay slot.
If you are unsure whether part of the homework/test/quiz makes this assumption, ask.
![Page 148: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/148.jpg)
Option 2: Simple Prediction
• Can a processor tell the future?
• For non-taken branches, the new PC is ready immediately.
• Let’s just assume the branch is not taken
• Also called “branch prediction” or “control speculation”
• What if we are wrong?
• Branch prediction vocabulary
  • Prediction -- a guess about whether a branch will be taken or not taken
  • Misprediction -- a prediction that turns out to be incorrect.
  • Misprediction rate -- fraction of predictions that are incorrect.
![Page 149: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/149.jpg)
Predict Not-taken
• We start the add, and then, when we discover the branch outcome, we squash it.
  • Also called “flushing the pipeline”
• Just like a stall, flushing one instruction increases the branch’s CPI by 1
[Pipeline diagram: bne $t2, $s0, foo; bne $t2, $s4, else; add $s4, $t6, $t5; and $s4, $t0, $t1; and the target else: sub $t2, $s0, $t3. Labels on the diagram: Not-taken, Taken, and Squash (next to the and instruction).]
![Page 152: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/152.jpg)
Flushing the Pipeline
• When we flush the pipe, we convert instructions into noops
  • Turn off the write enables for the write back and mem stages
  • Disable branches (i.e., make sure the ALU does not raise the branch signal).
• Instructions do not stop moving through the pipeline
• For the example on the previous slide the “inject_nop_decode_execute” signal will go high for one cycle.
[Pipeline control diagram: stages Fetch, Decode/Regfile, EX, Mem, Writeback with control signals stall_pc, stall_fetch_decode, inject_nop_decode_execute, and inject_nop_execute_mem.]
Callouts on the diagram: “These signals for stalling” and “This signal is for both stalling and flushing”.
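The difference between a stall and a flush can be sketched with a toy model of the pipeline latches. This is a simplified illustration under stated assumptions: a stall freezes the fetch/decode latch and injects a noop into decode/execute, while a flush only injects the noop and lets everything else keep moving. The signal names come from the slide; the `clock` function, its arguments, and the omission of `stall_pc` (modeled implicitly by not consuming the fetched instruction) are assumptions made for this example.

```python
NOP = "nop"


def clock(fetch_decode, decode_execute, fetched,
          stall_fetch_decode=False, inject_nop_decode_execute=False):
    """Return the latch contents after one clock edge as
    (fetch_decode, decode_execute, execute_mem)."""
    # The instruction in execute always moves on to mem.
    new_em = decode_execute
    # Decode/execute gets the instruction leaving decode, or a noop.
    new_de = NOP if inject_nop_decode_execute else fetch_decode
    # Fetch/decode is frozen on a stall, otherwise it takes the new fetch.
    new_fd = fetch_decode if stall_fetch_decode else fetched
    return new_fd, new_de, new_em


# Stall: the dependent addi stays in decode, a noop follows the load.
print(clock("addi", "lw", "next",
            stall_fetch_decode=True, inject_nop_decode_execute=True))

# Flush: the wrong-path instruction becomes a noop, fetch keeps moving.
print(clock("wrong_path", "bne", "target",
            inject_nop_decode_execute=True))
```

In the stall case the held instruction is still in the fetch/decode latch afterwards; in the flush case it is gone and the newly fetched instruction has taken its place.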
![Page 155: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/155.jpg)
Simple “static” Prediction
• “static” means before run time
• Many prediction schemes are possible
• Predict taken
  • Pros? Loops are common.
• Predict not-taken
  • Pros? Not all branches are for loops.
• Backward taken/Forward not taken
  • The best of both worlds!
  • Most loops have a backward branch at the bottom; those will be predicted taken
  • Other (non-loop) branches will be predicted not-taken.
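To compare the three static schemes, one can replay a branch trace and count wrong guesses. The sketch below does this on a made-up trace (one loop branch with a negative offset and one forward branch); the trace, offsets, and function names are invented for illustration and say nothing about real workloads.

```python
def predict_taken(offset):
    return True


def predict_not_taken(offset):
    return False


def predict_btfnt(offset):
    # Backward (negative offset) taken, forward not taken.
    return offset < 0


def misprediction_rate(predictor, trace):
    """Fraction of (offset, taken) records the predictor gets wrong."""
    wrong = sum(1 for offset, taken in trace if predictor(offset) != taken)
    return wrong / len(trace)


# Made-up trace: a loop branch (offset -16) taken 9 times and then
# falling through once, plus a forward branch (offset +8) taken 2 of 10 times.
trace = ([(-16, True)] * 9 + [(-16, False)]
         + [(8, False)] * 8 + [(8, True)] * 2)

for name, predictor in [("taken", predict_taken),
                        ("not-taken", predict_not_taken),
                        ("BTFNT", predict_btfnt)]:
    print(f"{name}: {misprediction_rate(predictor, trace):.2f}")
```

On this particular trace BTFNT mispredicts least, because it gets the loop branch mostly right like predict-taken and the forward branch mostly right like predict-not-taken.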
![Page 156: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/156.jpg)
Basic Pipeline Recap
• The PC is required in Fetch
• For branches, it’s not known until decode.
[Datapath diagram: the pipelined MIPS datapath (PC, Instruction Memory, Register File, sign extend, ALU, Data Memory) with a comparator (CMP) producing the jump/branch signals. Caption: Branches only, one delay slot, simplified ISA, no control.]
Callout on the diagram: “Should this be here?”
![Page 162: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/162.jpg)
Supporting Speculation
• Predict: Compute all possible next PCs in fetch. Choose one.
  • The correct next PC is known in decode
• Flush as needed: Replace “wrong path” instructions with no-ops.
[Datapath diagram: the pipelined datapath extended with a BranchPredict module, a three-input mux on the PC path, and muxes with a nop input on the pipeline registers. Callouts across the animation builds: “PC Calculation” and “Squash/Flush Control”.]
![Page 163: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/163.jpg)
Implementing Backward taken/forward not taken (BTFNT)
• A new “branch predictor” module determines what guess we are going to make.
• The BTFNT branch predictor has two inputs
  • The sign of the offset -- to make the prediction
  • The branch signal from the comparator -- to check if the prediction was correct.
• And two outputs
  • The PC mux selector
    • Steers execution in the predicted direction
    • Re-directs execution when the branch resolves.
  • A mis-predict signal that causes control to flush the pipe.
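The predictor module described above can be sketched as two small functions, one for each moment it acts: when the branch is fetched and when the comparator resolves it. This is a behavioral sketch only; the function names and the boolean encoding of the signals are assumptions, and the actual mux select encoding is not recoverable from the slide.

```python
def btfnt_predict(offset_is_negative):
    """Prediction made at fetch: backward branches are predicted
    taken, forward branches are predicted not taken."""
    return offset_is_negative


def btfnt_resolve(predicted_taken, branch_taken):
    """Check made when the comparator resolves the branch.

    Returns (mispredict, redirect_taken):
      mispredict     -- tells control to flush the wrong-path instruction
      redirect_taken -- the direction the PC mux should now steer toward
    """
    mispredict = predicted_taken != branch_taken
    return mispredict, branch_taken


# A loop-closing branch (negative offset) that finally falls through.
guess = btfnt_predict(offset_is_negative=True)
print(guess)                                     # True: predicted taken
print(btfnt_resolve(guess, branch_taken=False))  # (True, False): flush, go to PC + 4
```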
![Page 165: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/165.jpg)
Performance Impact (ex 1)
• ET = I * CPI * CT
• Backward taken/forward not taken (BTFNT) is 80% accurate
• Branches are 20% of instructions
• Changing the front end increases the cycle time by 10%
• What is the speedup of BTFNT compared to just stalling on every branch?
• BTFNT
  • CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04
  • CT = 1.1
  • IC = IC (unchanged)
  • ET = 1.04 * 1.1 = 1.144
• Stall
  • CPI = 0.2*2 + 0.8*1 = 1.2
  • CT = 1
  • IC = IC (unchanged)
  • ET = 1.2
• Speedup = 1.2/1.144 = 1.05
100
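The arithmetic in this example can be checked with a few lines of C. This is a sketch under the slide's assumptions (base CPI of 1, instruction count unchanged); the function and parameter names are made up for illustration.

```c
/* CPI when only mispredicted branches pay the penalty. */
double cpi_predict(double branch_frac, double mispredict_rate, double penalty) {
    double mispredicted = branch_frac * mispredict_rate;
    return mispredicted * (1.0 + penalty) + (1.0 - mispredicted) * 1.0;
}

/* CPI when every branch stalls for the penalty. */
double cpi_stall(double branch_frac, double penalty) {
    return branch_frac * (1.0 + penalty) + (1.0 - branch_frac) * 1.0;
}

/* Speedup of prediction over stalling. The instruction count cancels,
 * and the stalling design's cycle time is normalized to 1. */
double btfnt_speedup(double branch_frac, double mispredict_rate,
                     double penalty, double ct_increase) {
    double et_btfnt = cpi_predict(branch_frac, mispredict_rate, penalty)
                      * (1.0 + ct_increase);
    double et_stall = cpi_stall(branch_frac, penalty) * 1.0;
    return et_stall / et_btfnt;
}
```

Calling `btfnt_speedup(0.2, 0.2, 1.0, 0.1)` reproduces the 1.2/1.144 ratio worked out above.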
![Page 166: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/166.jpg)
The Branch Delay Penalty
• The number of cycles between fetch and branch resolution is called the “branch delay penalty”
  • It is the number of instructions that get flushed on a misprediction.
  • It is the number of extra cycles the branch gets charged (i.e., the CPI for mispredicted branches goes up by the penalty).
101
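Written as a formula, the penalty enters the CPI as follows. The symbols are introduced here for convenience and do not appear on the slides: $f_b$ is the fraction of instructions that are branches, $m$ is the misprediction rate, and $P$ is the branch delay penalty in cycles, with a base CPI of 1.

```latex
\begin{aligned}
\text{CPI} &= f_b\, m\,(1 + P) + (1 - f_b\, m)\cdot 1 = 1 + f_b\, m\, P \\
\text{ET}  &= I \times \text{CPI} \times \text{CT}
\end{aligned}
```

With $f_b = 0.2$, $m = 0.2$, and $P = 1$ this gives $\text{CPI} = 1 + 0.04 = 1.04$.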
![Page 167: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/167.jpg)
Performance Impact (ex 2)
• ET = I * CPI * CT
• Our current design resolves branches in decode, so the branch delay penalty is 1 cycle.
• If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance?
  • Mispredict rate = 20%
  • Branches are 20% of instructions
• Resolve in decode
  • CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04
  • CT = 1
  • IC = IC (unchanged)
  • ET = 1.04
• Resolve in execute
  • CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08
  • CT = 0.8
  • IC = IC (unchanged)
  • ET = 1.08 * 0.8 = 0.864
• Speedup = 1.04/0.864 = 1.2, so resolving in execute helps.
102
![Page 169: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/169.jpg)
The Importance of Pipeline Depth
• There are two important parameters of the pipeline that determine the impact of branches on performance
  • Branch decode time: how many cycles it takes to identify a branch (in our case, this is less than 1)
  • Branch resolution time: the number of cycles until the real branch outcome is known (in our case, this is 2 cycles)
104
![Page 170: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/170.jpg)
Pentium 4 pipeline
• Branches take 19 cycles to resolve
• Identifying a branch takes 4 cycles.
• Stalling is not an option.
• 80% branch prediction accuracy is also not an option.
• Not quite as bad now, but branch prediction is still very important.
![Page 172: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/172.jpg)
Performance Impact (ex 1)
• ET = I * CPI * CT
• Same setup as before: BTFNT is 80% accurate, branches are 20% of instructions, and changing the front end increases the cycle time by 10%
• BTFNT: CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04, CT = 1.1, ET = 1.144
• Stall: CPI = 0.2*2 + 0.8*1 = 1.2, CT = 1, ET = 1.2
• Speedup = 1.2/1.144 = 1.05
• What if the branch delay penalty (the second 1 in “(1 + 1)”) were 20 instead of 1?
• Branches are relatively infrequent (~20% of instructions), but Amdahl’s Law tells us that we can’t completely ignore this uncommon case.
106
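To explore the “what if” question, the sketch below recomputes CPI for an arbitrary penalty and prediction accuracy, and inverts the formula to find the accuracy needed for a target CPI. It assumes a base CPI of 1; both function names are hypothetical.

```c
/* CPI for a predictor with the given accuracy: each mispredicted
 * branch costs `penalty` extra cycles on top of a base CPI of 1.
 * An accuracy of 0 models stalling on every branch. */
double cpi_for_accuracy(double branch_frac, double accuracy, double penalty) {
    return 1.0 + branch_frac * (1.0 - accuracy) * penalty;
}

/* Accuracy needed to hold CPI at `target_cpi` for a given penalty. */
double accuracy_needed(double branch_frac, double penalty, double target_cpi) {
    return 1.0 - (target_cpi - 1.0) / (branch_frac * penalty);
}
```

With a 20-cycle penalty and 20% branches, 80% accuracy gives a CPI of 1.8 and stalling gives 5.0; keeping the CPI at 1.04 would take 99% accuracy.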
![Page 173: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/173.jpg)
Dynamic Branch Prediction
• Long pipes demand higher accuracy than static schemes can deliver.
• Instead of making the guess once (i.e., statically), make it every time we see the branch.
• Many ways to predict dynamically
  • We will focus on predicting future behavior based on past behavior
107
![Page 175: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/175.jpg)
Predictable control
• Use previous branch behavior to predict future branch behavior.
• When is branch behavior predictable?
  • Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch. All 10 are pretty predictable.
  • Run-time constants
    • Foo(int v) { for (i = 0; i < 1000; i++) {if (v) {...}}}
    • The branch is always taken or not taken.
  • Correlated control
    • a = 10; b = <something usually larger than a>
    • if (a > 10) {}
    • if (b > 10) {}
  • Function calls
    • LibraryFunction() -- Converts to a jr (jump register) instruction, but it’s always the same.
    • BaseClass * t; // t is usually of a subclass, SubClass
    • t->SomeVirtualFunction() // will usually call the same function
109
![Page 178: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/178.jpg)
Dynamic Predictor 1: The Simplest Thing
• Predict that this branch will go the same way as the previous branch did.
• Pros: Dead simple. Keep a bit in the fetch stage. Works OK for simple loops. The compiler might be able to arrange things to make it work better.
• Cons: An unpredictable branch in a loop will mess everything up. It can’t tell the difference between branches.
110
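A minimal sketch of this predictor, assuming a single shared bit that is trained on every branch regardless of its address. The struct and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Single shared bit: the outcome of the most recent branch,
 * whichever branch that was. */
typedef struct {
    bool last_taken;
} LastOutcomePredictor;

/* Predicts each outcome from the bit, then trains on the real outcome.
 * Returns the number of mispredictions over the sequence. */
size_t count_mispredicts(LastOutcomePredictor *p, const bool *outcomes, size_t n) {
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (p->last_taken != outcomes[i]) {
            misses++;
        }
        p->last_taken = outcomes[i];
    }
    return misses;
}
```

Starting from "not taken", a 10-iteration loop branch (9 taken, 1 not taken) costs 2 mispredictions, while a strictly alternating sequence is mispredicted every time, which is the "unpredictable branch messes everything up" case.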
![Page 182: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/182.jpg)
Dynamic Prediction 2: A table of bits
• Give each branch its own bit in a table
• Look up the prediction bit for the branch
• How big does the table need to be? Infinite! Bigger is better, but don’t mess with the cycle time. Index into it using the low-order bits of the PC.
• Pros: It can differentiate between branches. Bad behavior by one won’t mess up others... mostly.
• Cons: Accuracy is still not great.
111
![Page 183: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/183.jpg)
Branch Prediction Trick #1
• Associating prediction state with a particular branch.
• We would like to keep separate prediction state for every static branch.
  • In practice this is not possible, since there is a potentially unbounded number of branches
• Instead, we use a heuristic to associate prediction state with a branch
  • The simplest heuristic is to use the low-order bits of the PC to select the prediction state.
[Figure: the low-order bits of the PC index a table of predictor state, which outputs the prediction.]
112
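The low-order-bits heuristic amounts to a mask. The helper below is hypothetical; `index_bits` is the base-2 log of the table size and must be less than 32.

```c
#include <stdint.h>

/* Select a predictor-table entry from the low-order bits of the PC.
 * A table with 2^index_bits entries uses index_bits bits of the PC. */
uint32_t predictor_index(uint32_t pc, unsigned index_bits) {
    return pc & ((1u << index_bits) - 1u);
}
```

With a 16-entry table, `predictor_index(0x100A, 4)` is 0xA, and so is `predictor_index(0x200A, 4)`: two branches whose low bits match share one entry, which is why this is only a heuristic. Since MIPS instructions are word-aligned, a design might shift the PC right by 2 before masking; that variation is not shown here.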
![Page 194: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/194.jpg)
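Dynamic Prediction 2: A table of bits
• What’s the accuracy for the branch?
while (1) {
  for(j = 0; j < 4; j++) { // branch at address 0x100A
  }
}

| iteration | actual | prediction | new prediction |
|---|---|---|---|
| 1 | taken | not taken | taken |
| 2 | taken | taken | taken |
| 3 | taken | taken | taken |
| 4 | not taken | taken | not taken |
| 1 | taken | not taken | taken |
| 2 | taken | taken | taken |
| 3 | taken | taken | taken |

• Accuracy is 50%: 2 mispredictions per pass through the inner loop.
[Figure: the index 0xA selects an entry in a table of 16 “last direction bits”, which supplies the prediction.]
113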
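The table above can be reproduced with a short simulation. This is a sketch, assuming a 16-entry table indexed by the low 4 bits of the PC and entries that start as "not taken"; the function names are invented here.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_ENTRIES 16u  /* matches the 16-entry table in the example */

static bool last_direction[TABLE_ENTRIES];  /* zero-initialized: not taken */

/* Predict with the branch's table entry, then record the real outcome.
 * Returns true when the prediction was wrong. */
bool predict_and_update(uint32_t pc, bool taken) {
    uint32_t idx = pc & (TABLE_ENTRIES - 1u);
    bool mispredicted = (last_direction[idx] != taken);
    last_direction[idx] = taken;
    return mispredicted;
}

/* Runs `passes` passes of the 4-iteration inner loop (taken three
 * times, then not taken) and returns the total mispredictions. */
unsigned simulate_inner_loop(unsigned passes) {
    unsigned misses = 0;
    for (unsigned p = 0; p < passes; p++) {
        for (unsigned j = 0; j < 4; j++) {
            misses += predict_and_update(0x100Au, j < 3);
        }
    }
    return misses;
}
```

Ten passes give 20 mispredictions out of 40 branches: the first and last iteration of every pass are wrong.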
![Page 195: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/195.jpg)
Dynamic prediction 3: A table of counters
• Instead of a single bit, keep two. This gives four possible states
• Taken branches move the state to the right. Not-taken branches move it to the left.
• From a strong state, the predictor tolerates one misprediction before it changes its prediction
[Figure: state diagram. States from left to right: strongly not taken, weakly not taken, weakly taken, strongly taken. A taken branch moves one state to the right; a not-taken branch moves one state to the left. The two “not taken” states predict not taken; the two “taken” states predict taken.]
114
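The four-state machine is a 2-bit saturating counter. A minimal sketch, with the enum and function names chosen here for illustration:

```c
#include <stdbool.h>

/* States ordered left to right, as in the state diagram. */
typedef enum {
    STRONGLY_NOT_TAKEN = 0,
    WEAKLY_NOT_TAKEN   = 1,
    WEAKLY_TAKEN       = 2,
    STRONGLY_TAKEN     = 3
} CounterState;

/* The two right-hand states predict taken. */
bool counter_predicts_taken(CounterState s) {
    return s >= WEAKLY_TAKEN;
}

/* Taken moves one state right, not taken one state left, saturating
 * at both ends. */
CounterState counter_update(CounterState s, bool taken) {
    if (taken) {
        return s == STRONGLY_TAKEN ? s : (CounterState)(s + 1);
    }
    return s == STRONGLY_NOT_TAKEN ? s : (CounterState)(s - 1);
}
```

From `STRONGLY_TAKEN`, one not-taken outcome leaves the prediction at taken (`WEAKLY_TAKEN`); only a second consecutive not-taken outcome flips it.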
![Page 198: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/198.jpg)
Dynamic Prediction 3: A table of counters
• What’s the accuracy for the inner loop’s branch? (start in weakly taken)
for(i = 0; i < 10; i++) {
  for(j = 0; j < 4; j++) {
  }
}

| iteration | actual | state | prediction | new state |
|---|---|---|---|---|
| 1 | taken | weakly taken | taken | strongly taken |
| 2 | taken | strongly taken | taken | strongly taken |
| 3 | taken | strongly taken | taken | strongly taken |
| 4 | not taken | strongly taken | taken | weakly taken |
| 1 | taken | weakly taken | taken | strongly taken |
| 2 | taken | strongly taken | taken | strongly taken |
| 3 | taken | strongly taken | taken | strongly taken |

• Misprediction rate is 25% (75% accuracy): 1 misprediction per pass through the inner loop.
115
![Page 199: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/199.jpg)
Two-bit Prediction
• The two-bit prediction scheme is used very widely and in many ways.
  • Make a table of 2-bit predictors
  • Devise a way to associate a 2-bit predictor with each dynamic branch
  • Use the 2-bit predictor for each branch to make the prediction.
• In the previous example we associated the predictors with branches using the PC.
  • We’ll call this “per-PC” prediction.
116
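Per-PC prediction can be sketched as a table of 2-bit counters indexed by the low-order PC bits. The 16-entry size, the function names, and the branch address 0x100A (borrowed from the earlier table-of-bits example) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 16u                 /* assumed table size */

static uint8_t counters[ENTRIES];   /* values 0..3; 2 and 3 predict taken */

/* Put every counter in the same starting state. */
void reset_counters(uint8_t state) {
    for (unsigned i = 0; i < ENTRIES; i++) {
        counters[i] = state;
    }
}

/* Per-PC prediction: returns true on a misprediction, then trains the
 * counter that belongs to this branch. */
bool per_pc_predict_and_update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pc & (ENTRIES - 1u)];
    bool mispredicted = ((*c >= 2) != taken);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    return mispredicted;
}

/* Inner loop from the example: taken 3 times, then not taken. */
unsigned inner_loop_misses(unsigned outer_iterations) {
    unsigned misses = 0;
    for (unsigned i = 0; i < outer_iterations; i++) {
        for (unsigned j = 0; j < 4; j++) {
            misses += per_pc_predict_and_update(0x100Au, j < 3);
        }
    }
    return misses;
}
```

After `reset_counters(2)` (weakly taken), `inner_loop_misses(10)` returns 10: one misprediction per pass, on the loop exit.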
![Page 200: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/200.jpg)
Associating 2-bit Predictors with Branches: Using the low-order PC bits
• When is branch behavior predictable?
  • Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch. All 10 are pretty predictable.
  • Run-time constants
    • Foo(int v) { for (i = 0; i < 1000; i++) {if (v) {...}}}
    • The branch is always taken or not taken.
  • Correlated control
    • a = 10; b = <something usually larger than a>
    • if (a > 10) {}
    • if (b > 10) {}
  • Function calls
    • LibraryFunction() -- Converts to a jr (jump register) instruction, but it’s always the same.
    • BaseClass * t; // t is usually of a subclass, SubClass
    • t->SomeVirtualFunction() // will usually call the same function
117
Good
OK -- we miss one per loop
Poor -- no help
Not applicable
![Page 202: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/202.jpg)
Predicting Loop Branches Revisited
• What’s the pattern we need to identify?
while (1) {
  for(j = 0; j < 4; j++) {
  }
}

| iteration | actual |
|---|---|
| 1 | taken |
| 2 | taken |
| 3 | taken |
| 4 | not taken |
| 1 | taken |
| 2 | taken |
| 3 | taken |
| 4 | not taken |
| 1 | taken |
| 2 | taken |
| 3 | taken |
| 4 | not taken |

118
![Page 205: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/205.jpg)
Dynamic prediction 4: Global branch history
• Instead of using the PC to choose the predictor, use a bit vector made up of the previous branch outcomes.

| iteration | actual | branch history | steady state prediction |
|---|---|---|---|
| 1 | taken | 11111 | |
| 2 | taken | 11111 | |
| 3 | taken | 11111 | |
| 4 | not taken | 11111 | |
| outer loop branch | taken | 11110 | taken |
| 1 | taken | 11101 | taken |
| 2 | taken | 11011 | taken |
| 3 | taken | 10111 | taken |
| 4 | not taken | 01111 | not taken |
| outer loop branch | taken | 11110 | taken |
| 1 | taken | 11101 | taken |
| 2 | taken | 11011 | taken |
| 3 | taken | 10111 | taken |
| 4 | not taken | 01111 | not taken |
| outer loop branch | taken | 11110 | taken |
| 1 | taken | 11101 | taken |
| 2 | taken | 11011 | taken |
| 3 | taken | 10111 | taken |
| 4 | not taken | 01111 | not taken |

• Nearly perfect
119
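A rough simulation of this scheme, assuming the history register indexes a table of 2-bit counters that all start in weakly taken (the slides do not say what the table holds, so this is one plausible arrangement). The history length is a parameter between 1 and 10 bits; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_HISTORY_BITS 10u

typedef struct {
    unsigned bits;                             /* history length, 1..10 */
    uint32_t history;                          /* newest outcome in bit 0 */
    uint8_t counters[1u << MAX_HISTORY_BITS];  /* 2-bit counters, 0..3 */
} GlobalPredictor;

void gp_init(GlobalPredictor *gp, unsigned bits) {
    gp->bits = bits;
    gp->history = (1u << bits) - 1u;   /* start at all 1s, as in the table */
    for (unsigned i = 0; i < (1u << MAX_HISTORY_BITS); i++) {
        gp->counters[i] = 2;           /* weakly taken (an assumption) */
    }
}

/* Returns true on a misprediction, then trains counter and history. */
bool gp_predict_and_update(GlobalPredictor *gp, bool taken) {
    uint8_t *c = &gp->counters[gp->history];
    bool mispredicted = ((*c >= 2) != taken);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    gp->history = ((gp->history << 1) | (taken ? 1u : 0u))
                  & ((1u << gp->bits) - 1u);
    return mispredicted;
}

/* Nested loops: the inner branch is taken k-1 times and then not
 * taken, followed by one taken outer-loop branch. Counts misses in the
 * final `measured` outer iterations, after `warmup` training ones. */
unsigned nested_loop_misses(unsigned bits, unsigned k,
                            unsigned warmup, unsigned measured) {
    GlobalPredictor gp;
    gp_init(&gp, bits);
    unsigned misses = 0;
    for (unsigned i = 0; i < warmup + measured; i++) {
        for (unsigned j = 0; j < k; j++) {
            bool miss = gp_predict_and_update(&gp, j + 1 < k);
            if (i >= warmup) misses += miss;
        }
        bool miss = gp_predict_and_update(&gp, true);  /* outer branch */
        if (i >= warmup) misses += miss;
    }
    return misses;
}
```

With 5 bits of history and the 4-iteration inner loop, every history value seen in steady state maps to exactly one outcome, so `nested_loop_misses(5, 4, 10, 100)` is 0. With only 2 bits of history and an 8-iteration inner loop, the all-1s history precedes both taken and not-taken outcomes, and the loop exit is mispredicted once per outer iteration.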
![Page 207: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/207.jpg)
Dynamic prediction 4: Global branch history
• How long should the history be?
  • Infinite is a bad choice. We would learn nothing.
• Imagine N bits of history and a loop that executes K iterations
  • If K <= N, history will do well.
  • If K > N, history will do poorly, since the history register will always be all 1’s for the last K-N iterations. We will mispredict the last branch.
120
![Page 208: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/208.jpg)
Associating Predictors with Branches: Global history
• When is branch behavior predictable?
  • Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch. All 10 are pretty predictable.
  • Run-time constants
    • Foo(int v) { for (i = 0; i < 1000; i++) {if (v) {...}}}
    • The branch is always taken or not taken.
  • Correlated control
    • a = 10; b = <something usually larger than a>
    • if (a > 10) {}
    • if (b > 10) {}
  • Function calls
    • LibraryFunction() -- Converts to a jr (jump register) instruction, but it’s always the same.
    • BaseClass * t; // t is usually of a subclass, SubClass
    • t->SomeVirtualFunction() // will usually call the same function
121
Not so great
Good
Pretty good, as long as the history is not too long
Not applicable
![Page 209: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/209.jpg)
Other ways of identifying branches
• Use local branch history
  • Use a table of history registers (say 128), indexed by the low-order bits of the PC.
  • Also use the PC to choose between 128 tables, each indexed by the history for that branch.
• For loops this does better than global history.
  • Foo() { for(i = 0; i < 4; i++){ } }
  • If Foo is called from many places, the global history will be polluted.
[Figure: the PC indexes a table of history registers and selects one of the predictor tables (predictor table 0, predictor table 1, ...); the selected history register indexes that predictor table to produce the prediction.]
122
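The two-level structure in the figure can be sketched as follows. The 128 entries come from the slide; the 4-bit history length, the 2-bit counters in each predictor table, the branch addresses, and all names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define BRANCH_ENTRIES 128u   /* history registers, as on the slide */
#define HISTORY_BITS   4u     /* assumed history length per branch */
#define PATTERNS       (1u << HISTORY_BITS)

/* Both tables are zero-initialized: empty histories, and counters in
 * the "strongly not taken" state. */
static uint8_t local_history[BRANCH_ENTRIES];
static uint8_t pattern_table[BRANCH_ENTRIES][PATTERNS];

/* Returns true on a misprediction, then trains counter and history. */
bool local_predict_and_update(uint32_t pc, bool taken) {
    uint32_t b = pc & (BRANCH_ENTRIES - 1u);   /* low-order PC bits */
    uint8_t *c = &pattern_table[b][local_history[b]];
    bool mispredicted = ((*c >= 2) != taken);
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    local_history[b] = (uint8_t)(((local_history[b] << 1) | (taken ? 1u : 0u))
                                 & (PATTERNS - 1u));
    return mispredicted;
}

/* Repeated calls to a 4-iteration loop, each followed by an unrelated
 * branch at another PC. Returns the loop branch's misses during the
 * last `measured` calls. */
unsigned foo_loop_misses(unsigned warmup, unsigned measured) {
    unsigned misses = 0;
    for (unsigned call = 0; call < warmup + measured; call++) {
        for (unsigned j = 0; j < 4; j++) {
            bool miss = local_predict_and_update(0x100Au, j < 3);
            if (call >= warmup) misses += miss;
        }
        /* Caller-side branch with a varying outcome. It uses a
         * different history register, so it cannot pollute the loop's. */
        local_predict_and_update(0x2004u, (call % 3u) == 0u);
    }
    return misses;
}
```

After a few training calls, the loop branch is predicted perfectly no matter what the unrelated branch does, which is the advantage over a shared global history.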
![Page 211: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/211.jpg)
Other Ways of Identifying Branches
• All these schemes have different pros and cons and will work better or worse for different branches.
• How do we get the best of all possible worlds?
• Build them all, and have a predictor to decide which one to use on a given branch
  • For each branch, make all the different predictions, and keep track of which predictor is most often correct.
  • For future branches use the prediction from that predictor.
124
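One way to "keep track of which predictor is most often correct" is a 2-bit chooser counter per branch, sketched below for two component predictors. The slides do not specify the mechanism, so the chooser encoding, table size, and names are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHOOSER_ENTRIES 128u   /* assumed size */

/* 2-bit chooser per entry: 0-1 trust predictor A, 2-3 trust predictor B.
 * Zero-initialized, so every branch starts out trusting A. */
static uint8_t chooser[CHOOSER_ENTRIES];

/* Pick which component prediction to use for this branch. */
bool choose_prediction(uint32_t pc, bool pred_a, bool pred_b) {
    return chooser[pc & (CHOOSER_ENTRIES - 1u)] >= 2 ? pred_b : pred_a;
}

/* After the branch resolves, nudge the chooser toward whichever
 * predictor was right. When both agree there is nothing to learn. */
void train_chooser(uint32_t pc, bool pred_a, bool pred_b, bool taken) {
    uint8_t *c = &chooser[pc & (CHOOSER_ENTRIES - 1u)];
    if (pred_a == pred_b) {
        return;
    }
    if (pred_b == taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

In use, `pred_a` and `pred_b` would come from two of the schemes above (for example per-PC and global history); after B is right and A is wrong twice for a branch, the chooser switches that branch to B.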
![Page 212: Implementing a MIPS Processor• Break up the logic with latches into “pipeline stages” • Each stage can act on different data • Latches hold the inputs to their stage •](https://reader034.vdocuments.net/reader034/viewer/2022042022/5e794727e77e7a7752130752/html5/thumbnails/212.jpg)
End for Today
125