VII. Process Pipelining
Computer Architecture and Operating Systems: 725G84
Ahmed Rezine
Computer Architecture & Operating Systems (725G84 - HT16)
Components of a Computer
Recap: the stages of the datapath
We can look at the datapath as five stages, one step per stage:
  IF:  Instruction fetch from memory
  ID:  Instruction decode & register read
  EX:  Execute operation or calculate address
  MEM: Access memory operand
  WB:  Write result back to register
The 5 Stages
Single Cycle: Performance
Assume the time for the stages is:
  100ps for register read or write
  200ps for the other stages
What is the clock cycle time?

  Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
  lw        200ps        100ps          200ps   200ps          100ps           800ps
  sw        200ps        100ps          200ps   200ps                          700ps
  R-format  200ps        100ps          200ps                  100ps           600ps
  beq       200ps        100ps          200ps                                  500ps
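The totals in the table can be recomputed from the stage latencies. A minimal sketch (the dictionary layout and function name are illustrative, not from the slides; the stage times and per-instruction stage usage follow the table above):

```python
# Stage latencies from the slides: 100ps for register read/write, 200ps otherwise.
STAGE_TIME = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which of the five stages each instruction class actually uses.
STAGES_USED = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}

def total_time(instr):
    """Single-cycle time for one instruction: sum of the stages it uses."""
    return sum(STAGE_TIME[s] for s in STAGES_USED[instr])

for instr in STAGES_USED:
    print(instr, total_time(instr), "ps")
# lw 800 ps, sw 700 ps, R-format 600 ps, beq 500 ps
```

The single-cycle clock must accommodate the slowest instruction, so the clock cycle time is 800ps.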
Performance Issues
The longest delay determines the clock period. The critical path is the load instruction:
  instruction memory → register file → ALU → data memory → register file
Performance Issues
The single-cycle design violates the design principle "make the common case fast". Improve performance by pipelining: overlap the stages!
Pipeline
Clock Cycle
Even if some stages take only 100ps instead of 200ps, the pipelined clock cycle must be the worst-case stage time: 200ps.
Speedup from Pipelining
If all stages are balanced, i.e., all take the same time:

  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of stages

If the stages are not balanced, the speedup is less. Look at our example:
  Time between instructions (non-pipelined) = 800ps
  Number of stages = 5
  Time between instructions (pipelined) = 800/5 = 160ps
But what did we get? The clock is limited by the slowest stage, so the actual time between pipelined instructions is 200ps, not 160ps.
Also read the text on p. 334.
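The speedup arithmetic above can be written out directly (all numbers are from the slides; the variable names are illustrative):

```python
# Single-cycle time between instructions, from the critical path (lw).
non_pipelined = 800    # ps
n_stages = 5

# Ideal case: perfectly balanced stages.
ideal_pipelined = non_pipelined / n_stages    # 160 ps

# Actual case: the slowest stage (200 ps) limits the clock cycle.
actual_pipelined = 200    # ps

print("ideal speedup: ", non_pipelined / ideal_pipelined)   # 5.0
print("actual speedup:", non_pipelined / actual_pipelined)  # 4.0
```

So the unbalanced stages cost us a factor of 160/200 relative to the ideal five-fold speedup.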
Speedup from Pipelining
Recall from Chapter 1: throughput versus latency. What does pipelining improve? The speedup from pipelining is due to increased throughput; latency (the time for each instruction) does not decrease.
Hazards
Situations that prevent starting the next instruction in the next cycle are called hazards. There are three kinds of hazards:
  Structural hazards
  Data hazards
  Control hazards
Structural Hazards
When an instruction cannot execute in its proper cycle due to a conflict over the use of a resource.
Example of structural hazards
Imagine a MIPS pipeline with a single shared data and instruction memory, and a fourth instruction in the pipeline. When this fourth instruction tries to fetch from instruction memory, the first instruction is fetching from data memory, leading to a hazard. Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches.
Data Hazards
When an instruction cannot execute in its proper cycle because it depends on the completion of a data access by a previous instruction.
Data Hazards
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
For the above code, when can we start the second instruction?
[Pipeline diagram: add and sub each pass through the IM, Reg, DM, WB stages; bubbles delay sub until add has written $s0 back to the register file.]
Pipeline stalls
A bubble, or pipeline stall, is used to resolve a hazard. But it wastes cycles and degrades performance. We solve the performance problem by 'forwarding' the required data.
Forwarding (aka Bypassing)
Use a result as soon as it is computed: don't wait for it to be stored in a register. This requires extra connections in the datapath.
Forwarding (aka Bypassing)
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
For the above code, draw the datapath with forwarding.
Load-Use Data Hazard
We can't always avoid stalls by forwarding: if the value has not yet been computed when it is needed, we must stall.
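The distinction can be sketched with a toy hazard check. The (op, destination, sources) tuple encoding of an instruction is a hypothetical convenience, not from the slides; the rule itself matches the discussion above: with full forwarding, only a load immediately followed by a use of its result still forces a bubble.

```python
def stalls_with_forwarding(prog):
    """Count bubbles a 5-stage MIPS-style pipeline with full forwarding
    still needs: one whenever an instruction uses a register loaded by
    the immediately preceding lw (the load-use hazard)."""
    stalls = 0
    for prev, cur in zip(prog, prog[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

# add's result is forwarded straight to sub's EX stage: no stall needed.
alu_pair = [("add", "$s0", ["$t0", "$t1"]),
            ("sub", "$t2", ["$s0", "$t3"])]

# lw's result is only available after MEM: one bubble even with forwarding.
load_use = [("lw",  "$t1", ["$t0"]),
            ("add", "$t3", ["$t1", "$t2"])]

print(stalls_with_forwarding(alu_pair))  # 0
print(stalls_with_forwarding(load_use))  # 1
```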
Code Scheduling to Avoid Stalls
Reorder code to avoid using a load result in the next instruction.
C code: A = B + E; C = B + F;

Original order (13 cycles, two stalls):
  lw   $t1, 0($t0)
  lw   $t2, 4($t0)
  (stall)
  add  $t3, $t1, $t2
  sw   $t3, 12($t0)
  lw   $t4, 8($t0)
  (stall)
  add  $t5, $t1, $t4
  sw   $t5, 16($t0)

Reordered (11 cycles, no stalls):
  lw   $t1, 0($t0)
  lw   $t2, 4($t0)
  lw   $t4, 8($t0)
  add  $t3, $t1, $t2
  sw   $t3, 12($t0)
  add  $t5, $t1, $t4
  sw   $t5, 16($t0)
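The 13- and 11-cycle counts can be reproduced with a toy cycle model (the (op, destination, sources) tuple encoding is a hypothetical convenience; the model assumes a 5-stage pipeline with full forwarding, so only load-use pairs stall):

```python
def cycles(prog, n_stages=5):
    """Cycles to run prog on a 5-stage pipeline with forwarding:
    pipeline fill (n_stages - 1) + one cycle per instruction
    + one bubble per load-use pair."""
    stalls = sum(1 for prev, cur in zip(prog, prog[1:])
                 if prev[0] == "lw" and prev[1] in cur[2])
    return (n_stages - 1) + len(prog) + stalls

original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # uses $t2 right after its lw: stall
    ("sw",  "$t3", ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # uses $t4 right after its lw: stall
    ("sw",  "$t5", ["$t5", "$t0"]),
]

# Hoist the third lw so no load result is used in the very next instruction.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(cycles(original))   # 13
print(cycles(reordered))  # 11
```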
Control Hazards
When the proper instruction cannot execute in the proper pipeline clock cycle, because the instruction that was fetched is not the one that is needed.
Control Hazards
A branch determines the flow of control, so fetching the next instruction depends on the branch outcome.
Naïve/simple solution: stall whenever we fetch a branch instruction.
Stalling for Branch Hazards
[Pipeline diagram over CC1-CC8: beq $4, $0, there is fetched, then bubbles stall the pipeline before and $12, $2, $5 and the following or/add/sw instructions enter it.]
Seems wasteful, particularly when the branch isn’t taken.
What if we predict?
Branch prediction
Correct branch prediction is very important and can produce substantial performance improvements. Speculative execution means that instructions are executed before the processor is certain that they are on the correct execution path.
Branch prediction strategies: static prediction and dynamic prediction.
If the prediction turns out to be correct, execution goes on without any branch penalty. If, however, the prediction is not fulfilled, the instructions started in advance and all their associated data must be purged, and the state previous to their execution restored.
Static branch prediction
Static prediction techniques do not take execution history into consideration. Static approaches:
  Predict never taken: assumes that the branch is not taken.
  Predict always taken: assumes that the branch is taken.
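A quick sketch of how the two static strategies fare. The outcome stream is invented for illustration (a loop branch taken nine times, then falling through once); it is not from the slides:

```python
# Recorded branch outcomes: True = taken. A typical loop-closing branch
# is taken on every iteration except the last.
outcomes = [True] * 9 + [False]

# "Predict always taken" is right whenever the branch is taken.
acc_always_taken = sum(outcomes) / len(outcomes)            # 0.9

# "Predict never taken" is right only when it falls through.
acc_never_taken = sum(not o for o in outcomes) / len(outcomes)  # 0.1

print(acc_always_taken, acc_never_taken)
```

For backward loop branches like this one, always-taken is clearly the better static guess.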
Assume Branch Not Taken
This works pretty well when you're right.
[Pipeline diagram over CC1-CC8: beq $4, $0, there is followed by and $12, $2, $5 and the or/add/sw instructions, which proceed through the IM, Reg, DM, WB stages without stalls.]
Assume Branch Not Taken
Same performance as stalling when you're wrong.
[Pipeline diagram over CC1-CC8: the branch is taken, so the wrongly fetched and/or/add instructions on the wrong path are flushed, and there: sub $12, $4, $2 is fetched instead.]
Dynamic Branch Prediction
Dynamic approach: predict based on recent history. E.g., assume that the branch will do what it did the last time.
One-Bit Predictor
One bit is used to record the last execution; the system predicts the same behavior as the last time.
        ...
  LOOP: ...
        ...
        BNE R1, R2, LOOP
        ...
1-Bit Predictor: Shortcoming
Inner-loop branches are mispredicted twice!
  outer: ...
         ...
  inner: ...
         ...
         beq ..., ..., inner
         ...
         beq ..., ..., outer
Mispredict as taken on the last iteration of the inner loop; then mispredict as not taken on the first iteration of the inner loop the next time around.
2-Bit Predictor
Only change the prediction on two successive mispredictions.
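The two predictors can be compared on the nested-loop pattern above. A minimal sketch, assuming a saturating counter with states 0..3 for the 2-bit scheme and an invented outcome stream (an inner-loop branch taken four times per pass, for three outer passes):

```python
def mispredictions_1bit(outcomes):
    """1-bit predictor: predict whatever the branch did last time."""
    pred, miss = True, 0
    for taken in outcomes:
        if taken != pred:
            miss += 1
        pred = taken
    return miss

def mispredictions_2bit(outcomes):
    """2-bit saturating counter (states 0..3, predict taken when >= 2):
    the prediction flips only after two successive mispredictions."""
    state, miss = 3, 0
    for taken in outcomes:
        if taken != (state >= 2):
            miss += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return state and miss  # state is always truthy here only if > 0; see note

# Inner-loop branch: taken 4 times, not taken once, over 3 outer passes.
inner = ([True] * 4 + [False]) * 3

print(mispredictions_1bit(inner))  # 5: after pass 1, two misses per pass
print(mispredictions_2bit(inner))
```

Note: the 2-bit counter misses only on the loop exits (3 times here), because a single not-taken outcome moves it from "strongly taken" to "weakly taken" without flipping the prediction.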
Branch History Table
MIPS with Prediction Schemes
  Prediction correct:   no pipeline stalls.
  Prediction incorrect: the pipeline stalls and must be flushed; execution has to be restarted.
Recap: Hazards
Situations that prevent starting the next instruction in the next cycle:
  Structural hazards: a required resource is busy.
  Data hazards: need to wait for a previous instruction to complete its data read/write.
  Control hazards: deciding on a control action depends on a previous instruction.
Pipeline Summary
Pipelining improves performance by increasing instruction throughput: it executes multiple instructions in parallel, while each instruction keeps the same latency. Pipelining is subject to hazards: structural, data, and control. Instruction set design affects the complexity of the pipeline implementation (p. 335).