
COSC 6385 – Computer Architecture
Pipelining
Edgar Gabriel
Fall 2006
Some of the slides are based on a lecture by David Culler, University of California, Berkeley
http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

Instruction Set Architecture
• Relevant features for distinguishing ISAs
  – Internal storage
  – Memory addressing
  – Type and size of operands
  – Operations
  – Instructions for flow control
  – Encoding of the instruction set (IS)

Pipelining
• Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
  – Split an “expensive” operation into several sub-operations
  – Execute the sub-operations in a staggered manner
• Real world analogy: assembly line in car manufacturing
  – Each station is doing something different
  – Each station is working on a separate car
• Pipelining increases the throughput, but does not reduce the latency of an operation
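The throughput/latency distinction can be illustrated with a small cycle count. This is a minimal sketch, assuming an idealized pipeline without stalls in which every stage takes exactly one clock cycle; the function names are illustrative and not part of the lecture.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """The first instruction fills the pipeline; afterwards one instruction
    completes per cycle (assumption: no stalls)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    stages = 5
    for n in (1, 4, 1000):
        print(n, unpipelined_cycles(n, stages), pipelined_cycles(n, stages))
```

Under these assumptions a single instruction still needs 5 cycles (latency is unchanged), while 1000 instructions finish in 1004 cycles instead of 5000 (throughput approaches one instruction per cycle).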

Classes of instructions
• ALU instructions
  – Take either 2 registers as operands, or 1 register and one 16-bit immediate
  – Results are stored in a 3rd register
• Load and store instructions
• Branches and jumps

Typical implementation of an instruction (I)
1. Instruction fetch cycle (IF):
  • Send PC to memory
  • Fetch current instruction
  • Update PC to next sequential PC (+4 bytes)
2. Instruction decode/register fetch cycle (ID):
  • Decode instruction
  • Read registers corresponding to register source specifiers from register file
  • Sign extend offset fields if needed
  • Compute possible branch target address

Typical implementation of an instruction (II)
3. Execution/effective address cycle (EX):
  • ALU adds base register and offset to form effective address, or
  • ALU performs operations on the values read from register file, or
  • ALU performs operation on value read from register and sign-extended immediate
4. Memory access cycle (MEM):
  • If instruction is a load, read memory using the effective address computed in step 3
  • If instruction is a store, write the data from the second register read of the register file to the effective address
5. Write-back cycle (WB):
  • Write result into register file
    – From memory for a load instruction
    – From ALU for an ALU instruction
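The five steps can be traced in a small unpipelined interpreter. This is a sketch only: the tuple-based instruction format, the four supported operations (`add`, `addi`, `lw`, `sw`), and the dictionary-based memories are assumptions made for illustration, and branches are left out.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def step(pc, regs, imem, dmem):
    """Execute one instruction in five steps and return the next PC.

    Hypothetical instruction formats:
      ("add",  rd, rs1, rs2)    ("addi", rd, rs1, imm16)
      ("lw",   rd, base, imm16) ("sw",   rs_data, base, imm16)
    """
    # 1. IF: fetch the instruction and compute the next sequential PC
    instr = imem[pc]
    next_pc = pc + 4

    # 2. ID: decode, read source registers, sign extend the immediate
    op, a, b, c = instr
    operand1 = regs[b]
    operand2 = regs[c] if op == "add" else sign_extend16(c)
    store_data = regs[a] if op == "sw" else None

    # 3. EX: ALU result or effective address
    alu_out = operand1 + operand2

    # 4. MEM: only loads and stores touch data memory
    lmd = None  # load memory data
    if op == "lw":
        lmd = dmem[alu_out]
    elif op == "sw":
        dmem[alu_out] = store_data

    # 5. WB: write the result into the register file
    if op == "lw":
        regs[a] = lmd
    elif op in ("add", "addi"):
        regs[a] = alu_out
    return next_pc


def run(imem, regs, dmem):
    """Run from PC 0 until the PC leaves the instruction memory."""
    pc = 0
    while pc in imem:
        pc = step(pc, regs, imem, dmem)


if __name__ == "__main__":
    regs = [0] * 32
    regs[2] = 100
    dmem = {104: 7}
    imem = {0: ("lw", 1, 2, 4), 4: ("add", 3, 1, 1), 8: ("sw", 3, 2, 8)}
    run(imem, regs, dmem)
    print(regs[1], regs[3], dmem[108])
```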

Typical implementation of an instruction (III)
[Figure: datapath divided into the five steps Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Labeled elements: PC, Next PC, Next SEQ PC, adder (+4), instruction memory, instruction fields RS1, RS2, RD, and Imm, register file, sign extend, MUXes, ALU, Zero?, data memory, LMD, WB Data.]

Datapath (I)
Fetching instructions and incrementing the program counter (PC)
[Figure: the PC supplies the read address of the instruction memory, which outputs the instruction; an adder adds 4 to the PC.]

Datapath (II)
ALU instructions, e.g. add R1, R2, R3
[Figure: register file with inputs Read register 1, Read register 2, Write register (register numbers, 5 bits each), Write data, and control signal RegWrite; its outputs Read data 1 and Read data 2 feed the ALU, which produces ALU result and Zero under the ALU operation control signal.]
• Register number input is 5 bits wide if you have 32 (= 2^5) registers
• RegWrite: write control signal
• ALU operation: control signal (4 bits)

Datapath (III)
Load/Store instructions, e.g. LW R1,offset(R2)
[Figure: data memory with inputs Address and Write data, output Read data, and control signals MemRead and MemWrite; sign-extend unit widening 16 bits to 32 bits.]
Basic steps for a load/store operation
• Sign extend the offset from 16 to 32 bits
• Add the sign-extended offset to R2
• Load the content of the resulting address into R1, or
• Store the data from R1 into the resulting memory address
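The address calculation for a load or store can be sketched as follows. The 32-bit wrap-around and the function names are assumptions chosen for illustration.

```python
def sign_extend(value: int, bits: int = 16) -> int:
    """Sign extend a `bits`-wide two's-complement field to a Python int."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend(offset16)) & 0xFFFFFFFF


if __name__ == "__main__":
    # LW R1, 8(R2) with R2 = 0x1000
    print(hex(effective_address(0x1000, 8)))
    # LW R1, -4(R2): the offset field holds 0xFFFC
    print(hex(effective_address(0x1000, 0xFFFC)))
```

Without the sign extension, the offset field 0xFFFC would be treated as +65532 rather than -4.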

Datapath (IV)
Combining Load/Store and ALU instructions
[Figure: register file, sign extend (16 to 32 bits), ALU (4-bit ALU operation control), and data memory (MemRead, MemWrite) combined into one datapath. Two MUXes with inputs 0 and 1 are controlled by the signals ALUsrc and MemtoReg.]

Datapath (V)
Branches, e.g. beq R1,R2,offset
Basic steps for a branch equal instruction
• Compute branch target address
  – Sign extend the offset field
  – Shift the offset field left by 2 bits in order to ensure a word offset
  – Add the shifted, sign-extended offset to the PC
• Compare registers R1 and R2
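A sketch of the branch-equal steps in code. It assumes that the offset is added to the already incremented PC (PC+4), as labeled on the following datapath slide, and that addresses wrap at 32 bits; the function names are illustrative.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """PC+4 plus the sign-extended offset shifted left by 2 (word offset)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def beq_next_pc(pc: int, r1_value: int, r2_value: int, offset16: int) -> int:
    """Return the next PC for beq: the target if the registers are equal,
    otherwise the next sequential PC."""
    zero = (r1_value - r2_value) == 0  # ALU subtracts and tests for zero
    return branch_target(pc, offset16) if zero else pc + 4


if __name__ == "__main__":
    print(hex(beq_next_pc(0x100, 5, 5, 3)))  # registers equal: branch taken
    print(hex(beq_next_pc(0x100, 5, 6, 3)))  # registers differ: fall through
```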

Datapath (VI)
Implementation of branches, e.g. beq R1,R2,offset
[Figure: the register file outputs Read data 1 and Read data 2 feed the ALU, whose result goes to the branch control logic; the sign-extended offset (16 to 32 bits) passes through Shift Left 2 and is added to PC+4 from the instruction datapath to form the branch target.]

Visualizing pipelining
[Figure: pipeline diagram. Vertical axis: instruction order; horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Four instructions each pass through IF, ID, ALU, Mem, WB, and each starts one clock cycle after its predecessor.]
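The staggered diagram can be regenerated as text. This is a minimal sketch assuming one cycle per stage and no stalls; the stage names follow the slide, while the function names and column width are arbitrary choices.

```python
STAGES = ["IF", "ID", "ALU", "Mem", "WB"]


def pipeline_diagram(num_instructions: int):
    """Return one row per instruction; row i holds the stage name active in
    each clock cycle, or "" when the instruction is not in the pipeline."""
    total_cycles = len(STAGES) + num_instructions - 1 if num_instructions else 0
    rows = []
    for i in range(num_instructions):
        cells = [""] * total_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(cells)
    return rows


def format_diagram(rows) -> str:
    """Render the rows as fixed-width text with a cycle header."""
    if not rows:
        return ""
    header = "".join(f"C{c + 1}".ljust(5) for c in range(len(rows[0])))
    lines = [header.rstrip()]
    for cells in rows:
        lines.append("".join(cell.ljust(5) for cell in cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_diagram(pipeline_diagram(4)))
```

For four instructions the last write-back lands in cycle 8, one cycle past the seven cycles labeled on the slide.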

Effects of pipelining
• A pipeline of depth n requires n times the memory bandwidth of a non-pipelined processor for the same clock rate
• Separate data and instruction caches eliminate some memory conflicts
• Register file is used in stage ID and in WB
  – Usually not a conflict, since writes are executed in the first half of the clock cycle and reads in the second half
• Instructions in the pipeline should not attempt to use the same hardware resources at the same time
  – Introduce pipeline registers between successive stages of the pipeline
  – Registers are named after the stages they connect (e.g. IF/ID, ID/ALU, etc.)

[Figure: five-stage datapath (Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages. Next SEQ PC and RD are carried along through the pipeline registers.]

Pipeline Hazards
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions
  – Data hazards: instruction depends on result of prior instruction still in the pipeline
  – Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7 for the instruction sequence Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg and starts one cycle after its predecessor. In cycle 4 the Load is in DMem while Instr 3 is in Ifetch, so both need the single memory port.]

One Memory Port/Structural Hazards
[Figure: the same sequence with a stall, over clock cycles 1 to 7. Load, Instr 1, and Instr 2 proceed as before; a Stall row of bubbles takes the slot after Instr 2, and Instr 3 starts its Ifetch one cycle later.]

Speed Up Equation for Pipelining

$$\mathrm{CPI}_{\mathrm{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
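The speedup equation translates directly into a small function. This is a sketch; the parameter names are chosen here, and the default values encode the simple RISC case (ideal CPI of 1, equal cycle times).

```python
def pipeline_speedup(depth: float,
                     stall_cpi: float,
                     ideal_cpi: float = 1.0,
                     cycle_time_unpipelined: float = 1.0,
                     cycle_time_pipelined: float = 1.0) -> float:
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI)
                 * (unpipelined cycle time / pipelined cycle time)."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    clock_term = cycle_time_unpipelined / cycle_time_pipelined
    return cpi_term * clock_term


if __name__ == "__main__":
    # Without stalls the speedup equals the pipeline depth.
    print(pipeline_speedup(depth=5, stall_cpi=0.0))
    # A quarter stall cycle per instruction reduces it to 4.
    print(pipeline_speedup(depth=5, stall_cpi=0.25))
```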

Example: Dual-port vs. Single-port
• Machine A: Dual-ported memory (“Harvard Architecture”)
• Machine B: Single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe/1.05)) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
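The comparison can be checked numerically. In this sketch the pipeline depth of 5 is an arbitrary assumption (it cancels in the ratio), each load is assumed to cost one stall cycle on the single-ported machine, and the variable names are illustrative.

```python
def speedup(depth: float, stall_cpi: float, clock_ratio: float) -> float:
    """Speedup for ideal CPI = 1; clock_ratio is the unpipelined cycle time
    divided by the pipelined cycle time."""
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5               # assumed; cancels in the ratio below
load_fraction = 0.4     # loads are 40% of the instructions executed
stall_per_load = 1      # one stall cycle per load on the single memory port

speedup_a = speedup(depth, stall_cpi=0.0, clock_ratio=1.0)
speedup_b = speedup(depth, stall_cpi=load_fraction * stall_per_load,
                    clock_ratio=1.05)
ratio = speedup_a / speedup_b

if __name__ == "__main__":
    print(round(speedup_b / depth, 2), round(ratio, 2))
```

The faster clock of machine B does not make up for the stall it takes on 40% of its instructions.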

Data Hazard on R1
[Figure: pipeline diagram, time in clock cycles, stages IF, ID/RF, EX, MEM, WB (drawn as Ifetch, Reg, ALU, DMem, Reg), for the sequence below. Each of the four instructions after the add reads r1.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in our 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in DLX 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
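The three cases can be distinguished mechanically from the register names. The sketch below reports the dependences between an earlier instruction I and a later instruction J; whether a dependence turns into an actual hazard depends on the pipeline, as the slides note for WAR and WAW. The `Instr` type and function name are hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify the register dependences of J on I, where I precedes J."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # J reads what I writes (true dependence)
    if j.dest in i.srcs:
        found.add("WAR")  # J writes what I reads (anti-dependence)
    if i.dest == j.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    print(dependences(add, Instr("sub", "r4", ("r1", "r3"))))
    print(dependences(Instr("sub", "r4", ("r1", "r3")), add))
    print(dependences(Instr("sub", "r1", ("r4", "r3")), add))
```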

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg), time in clock cycles, for the sequence below; the result of the add is forwarded to the instructions that read r1.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard Even with Forwarding
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg), time in clock cycles, for the sequence below. The loaded value of r1 is available only after the DMem stage of the lw, which is too late for the ALU stage of the sub that follows it.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

Data Hazard Even with Forwarding
[Figure: the same sequence, time in clock cycles, with a one-cycle bubble inserted after the lw; sub, and, and or are each delayed by one cycle.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
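The load-use case can be detected by looking at adjacent instructions. This is a sketch assuming the 5-stage pipeline with full forwarding, so that only an instruction that directly follows a load and reads its destination register needs a one-cycle bubble; the tuple format `(op, dest, srcs)` is hypothetical.

```python
def load_use_stalls(program) -> int:
    """Count the bubbles needed for load-use hazards.

    program is a list of (op, dest, srcs) tuples in program order.
    Assumption: full forwarding, so only a consumer immediately after a
    load has to wait one cycle.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


def total_cycles(program, num_stages: int = 5) -> int:
    """Cycles to drain the pipeline, including load-use bubbles."""
    if not program:
        return 0
    return num_stages + len(program) - 1 + load_use_stalls(program)


if __name__ == "__main__":
    sequence = [
        ("lw", "r1", ("r2",)),
        ("sub", "r4", ("r1", "r6")),
        ("and", "r6", ("r1", "r7")),
        ("or", "r8", ("r1", "r9")),
    ]
    print(load_use_stalls(sequence), total_cycles(sequence))
```

The later `and` and `or` also read r1, but by the time they reach the ALU stage the loaded value can be forwarded, so only the `sub` causes a bubble.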

Branches: Pipelined Datapath
[Figure: pipelined five-stage datapath (Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Labeled elements: two adders (one adding 4), Next SEQ PC, Next PC, instruction memory, fields RS1, RS2, RD, and Imm, register file, sign extend, Zero?, MUXes, ALU, data memory, WB Data.]

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% branches taken on average
  – But haven’t calculated branch target address yet
    • Still incurs 1 cycle branch penalty
    • Other machines: branch target known before outcome
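The cost of alternatives #1 and #2 can be compared with a simple CPI estimate. This is a sketch under stated assumptions: the 20% branch frequency is a made-up illustrative value, the penalty is taken to be the 1 cycle mentioned on the slide, and 53% of branches are taken as quoted above.

```python
def cpi_with_branches(branch_fraction: float,
                      taken_fraction: float,
                      penalty_taken: float,
                      penalty_not_taken: float,
                      ideal_cpi: float = 1.0) -> float:
    """Ideal CPI plus the average branch penalty per instruction."""
    avg_penalty = (taken_fraction * penalty_taken
                   + (1.0 - taken_fraction) * penalty_not_taken)
    return ideal_cpi + branch_fraction * avg_penalty


BRANCH_FRACTION = 0.20  # assumption for illustration, not from the slides
TAKEN_FRACTION = 0.53   # from the slide: 53% of branches taken on average

# #1 Stall: every branch pays the 1-cycle penalty.
cpi_stall = cpi_with_branches(BRANCH_FRACTION, TAKEN_FRACTION, 1, 1)
# #2 Predict not taken: only taken branches pay the penalty.
cpi_predict_not_taken = cpi_with_branches(BRANCH_FRACTION, TAKEN_FRACTION, 1, 0)

if __name__ == "__main__":
    print(round(cpi_stall, 3), round(cpi_predict_not_taken, 3))
```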

Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      branch target if taken
    (sequential successors 1 to n: branch delay of length n)
  – 1 slot delay allows proper decision and branch target address in 5-stage pipeline

Delayed Branch
• Where to get instructions to fill branch delay slot?
  – Before branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
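The two compiler figures can be combined into a single estimate. This sketch assumes the 80% figure applies to the slots the compiler managed to fill, so the two fractions multiply; the function name is illustrative.

```python
def useful_slot_fraction(fill_rate: float, useful_rate: float) -> float:
    """Fraction of all branch delay slots that do useful work, assuming
    useful_rate is measured over the filled slots only."""
    return fill_rate * useful_rate


if __name__ == "__main__":
    fraction = useful_slot_fraction(fill_rate=0.60, useful_rate=0.80)
    print(round(fraction, 2))
```

Under that assumption roughly half of the delay slots contribute to the computation; the rest behave like stall cycles.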