l17 – pipelining 1 comp 411 – spring 2008 4/1/2008 pipelining between spring break and 411...
TRANSCRIPT
L17 – Pipelining 1Comp 411 – Spring 2008 4/1/2008
Pipelining
Between Spring break and 411
problems sets, I haven’t had a minute to do
laundry
Now that’s what Icall dirty laundry
Read Chapter 6.1
L17 – Pipelining 2Comp 411 – Spring 2008 4/1/2008
Forget 411… Let’s Solve a “Relevant Problem”
Device: Washer
Function: Fill, Agitate, Spin
WasherPD = 30 mins
Device: Dryer
Function: Heat, Spin
DryerPD = 60 mins
INPUT:dirty laundry
OUTPUT:4 more weeks
L17 – Pipelining 3Comp 411 – Spring 2008 4/1/2008
One Load at a Time
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do.
The fact is, doing laundry one load at a time is not smart.
(Sorry Mom, but you were wrong about this one!)
Step 1:
Step 2:
Total = WasherPD + DryerPD
= _________ mins90
L17 – Pipelining 4Comp 411 – Spring 2008 4/1/2008
Doing N Loads of Laundry
Here’s how they do laundry at Duke, the “combinational” way.
(Actually, this is just an urban legend. No one at Duke actually does laundry. The butler’s all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched by dinner)
Step 1:
Step 2:
Step 3:
Step 4:
Total = N*(WasherPD + DryerPD)
= ____________ minsN*90
…
L17 – Pipelining 5Comp 411 – Spring 2008 4/1/2008
Doing N Loads… the UNC way
UNC students “pipeline” the laundry process.
That’s why we wait!
Step 1:
Step 2:
Step 3:
Total = N * Max(WasherPD, DryerPD)
= ____________ mins
N*60
…Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
L17 – Pipelining 6Comp 411 – Spring 2008 4/1/2008
Recall Our Performance Measures
Latency:The delay from when an input is established until the output associated with that input becomes valid.
(Duke Laundry = _________ mins)( UNC Laundry = _________ mins)
Throughput:The rate of which inputs or outputs are processed.
(Duke Laundry = _________ outputs/min)( UNC Laundry = _________ outputs/min)
90
120
1/90
1/60
Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available.
Even though we increaselatency, it takes less time per load
L17 – Pipelining 7Comp 411 – Spring 2008 4/1/2008
Okay, Back to Circuits…
F
G
HX P(X)
For combinational logic: latency = tPD, throughput = 1/tPD.
We can’t get the answer faster, but are we making effective use of our hardware at all times?
G(X)
F(X)
P(X)
X
F & G are “idle”, just holding their outputs stable while H performs its computation
L17 – Pipelining 8Comp 411 – Spring 2008 4/1/2008
Pipelined Circuits
use registers to hold H’s input stable!
F
G
HX P(X)
15
20
25
Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline : if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.Suppose F, G, H have propagation delays of 15,
20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0): latency
45______
throughput1/45______
unpipelined2-stage pipeline 50
worse
1/25better
Pipelining uses
registers to improve
the throughput
of combinational circuits
L17 – Pipelining 9Comp 411 – Spring 2008 4/1/2008
Pipeline Diagrams
Input
F Reg
G Reg
H Reg
i i+1 i+2 i+3
Xi Xi+1
F(Xi)
G(Xi)
Xi+2
F(Xi+1)
G(Xi+1)
H(Xi)
Xi+3
F(Xi+2)
G(Xi+2)
H(Xi+1)
Clock cycleP
ipelin
e s
tag
es
The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle.
H(Xi+2)
…
…
F
G
HX P(X)
15
20
25
This is an exampleof parallelism. Atany instant we arecomputing 2 results.
L17 – Pipelining 10Comp 411 – Spring 2008 4/1/2008
Pipelining Summary
Advantages:– Higher throughput than combinational system– Different parts of the logic work on different
parts of the problem…
Disadvantages:– Generally, increases latency– Only as good as the *weakest* link
(often called the pipeline’s BOTTLENECK)
Isn’t there a way around this “weak link” problem?
This bottleneckis the onlyproblem
L17 – Pipelining 11Comp 411 – Spring 2008 4/1/2008
How do UNC students REALLY do Laundry?
They work around the bottleneck. First, they find a place with twice as many dryers as washers.
Throughput = ______ loads/min
Latency = ______ mins/load
Step 1:
Step 2:
Step 3:
Step 4:
1/30
90
L17 – Pipelining 12Comp 411 – Spring 2008 4/1/2008
Step 5:
Better Yet… Parallelism
Step 1:
Step 2:
Step 3:
Step 4:
We can combine interleavingand pipelining with parallelism.
Throughput =_______ load/min
Latency = _______ min
2/30 = 1/15
90
L17 – Pipelining 13Comp 411 – Spring 2008 4/1/2008
“Classroom Computer”
Row 1 Row 2 Row 3 Row 4 Row 5 Row 6Psets in
There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on… Assuming we want to pipeline the grading, how do we time the passing of papers between rows?
L17 – Pipelining 14Comp 411 – Spring 2008 4/1/2008
Controls for “Classroom Computer”
Synchronous Asynchronous
GloballyTimed
LocallyTimed
Teacher picks time interval long enough for worst-case student to grade toughest problem. Everyone passes psets at end of interval.
Teacher picks variable time interval long enough for current students to grade current set of problems. Everyone passes psets at end of interval.
Students raise hands when they finish grading current problem. Teacher checks every 10 secs, when all hands are raised, everyone passes psets to the row behind. Variant: students can pass when all students in a “column” have hands raised.
Students grade current problem, wait for student in next row to be free, and then pass the pset back.
L17 – Pipelining 15Comp 411 – Spring 2008 4/1/2008
Control Structure Taxonomy
Synchronous Asynchronous
GloballyTimed
LocallyTimed
Centralized clocked FSM generates all control signals.
Central control unit tailors current time slice to current tasks.
Start and Finish signals generated by each major subsystem, synchronously with global clock.
Each subsystem takes asynchronous Start, generates asynchronous Finish (perhaps using local clock).
Easy to design but fixed-sized interval can be wasteful (no data-dependencies in timing)
Large systems lead to very complicated timing generators… just say no!
The best way to build large systems that have independent components.
The “next big idea” for the last several decades: a lot of design work to do in general, but extra work is worth it in special cases
L17 – Pipelining 16Comp 411 – Spring 2008 4/1/2008
Review of CPU Performance
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
MIPS =FreqCPI
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improving performance.
L17 – Pipelining 17Comp 411 – Spring 2008 4/1/2008
miniMIPS TimingNew PC
PC+4 Fetch Inst.
Control Logic
Read Regs
ASEL mux BSEL mux
ALU
Fetch data
WDSEL mux
RF setupPC setup Mem setup
PCSEL mux
+OFFSET
CLK
CLK
The diagram on the left illustrates the Data Flow of miniMIPS
Wanted: longest path
Complications:
•some apparent paths aren’t “possible”
•functional units have variable execution times (eg, ALU)
•time axis is not to scale (eg, tPD,MEM is very big!)
Sign Extend
WASEL mux
L17 – Pipelining 18Comp 411 – Spring 2008 4/1/2008
Where Are the Bottlenecks?
Pipelining goal:Break LONG combinational paths memories, ALU in separate stages
WA
PC
+4
InstructionMemory
A
D
RegisterFile
RA1 RA2
RD1 RD2
ALUA B
WA
ALUFN
Control Logic
Data Memory
RD
WD R/W
Adr
Wr
WDSEL0 1 2
BSELWDSELALUFNWr
J:<25:0>
PCSEL
WERF
WERF
00
PC+4
Rt: <20:16>
Imm: <15:0>
ASEL
SEXT
+x4
BT
Z
BT
WASEL
Rd:<15:11>Rt:<20:16> 0
123
WASEL
PC<31:29>:J<25:0>:00
JT
JT
N V C
ZVN C
Rs: <25:21>
ASEL20
SEXT
BSEL01
SEXT
shamt:<10:6>
PCSEL 0123456
“16”
IRQ
0x800000800x800000400x80000000
RESET
“31”“27”
1
WD
WE
L17 – Pipelining 19Comp 411 – Spring 2008 4/1/2008
Ultimate Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:IF Instruction Fetch stage: Maintains
PC, fetches one instruction per cycle and passes it to
WB Write-Back stage: writes result back into register file.
ID/RFInstruction Decode/Register File
stage: Decode control lines and select source operands
ALUALU stage: Performs specified
operation, passes result to
MEM
Memory stage: If it’s a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to
L17 – Pipelining 20Comp 411 – Spring 2008 4/1/2008
miniMIPS Timing
Different instructions use various parts of the data path.
add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000
Program execution order
TimeCLK
Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData Access
Register Setup
sw $2, 20($4)
This is an example of a “Asynchronous Globally-Timed” control strategy (see Lecture 18). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!
1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS
6 nS2 nS2 nS5 nS4 nS6 nS1 nS
L17 – Pipelining 21Comp 411 – Spring 2008 4/1/2008
Uniform miniMIPS Timing
With a fixed clock period, we have to allow for the worse case.
add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000
Program execution order
TimeCLK
Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData Access
Register Setup
sw $2, 20($4)
By accounting for the “worse case” path (i.e. allowing time for each possible combination of operations) we can implement a “Synchronous Globally-Timed” control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!
Isn’t the net effect just a slower CPU?
1 instr EVERY 20 nS
6 nS2 nS2 nS5 nS4 nS6 nS1 nS
L17 – Pipelining 22Comp 411 – Spring 2008 4/1/2008
Step 1: A 2-Stage Pipeline
IF
EXE
WA
PC
+4
InstructionMemory
A
D
RegisterFile
RA1 RA2
RD1 RD2
ALUA B
WA WD
WE
ALUFN
Control Logic
Data Memory
RD
WD R/W
Adr
Wr
WDSEL0 1 2
BSELWDSELALUFNWr
J:<25:0>
PCSEL
WERF
WERF
00
PC+4
Imm: <15:0>
ASEL
SEXT
+x4
BT
Z
BT
WASEL
Rd:<15:11>Rt:<20:16>
0123
WASEL
PC<31:29>:J<25:0>:00
JT
JT
N V C
ZVN C
Rt: <20:16>Rs: <25:21>
ASEL20
SEXT
BSEL01
SEXT
shamt:<10:6>
PCSEL 0123456
“16”
IRQ
0x800000800x800000400x80000000
RESET
“31”“27”
1
PCEXE 00 IREXE
IR stands for “Instruction Register”. The superscript “EXE” denotes the pipeline stage, in which the PC and IR are used.
L17 – Pipelining 23Comp 411 – Spring 2008 4/1/2008
2-Stage Pipe TimingImproves performance by increasing instruction throughput.
Ideal speedup is number of pipeline stages in the pipeline.
add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000
Program execution order
TimeCLK
Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData Access
Register Setup
sw $2, 20($4)
By partitioning each instruction cycle into a “fetch” stage and an “execute” stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores.
Latency?
Throughput?
2 Clock periods = 2*14 nS1 instrper14 nS
During this, and all subsequent clocks two instructions are in various stages of execution
6 nS2 nS2 nS5 nS4 nS6 nS1 nS
L17 – Pipelining 24Comp 411 – Spring 2008 4/1/2008
2-Stage w/2-Cycle Loads & StoresFurther improves performance, with slight increase in control complexity.
Some 1st generation (pre-cache) RISC processors used this approach.
add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000
Program execution order
TimeCLK
Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData Access
Register Setup
sw $2, 20($4)
The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster?
923.1upspeed
)3.1(820
)3.0*27.0(periodclocknewperiodclockold
This design is very similar to the multicycle CPU described in section 5.5 of the text, but with pipelining.
Clock: 8 nS!
6 nS2 nS2 nS5 nS4 nS6 nS1 nS
Not likely. In practice, as many as 30% of instructions access memory. Thus, the effective speed up is:
L17 – Pipelining 25Comp 411 – Spring 2008 4/1/2008
2-Stage Pipelined Operation
...addi $t2,$t1,1xor $t2,$t1,$t2sltiu $t3,$t2,1srl $t2,$t2,1...
Consider a sequenceof instructions:
Executed on our 2-stage pipeline:
IF
EXE
i i+1 i+2 i+3 i+4 i+5 i+6
...xoraddi srlsltiu
...xoraddi srlsltiu
TIME (cycles)
Pip
elin
e
Recall “Pipeline Diagrams” from an earlier slide.
It can’t be this easy!?
L17 – Pipelining 26Comp 411 – Spring 2008 4/1/2008
ALUA B
ALUFN
Data MemoryRD
WD R/WAdrWr
WDSEL0 1 2
PC+4
ZVN C
PC
+4
InstructionMemory
AD
00
BT
PC<31:29>:J<25:0>:00
JT
PCSEL 0123456
0x800000800x800000400x80000000
PCREG 00 IRREG
WARegister
FileRA1 RA2
RD1 RD2
J:<25:0>
Imm: <15:0>
+x4
BT
JT
Rt: <20:16>Rs: <25:21>
ASEL20 BSEL01
SEXTSEXT
shamt:<10:6>
“16”
1
=BZ
Step 2: 4-Stage miniMIPS
PCALU 00 IRALU
A B WDALU
PCMEM 00 IRMEM Y WDMEM
WARegister
File
WA WD
WEWERF
WASEL
Rd:<15:11>Rt:<20:16>
“31”
“27”
0 1 2 3
(NB: SAME RF AS ABOVE!)
Treats register file as two separate devices: combinational READ, clocked WRITE at end of pipe.
What other information do we have to pass down pipeline? PC instruction fields
What sort of improvement should expect in cycle time?
(return addresses)
(decoding)
InstructionFetch
RegisterFile
ALU
WriteBack
L17 – Pipelining 27Comp 411 – Spring 2008 4/1/2008
4-Stage miniMIPS Operation
...addi $t0,$t0,1sll $t1,$t1,2andi $t2,$t2,15sub $t3,$0,$t3...
Consider a sequenceof instructions:
Executed on our 4-stage pipeline:
IF
RF
ALU
WB
i i+1 i+2 i+3 i+4 i+5 i+6
... slladdi subandi
... slladdi subandi
... slladdi subandi
slladdi subandi
TIME (cycles)
Pip
elin
e