l17 – pipelining 1 comp 411 – spring 2008 4/1/2008 pipelining between spring break and 411...

L17 – Pipelining 1Comp 411 – Spring 2008 4/1/2008

Pipelining

Between Spring break and 411

problems sets, I haven’t had a minute to do

laundry

Now that’s what Icall dirty laundry

Read Chapter 6.1


Forget 411… Let’s Solve a “Relevant Problem”

Device: Washer

Function: Fill, Agitate, Spin

WasherPD = 30 mins

Device: Dryer

Function: Heat, Spin

DryerPD = 60 mins

INPUT:dirty laundry

OUTPUT:4 more weeks


One Load at a Time

Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do.

The fact is, doing laundry one load at a time is not smart.

(Sorry Mom, but you were wrong about this one!)

Step 1:

Step 2:

Total = WasherPD + DryerPD

= _________ mins90


Doing N Loads of Laundry

Here’s how they do laundry at Duke, the “combinational” way.

(Actually, this is just an urban legend. No one at Duke actually does laundry. The butler’s all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched by dinner)

Step 1:

Step 2:

Step 3:

Step 4:

Total = N*(WasherPD + DryerPD)

= ____________ minsN*90

…


Doing N Loads… the UNC way

UNC students “pipeline” the laundry process.

That’s why we wait!

Step 1:

Step 2:

Step 3:

Total = N * Max(WasherPD, DryerPD)

= ____________ mins

N*60

…Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.

Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.


Recall Our Performance Measures

Latency:The delay from when an input is established until the output associated with that input becomes valid.

(Duke Laundry = _________ mins)( UNC Laundry = _________ mins)

Throughput:The rate of which inputs or outputs are processed.

(Duke Laundry = _________ outputs/min)( UNC Laundry = _________ outputs/min)

90

120

1/90

1/60

Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available.

Even though we increaselatency, it takes less time per load


Okay, Back to Circuits…

F

G

HX P(X)

For combinational logic: latency = tPD, throughput = 1/tPD.

We can’t get the answer faster, but are we making effective use of our hardware at all times?

G(X)

F(X)

P(X)

X

F & G are “idle”, just holding their outputs stable while H performs its computation


Pipelined Circuits

use registers to hold H’s input stable!

F

G

HX P(X)

15

20

25

Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline : if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.Suppose F, G, H have propagation delays of 15,

20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0): latency

45______

throughput1/45______

unpipelined2-stage pipeline 50

worse

1/25better

Pipelining uses

registers to improve

the throughput

of combinational circuits


Pipeline Diagrams

Input

F Reg

G Reg

H Reg

i i+1 i+2 i+3

Xi Xi+1

F(Xi)

G(Xi)

Xi+2

F(Xi+1)

G(Xi+1)

H(Xi)

Xi+3

F(Xi+2)

G(Xi+2)

H(Xi+1)

Clock cycleP

ipelin

e s

tag

es

The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle.

H(Xi+2)

…

…

F

G

HX P(X)

15

20

25

This is an exampleof parallelism. Atany instant we arecomputing 2 results.


Pipelining Summary

Advantages:– Higher throughput than combinational system– Different parts of the logic work on different

parts of the problem…

Disadvantages:– Generally, increases latency– Only as good as the *weakest* link

(often called the pipeline’s BOTTLENECK)

Isn’t there a way around this “weak link” problem?

This bottleneckis the onlyproblem


How do UNC students REALLY do Laundry?

They work around the bottleneck. First, they find a place with twice as many dryers as washers.

Throughput = ______ loads/min

Latency = ______ mins/load

Step 1:

Step 2:

Step 3:

Step 4:

1/30

90


Step 5:

Better Yet… Parallelism

Step 1:

Step 2:

Step 3:

Step 4:

We can combine interleavingand pipelining with parallelism.

Throughput =_______ load/min

Latency = _______ min

2/30 = 1/15

90


“Classroom Computer”

Row 1 Row 2 Row 3 Row 4 Row 5 Row 6Psets in

There are lots of problem sets to grade, each with six problems. Students in Row 1 grade Problem 1 and then hand it back to Row 2 for grading Problem 2, and so on… Assuming we want to pipeline the grading, how do we time the passing of papers between rows?


Controls for “Classroom Computer”

Synchronous Asynchronous

GloballyTimed

LocallyTimed

Teacher picks time interval long enough for worst-case student to grade toughest problem. Everyone passes psets at end of interval.

Teacher picks variable time interval long enough for current students to grade current set of problems. Everyone passes psets at end of interval.

Students raise hands when they finish grading current problem. Teacher checks every 10 secs, when all hands are raised, everyone passes psets to the row behind. Variant: students can pass when all students in a “column” have hands raised.

Students grade current problem, wait for student in next row to be free, and then pass the pset back.


Control Structure Taxonomy

Synchronous Asynchronous

GloballyTimed

LocallyTimed

Centralized clocked FSM generates all control signals.

Central control unit tailors current time slice to current tasks.

Start and Finish signals generated by each major subsystem, synchronously with global clock.

Each subsystem takes asynchronous Start, generates asynchronous Finish (perhaps using local clock).

Easy to design but fixed-sized interval can be wasteful (no data-dependencies in timing)

Large systems lead to very complicated timing generators… just say no!

The best way to build large systems that have independent components.

The “next big idea” for the last several decades: a lot of design work to do in general, but extra work is worth it in special cases


Review of CPU Performance

MIPS = Millions of Instructions/Second

Freq = Clock Frequency, MHz

CPI = Clocks per Instruction

MIPS =FreqCPI

To Increase MIPS:

1. DECREASE CPI.

- RISC simplicity reduces CPI to 1.0.

- CPI below 1.0? State-of-the-art multiple instruction issue

2. INCREASE Freq.

- Freq limited by delay along longest combinational path; hence

- PIPELINING is the key to improving performance.


miniMIPS TimingNew PC

PC+4 Fetch Inst.

Control Logic

Read Regs

ASEL mux BSEL mux

ALU

Fetch data

WDSEL mux

RF setupPC setup Mem setup

PCSEL mux

+OFFSET

CLK

CLK

The diagram on the left illustrates the Data Flow of miniMIPS

Wanted: longest path

Complications:

•some apparent paths aren’t “possible”

•functional units have variable execution times (eg, ALU)

•time axis is not to scale (eg, tPD,MEM is very big!)

Sign Extend

WASEL mux


Where Are the Bottlenecks?

Pipelining goal:Break LONG combinational paths memories, ALU in separate stages

WA

PC

+4

InstructionMemory

A

D

RegisterFile

RA1 RA2

RD1 RD2

ALUA B

WA

ALUFN

Control Logic

Data Memory

RD

WD R/W

Adr

Wr

WDSEL0 1 2

BSELWDSELALUFNWr

J:<25:0>

PCSEL

WERF

WERF

00

PC+4

Rt: <20:16>

Imm: <15:0>

ASEL

SEXT

+x4

BT

Z

BT

WASEL

Rd:<15:11>Rt:<20:16> 0

123

WASEL

PC<31:29>:J<25:0>:00

JT

JT

N V C

ZVN C

Rs: <25:21>

ASEL20

SEXT

BSEL01

SEXT

shamt:<10:6>

PCSEL 0123456

“16”

IRQ

0x800000800x800000400x80000000

RESET

“31”“27”

1

WD

WE


Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:IF Instruction Fetch stage: Maintains

PC, fetches one instruction per cycle and passes it to

WB Write-Back stage: writes result back into register file.

ID/RFInstruction Decode/Register File

stage: Decode control lines and select source operands

ALUALU stage: Performs specified

operation, passes result to

MEM

Memory stage: If it’s a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to


miniMIPS Timing

Different instructions use various parts of the data path.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000

Program execution order

TimeCLK

Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData Access

Register Setup

sw $2, 20($4)

This is an example of a “Asynchronous Globally-Timed” control strategy (see Lecture 18). Such a system would vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!

1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS

6 nS2 nS2 nS5 nS4 nS6 nS1 nS


Uniform miniMIPS Timing

With a fixed clock period, we have to allow for the worse case.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000


TimeCLK


Register Setup

sw $2, 20($4)

By accounting for the “worse case” path (i.e. allowing time for each possible combination of operations) we can implement a “Synchronous Globally-Timed” control strategy. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!

Isn’t the net effect just a slower CPU?

1 instr EVERY 20 nS



Step 1: A 2-Stage Pipeline

IF

EXE

WA

PC

+4

InstructionMemory

A

D

RegisterFile

RA1 RA2

RD1 RD2

ALUA B

WA WD

WE

ALUFN

Control Logic

Data Memory

RD

WD R/W

Adr

Wr

WDSEL0 1 2

BSELWDSELALUFNWr

J:<25:0>

PCSEL

WERF

WERF

00

PC+4

Imm: <15:0>

ASEL

SEXT

+x4

BT

Z

BT

WASEL

Rd:<15:11>Rt:<20:16>

0123

WASEL

PC<31:29>:J<25:0>:00

JT

JT

N V C

ZVN C

Rt: <20:16>Rs: <25:21>

ASEL20

SEXT

BSEL01

SEXT

shamt:<10:6>

PCSEL 0123456

“16”

IRQ

0x800000800x800000400x80000000

RESET

“31”“27”

1

PCEXE 00 IREXE

IR stands for “Instruction Register”. The superscript “EXE” denotes the pipeline stage, in which the PC and IR are used.


2-Stage Pipe TimingImproves performance by increasing instruction throughput.

Ideal speedup is number of pipeline stages in the pipeline.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000


TimeCLK


Register Setup

sw $2, 20($4)

By partitioning each instruction cycle into a “fetch” stage and an “execute” stage, we get a simple pipeline. Why not include the Instruction-Decode/Register-Access time with the Instruction Fetch? You could. But this partitioning allows for a useful variant with 2-cycle loads and stores.

Latency?

Throughput?

2 Clock periods = 2*14 nS1 instrper14 nS

During this, and all subsequent clocks two instructions are in various stages of execution



2-Stage w/2-Cycle Loads & StoresFurther improves performance, with slight increase in control complexity.

Some 1st generation (pre-cache) RISC processors used this approach.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000


TimeCLK


Register Setup

sw $2, 20($4)

The clock rate of this variant is over twice that of our original design. Does that mean it is that much faster?

923.1upspeed

)3.1(820

)3.0*27.0(periodclocknewperiodclockold

This design is very similar to the multicycle CPU described in section 5.5 of the text, but with pipelining.

Clock: 8 nS!


Not likely. In practice, as many as 30% of instructions access memory. Thus, the effective speed up is:


2-Stage Pipelined Operation

...addi $t2,$t1,1xor $t2,$t1,$t2sltiu $t3,$t2,1srl $t2,$t2,1...

Consider a sequenceof instructions:

Executed on our 2-stage pipeline:

IF

EXE

i i+1 i+2 i+3 i+4 i+5 i+6

...xoraddi srlsltiu

...xoraddi srlsltiu

TIME (cycles)

Pip

elin

e

Recall “Pipeline Diagrams” from an earlier slide.

It can’t be this easy!?


ALUA B

ALUFN

Data MemoryRD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

Step 2: 4-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM Y WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

(NB: SAME RF AS ABOVE!)

Treats register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline? PC instruction fields

What sort of improvement should expect in cycle time?

(return addresses)

(decoding)

InstructionFetch

RegisterFile

ALU

WriteBack


4-Stage miniMIPS Operation

...addi $t0,$t0,1sll $t1,$t1,2andi $t2,$t2,15sub $t3,$0,$t3...

Consider a sequenceof instructions:

Executed on our 4-stage pipeline:

IF

RF

ALU

WB

i i+1 i+2 i+3 i+4 i+5 i+6

... slladdi subandi

... slladdi subandi

... slladdi subandi

slladdi subandi

TIME (cycles)

Pip

elin

e

l17 – pipelining 1 comp 411 – spring 2008 4/1/2008 pipelining between spring break and 411...

Documents

l17 pipelining

duke laundry

mins unc laundry

washer pd dryer pd

laundry process

mins n

outputsmin unc laundry

mins input