

CSC506 Pipeline Homework – due Wednesday, June 9, 1999

Question 1. An instruction requires four stages to execute: stage 1 (instruction fetch) requires 30 ns, stage 2 (instruction decode) = 9 ns, stage 3 (instruction execute) = 20 ns, and stage 4 (store results) = 10 ns. An instruction must proceed through the stages in sequence. What is the minimum asynchronous time for any single instruction to complete?

• 30 + 9 + 20 + 10 = 69 ns.

Question 2. We want to set this up as a pipelined operation. How many stages should we have, and at what rate should we clock the pipeline?

• We have 4 natural stages given and no information on how we might be able to further subdivide them, so we use 4 stages in our pipeline. We have a choice of what clock rate to use. The simplest choice would be to use a clock cycle that accommodates the longest stage in our pipe – 30 ns. This would allow us to initiate a new instruction every 30 ns with a latency through the pipe of 30 ns x 4 stages = 120 ns. We could also pick a finer clock cycle that more closely matches the shortest stage (9 ns) but is integrally divisible into the other stages. A clock of 10 ns would be a good match and would require 3 clocks for the first stage, 1 clock for the second, 2 clocks for the third, and 1 clock for the fourth. This would allow us to initiate a new instruction every 30 ns but provide a latency of 70 ns rather than 120 ns. Either 30 ns or 10 ns is acceptable.
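
The arithmetic above can be sketched as a quick check of the two clocking options for the 30/9/20/10 ns stages (pipeline_timing is an assumed helper name, not part of the assignment):

```python
# Compute clocks per stage, initiation interval, and latency for a clock choice.
import math

stages_ns = [30, 9, 20, 10]

def pipeline_timing(clock_ns):
    """Clocks per stage, initiation interval, and total latency for one clock choice."""
    clocks = [math.ceil(t / clock_ns) for t in stages_ns]
    initiation_ns = max(clocks) * clock_ns   # one new instruction per longest stage
    latency_ns = sum(clocks) * clock_ns      # total time through all stages
    return clocks, initiation_ns, latency_ns

print(pipeline_timing(30))   # → ([1, 1, 1, 1], 30, 120)
print(pipeline_timing(10))   # → ([3, 1, 2, 1], 30, 70)
```

Both choices give the same 30 ns initiation interval; the finer clock wins only on latency.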

Question 3. For the pipeline in question 2, how frequently can we initiate the execution of a new instruction, and what is the latency?

• See the answer to question 2: either choice of clock lets us initiate a new instruction every 30 ns, with a latency of 120 ns for the 30 ns clock or 70 ns for the 10 ns clock.

Question 4. What is the speedup of the pipeline in question 2?

• Speedup per Stone's preferred definition is (30 + 9 + 20 + 10)/30 = 2.3

• Speedup per best clocked definition is (30 + 10 + 20 + 10)/30 = 2.33
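
The two speedup figures quoted above can be recomputed directly (assumed formulas: unpipelined execution time divided by the pipeline initiation interval, with stage times rounded up to whole 10 ns clocks for the second definition):

```python
# Speedup of the pipelined design over serial execution, two ways.
stage_ns = [30, 9, 20, 10]
clock_ns = 10

# Round each stage up to a whole number of 10 ns clocks: [30, 10, 20, 10].
clocked_ns = [((t + clock_ns - 1) // clock_ns) * clock_ns for t in stage_ns]

speedup_stone = sum(stage_ns) / max(stage_ns)         # 69 / 30
speedup_clocked = sum(clocked_ns) / max(clocked_ns)   # 70 / 30

print(round(speedup_stone, 2), round(speedup_clocked, 2))   # → 2.3 2.33
```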


Question 5. Draw the reduced state-diagram and show the maximum-rate cycle using the following collision vector:

1 0 0 0 1 1

The permissible latencies from the initial state are 2, 3, and 4; any latency of 7 or more empties the six-bit state and returns the pipeline to the initial state. The reduced state diagram (each state labeled with its collision vector, each transition with an initiation latency):

1 0 0 0 1 1:  2 → 1 0 1 1 1 1;  3 → 1 1 1 0 1 1;  4 → 1 1 0 0 1 1
1 0 1 1 1 1:  2 → 1 1 1 1 1 1
1 1 1 1 1 1:  (no permissible latency below 7)
1 1 1 0 1 1:  4 → 1 1 0 0 1 1
1 1 0 0 1 1:  3 → 1 1 1 0 1 1;  4 → 1 1 0 0 1 1

From every state, a latency of 7 or more returns to the initial state 1 0 0 0 1 1.

• The maximum-rate cycle is the sequence 3, 4, 3, 4, 3, . . . giving two operations initiated every seven cycles, or 0.29 ops/cycle. The greedy cycle is 2, 2, 7, 2, 2, 7, 2, 2, . . . giving three operations initiated every 11 cycles, or 0.27 ops/cycle. This is a case where the greedy cycle is not the optimum.
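
These cycles can be checked by brute force (helper names are mine; bit i of a state, counting from 1, is set when an initiation i cycles in the future would collide with an operation already in the pipeline):

```python
# Brute-force search for the maximum-rate initiation cycle of a static pipeline.
from itertools import product

CV = (1, 0, 0, 0, 1, 1)    # Question 5 collision vector: latencies 1, 5, 6 forbidden
N = len(CV)

def next_state(state, k):
    """Shift the state left by k cycles, then OR in the collision vector."""
    shifted = tuple(state[i + k] if i + k < N else 0 for i in range(N))
    return tuple(s | c for s, c in zip(shifted, CV))

def sustainable(seq, reps=16):
    """True if the latency sequence can repeat forever without a collision."""
    state = CV
    for k in list(seq) * reps:
        if k <= N and state[k - 1] == 1:
            return False
        state = next_state(state, k)
    return True

rate = lambda seq: len(seq) / sum(seq)    # initiations per clock

# Latency 7 already clears the six-bit state, so larger latencies can't help.
candidates = [seq for r in (1, 2, 3)
              for seq in product(range(2, N + 2), repeat=r)
              if sustainable(seq)]
best = max(candidates, key=rate)
print(best, round(rate(best), 2))         # → (3, 4) 0.29
```

Running the same check on (2, 2, 7) confirms the greedy cycle is sustainable but slower at 3/11 ≈ 0.27 ops/cycle.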



Question 6. We have a RISC processor with register-register arithmetic instructions that have the format R1 ← R2 op R3. The pipeline for these instructions runs with a 100 MHz clock with the following stages: instruction fetch = 2 clocks, instruction decode = 1 clock, fetch operands = 1 clock, execute = 2 clocks, and store result = 1 clock.

a) At what rate (in MIPS) can we execute register-register instructions that have no data dependencies with other instructions?

b) At what rate can we execute the instructions when every instruction depends on the results of the previous instruction?

c) We implement internal forwarding. At what rate can we now execute the instructions when every instruction depends on the results of the previous instruction?

Op             1   2   3   4   5   6   7   8   9  10  11  12
Inst Fetch     1   1   2   2   3   3   4   4   5   5   6   6
Inst Decode            1       2       3       4       5
Op Fetch                   1       2       3       4       5
Execute                        1   1   2   2   3   3   4   4
Op Store                               1       2       3

• a) No dependencies rate = 1 inst/2 cycles at 100 MHz clock = 50 MIPS.


Op             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Inst Fetch     1   1   2   2   3   3   4   4   5   5   w   w   6   6   w
Inst Decode            1       2       3   w   4   w   w   w   5   w   w
Op Fetch                   1       w   w   2   w   w   w   3   w   w   w
Execute                        1   1           2   2           3   3
Op Store                               1               2               3
(w = wait)

• b) Dependencies rate = 1 inst/4 cycles = 25 MIPS. The reservation table shows that, although we begin fetching instructions every two cycles, the Operand Fetch unit must wait until the prior instruction stores its result before it can retrieve one of its operands (e.g. Op Fetch for #2 must wait until Op Store for #1 completes). As a result, things begin backing up in the pipeline, and we produce one instruction output only every 4 cycles.

Op             1   2   3   4   5   6   7   8   9  10  11  12
Inst Fetch     1   1   2   2   3   3
Inst Decode            1       2       3
Op Fetch                   1       2       3
Execute                        1   1   2   2   3   3
Op Store                               1       2       3

• c) Dependencies with internal forwarding rate = 1 inst/2 cycles = 50 MIPS. If we implement internal forwarding, the operand fetch unit can bypass fetching the dependent operand and just rename the dependent operand input register to be the result of instruction 1. The result is available in time for the next calculation; we just have to point one of the inputs for instruction 2 execution to the internal register that receives the output of instruction 1 in order to get it. We can then proceed without waiting.
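
A small sanity check (helper name assumed) ties the three answers to the 100 MHz clock: MIPS is the clock rate in MHz divided by the steady-state cycles between instruction completions read off each reservation table.

```python
# MIPS from the steady-state completion interval in each reservation table.
clock_mhz = 100

def mips(cycles_between_completions):
    return clock_mhz / cycles_between_completions

rate_a = mips(2)   # a) no dependencies: a result every 2 cycles
rate_b = mips(4)   # b) dependent operand waits for the prior Op Store
rate_c = mips(2)   # c) internal forwarding removes the wait
print(rate_a, rate_b, rate_c)   # → 50.0 25.0 50.0
```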


Question 7. Conditional branches are a problem with instruction pipelines. For the RISC processor described in question 6, we decide to implement branches by always assuming the branch will not be taken rather than implementing some form of branch prediction or speculative execution, and we do not implement internal forwarding. We don't know that the instruction is a branch until stage 2 (decode), we don't know the condition code setting (for instructions that set the condition code) until stage 5 (operand store) is complete, and we can't provide the target address (of a branch taken) to stage 1 until the end of stage 5. Assume a sequence of instructions where the condition code setting instruction immediately precedes the conditional branch.

a) What penalty in lost cycles do we incur for the branch not taken?

b) What penalty in lost cycles do we incur for the branch taken?

c) We implement delayed branching and the conditional branch is a delayed conditional branch. What penalty in lost cycles do we incur for the delayed branch taken?

d) We implement internal forwarding along with the delayed branch. What penalty in lost cycles do we incur for the delayed branch taken with internal forwarding?

Op             1   2   3   4   5   6   7   8   9  10  11  12
Inst Fetch    CC  CC  BR  BR NSI NSI 2SI 2SI 3SI 3SI 4SI 4SI
Inst Decode           CC      BR     NSI   h 2SI   h 3SI   h
Op Fetch                  CC       w   w  BR NSI   h 2SI   h
Execute                       CC  CC          BR  BR NSI NSI
Op Store                              CC              BR
(w = wait, h = hold)

• a) We have a data dependency between the CC instruction and the branch instruction. The operand fetch unit must wait 2 cycles until the CC is stored by the operand store unit before fetching it for use by the branch instruction. The penalty depends on how we implement the pipeline. If we force the operand fetch 2-cycle delay up the pipeline, we introduce a two-cycle delay, even when the branch is not taken. Penalty of 2 cycles for a branch not taken.


• However, we have two cycles of buffering in the pipeline – one cycle in the instruction decode unit (it waits every other cycle) and one cycle in the operand fetch unit (it also waits every other cycle). If these units can each hold onto their results for a cycle until the next stage is available as shown in the reservation table, we take no penalty for this particular instruction pair. However, we will end up taking the 2-cycle penalty when the next two (and every succeeding) instruction pairs with data dependencies come along. Penalty of 2 cycles for a branch not taken.

Op             1   2   3   4   5   6   7   8   9  10  11  12  13
Inst Fetch    CC  CC  BR  BR NSI NSI 2SI 2SI 3SI 3SI   w  BT  BT
Inst Decode           CC      BR     NSI   h 2SI   h
Op Fetch                  CC       w   w  BR NSI   h
Execute                       CC  CC          BR  BR
Op Store                              CC              BR
(w = wait, h = hold)

• b) The operand fetch unit must still wait 2 cycles until the CC is available from the operand store unit. By the time we know the outcome of the branch instruction (clock 10), we have fetched the next three sequential instructions. We must stop the execution of NSI, dump the 2SI and 3SI instructions, and stop the instruction fetch unit from fetching the fourth sequential instruction after the branch, for 6 wasted clocks. Since the instruction fetch unit can't get the new program counter address for the branch target (BT) instruction until clock 11, it can't begin fetching the instruction at the target address until clock 12, so we make it wait one more cycle. The total penalty for the branch taken is 7 cycles. If you assumed that the instruction fetch unit could not be stopped after fetching 3SI and proceeded to fetch 4SI, the total penalty is 8 cycles.


Op             1   2   3   4   5   6   7   8   9  10  11  12  13
Inst Fetch    CC  CC  BA  BA NSI NSI 2SI 2SI 3SI 3SI   w  BT  BT
Inst Decode           CC      BA     NSI   h 2SI   h
Op Fetch                  CC       w   w  BA NSI   h
Execute                       CC  CC          BA  BA NSI NSI
Op Store                              CC              BA     NSI
(w = wait, h = hold)

• c) The difference here is that we do not need to stop the execution of NSI on a delayed branch. It can continue to completion, but we still need to dump the 2SI and 3SI instructions and stop the instruction fetch unit from fetching the fourth sequential instruction after the branch. The instruction fetch unit still can't get the new program counter address until clock 11, and it can't begin fetching the instruction at the target address until clock 12, so the total penalty for the branch taken is only 5 cycles: the four we lost by fetching the 2SI and 3SI instructions, and the one it had to wait before proceeding with the new PC. Again, if you assumed that 4SI was fetched, it would be 6 cycles.

Op             1   2   3   4   5   6   7   8   9  10  11  12  13
Inst Fetch    CC  CC  BA  BA NSI NSI 2SI 2SI  BT  BT  2T  2T  3T
Inst Decode           CC      BA     NSI
Op Fetch                  CC      BA     NSI
Execute                       CC  CC  BA  BA NSI NSI
Op Store                              CC      BA     NSI

• d) Internal forwarding allows us to forward the condition code result directly from the CC Execute stage (clock 6) to the branch Execute stage (clock 7), so we don't delay the branch. We can also forward the branch target address directly from the output of the branch Execute stage (clock 8) to the instruction fetch unit, so we don't lose the branch Operand Store cycle in clock 9. We still need to dump the 2SI instruction that we pre-fetched, so the total penalty for the delayed branch taken with internal forwarding is only 2 cycles.
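
The three taken-branch penalties can be tallied with a rough bookkeeping model (my framing, not the text's): lost cycles = the clock at which the branch-target fetch begins minus the clock at which the next useful instruction would have been fetched in an unbroken stream.

```python
# Taken-branch penalty = target-fetch start clock minus next useful fetch clock.
def penalty(bt_fetch_clock, next_useful_fetch_clock):
    return bt_fetch_clock - next_useful_fetch_clock

case_b = penalty(12, 5)   # b) NSI discarded: its fetch slot (clock 5) is the first loss
case_c = penalty(12, 7)   # c) delayed branch: NSI completes, 2SI's slot (clock 7) is first loss
case_d = penalty(9, 7)    # d) delayed branch + forwarding: BT fetch starts at clock 9
print(case_b, case_c, case_d)   # → 7 5 2
```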


Question 8. What is a greedy cycle?

• The greedy cycle arises from initiating a new instruction into the pipeline at the first opportunity in each state. The greedy cycle is also the maximum-rate cycle in many cases, but not necessarily.

Question 9. Why would you implement a branch history table in a pipelined computer?

• A branch history table gives you a better guess than random on whether or not a conditional branch will be taken. The assumption is that recent history is a good predictor of the near future, the same idea that the LRU cache replacement algorithm is based on. If we have a long instruction pipeline, a good guess will reduce the number of times we have to discard instructions that we prefetch and start into the pipeline following a conditional branch.
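
An illustrative sketch of the idea (the two-bit saturating counter is a common entry format in branch history tables, though it is not described in the text): it takes two wrong guesses in a row to flip the prediction, so recent history dominates.

```python
# Two-bit saturating counter: a single branch-history-table entry.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2              # 0-1 predict not taken, 2-3 predict taken

    def predict(self):
        return self.state >= 2      # True means "predict taken"

    def update(self, taken):
        # Saturate at the ends so one anomaly can't flip a strong prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True, True, False, True, True, True]:   # loop-like branch behavior
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "of 6 predicted correctly")   # → 5 of 6 predicted correctly
```

The single not-taken loop exit costs one misprediction but does not flip the counter, so the next loop iteration is still predicted correctly.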

Question 10. What do we mean when we say a computer is superscalar?

• A superscalar computer executes more than one instruction per clock tick. This is achieved by having more than one pipeline and allowing instructions without dependencies on one another to proceed in parallel through the separate pipelines.

Question 11. What problem is speculative execution trying to solve?

• Speculative execution is another strategy used to reduce the effects of conditional branches. Rather than guessing which way a branch will go and fetching instructions only along one path, we proceed to fetch, decode, and begin execution of instructions along both paths. Results from both instruction streams are tentative until we know which way the branch goes. When the outcome of the branch is known, the tentative results from the path not taken are discarded and the results from the path taken are made permanent.