a hazard is created whenever there is a

8/8/2019 A Hazard is Created Whenever There is A


A hazard is created whenever there is a dependence between instructions and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all been with register operands, but it is also possible to create a dependence by writing and reading the same memory location. In the DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising.

All the data hazards discussed here involve registers within the CPU. By convention, the

hazards are named by the ordering in the program that must be preserved by the

pipeline.

RAW (read after write)

WAW (write after write)

WAR (write after read)

Consider two instructions i and j, with i occurring before j. The possible data hazards are:

RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets the old value.

This is the most common type of hazard and the kind that we use forwarding to overcome.

WAW (write after write) - j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.

This hazard is present only in pipelines that write in more than one pipe stage or allow an

instruction to proceed even when a previous instruction is stalled. The DLX integer

pipeline writes a register only in WB and avoids this class of hazards.

WAW hazards would be possible if we made the following two changes to the DLX

pipeline:

1. Move write back for an ALU operation into the MEM stage, since the data value is available by then.

2. Suppose that the data memory access took two pipe stages.

Here is a sequence of two instructions showing the execution in this revised pipeline,

highlighting the pipe stage that writes the result:

LW  R1, 0(R2)     IF    ID    EX    MEM1  MEM2  WB
ADD R1, R2, R3          IF    ID    EX    WB





Unless this hazard is avoided, execution of this sequence on this revised pipeline will leave the result of the first write (the LW) in R1, rather than the result of the ADD.

Allowing writes in different pipe stages introduces other problems, since two instructions can try to write during the same clock cycle. The DLX FP pipeline, which has both writes in different stages and different pipeline lengths, will deal with both write conflicts and WAW hazards in detail.
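The write-order problem in the LW/ADD sequence above can be checked with a few lines of Python. This is only a sketch under simple assumptions that are not part of the original text: each instruction enters IF one cycle after the previous one, never stalls, and writes its destination in its last listed stage. The function name final_register_value is made up for the illustration.

```python
def final_register_value(program):
    """Return the name of the instruction whose write is performed last.

    program: list of (name, stages) in program order; all instructions are
    assumed to write the same register in their final stage.
    """
    writes = []
    for issue_index, (name, stages) in enumerate(program):
        # Instruction 0 is in IF during cycle 1, so its last stage is cycle len(stages).
        write_cycle = issue_index + len(stages)
        writes.append((write_cycle, name))
    writes.sort()          # the write in the latest cycle overwrites the others
    return writes[-1][1]

revised = [
    ("LW",  ["IF", "ID", "EX", "MEM1", "MEM2", "WB"]),
    ("ADD", ["IF", "ID", "EX", "WB"]),
]
standard = [
    ("LW",  ["IF", "ID", "EX", "MEM", "WB"]),
    ("ADD", ["IF", "ID", "EX", "MEM", "WB"]),
]

print(final_register_value(revised))   # LW  (WAW hazard: the older write lands last)
print(final_register_value(standard))  # ADD (writes stay in program order)
```

In the revised pipeline the LW writes in cycle 6 and the ADD in cycle 5, so R1 ends up holding the loaded value; in the standard five-stage pipeline every write happens in WB and program order is preserved.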

WAR (write after read) - j tries to write a destination before it is read by i, so i incorrectly gets the new value.

This cannot happen in our example pipeline because all reads are early (in ID) and all writes are late (in WB). This hazard occurs when there are some instructions that write results early in the instruction pipeline, and other instructions that read a source late in the pipeline.

Because of the natural structure of a pipeline, which typically reads values before it writes results, such hazards are rare. Pipelines for complex instruction sets that support autoincrement addressing and require operands to be read late in the pipeline could create a WAR hazard.

If we modified the DLX pipeline as in the above example and also read some operands late, such as the source value for a store instruction, a WAR hazard could occur. Here is the pipeline timing for such a potential hazard, highlighting the stage where the conflict occurs:

SW  R1, 0(R2)     IF    ID    EX    MEM1  MEM2  WB
ADD R2, R3, R4          IF    ID    EX    WB

If the SW reads R2 during the second half of its MEM2 stage and the ADD writes R2 during the first half of its WB stage, the SW will incorrectly read and store the value produced by the ADD.

RAR (read after read) - this case is not a hazard :).

Unfortunately, not all potential hazards can be handled by forwarding.

Consider the following sequence of instructions:

                  1      2      3      4      5      6      7      8
LW  R1, 0(R1)     IF     ID     EX     MEM    WB
SUB R4, R1, R5           IF     ID     EXsub  MEM    WB
AND R6, R1, R7                  IF     ID     EXand  MEM    WB
OR  R8, R1, R9                         IF     ID     EX     MEM    WB

The LW instruction does not have the data until the end of clock cycle 4 (MEM), while the SUB instruction needs to have the data by the beginning of that clock cycle (EXsub).

For the AND instruction we can forward the result immediately to the ALU (EXand) from the MEM/WB register (MEM).

The OR instruction has no problem, since it receives the value through the register file (ID). In clock cycle 5, the WB of the LW instruction occurs "early" in the first half of the cycle and the register read of the OR instruction occurs "late" in the second half of the cycle.

For the SUB instruction, the forwarded result would arrive too late: at the end of a clock cycle, when it is needed at the beginning.

The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.
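The condition such an interlock looks for can be sketched in Python. This is a simplified illustration, not the actual DLX control logic: the Instr record and the load_use_stall function are hypothetical names, and the check assumes full forwarding, so the only case that still needs a stall is a load immediately followed by an instruction that uses the loaded register in EX.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest sources")

def load_use_stall(previous, current):
    """True when `current` needs, in its EX stage, a register that the
    immediately preceding load has not yet fetched from memory."""
    return previous.op == "LW" and previous.dest in current.sources

lw = Instr("LW", "R1", ("R1",))
sub = Instr("SUB", "R4", ("R1", "R5"))
and_ = Instr("AND", "R6", ("R1", "R7"))

print(load_use_stall(lw, sub))    # True: SUB directly follows the load of R1
print(load_use_stall(sub, and_))  # False: SUB is not a load
```

Only the instruction directly behind the load is checked, because an instruction two slots behind (the AND in the example) can already receive the loaded value by forwarding.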

The pipeline with a stall and the legal forwarding is:

                  1      2      3      4      5      6      7      8      9
LW  R1, 0(R1)     IF     ID     EX     MEM    WB
SUB R4, R1, R5           IF     ID     stall  EXsub  MEM    WB
AND R6, R1, R7                  IF     stall  ID     EX     MEM    WB
OR  R8, R1, R9                         stall  IF     ID     EX     MEM    WB

The only necessary forwarding is done for R1 from MEM to EXsub. Notice that there is no need to forward R1 for the AND instruction, because it now gets the value through the register file in ID (as OR above).

There are techniques to reduce the number of stalls even in this case, which we consider next.

Generate DLX code that avoids pipeline stalls for the following sequence of statements:

a = b + c ;

d = a - f ;

e = g - h ;






Assume that all variables are 32-bit integers. Wherever necessary, explicitly explain the

actions that are needed to avoid pipeline stalls in your scheduled code.

Solution:

The DLX assembly code for the given sequence of statements is:

Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18
LW  Rb, b         IF    ID    EX    M     WB
LW  Rc, c               IF    ID    EX    M     WB
Add Ra, Rb, Rc                IF    ID    stall EX    M     WB
SW  Ra, a                           IF    stall ID    EX    M     WB
LW  Rf, f                                 stall IF    ID    EX    M     WB
Sub Rd, Ra, Rf                                        IF    ID    stall EX    M     WB
SW  Rd, d                                                   IF    stall ID    EX    M     WB
LW  Rg, g                                                         stall IF    ID    EX    M     WB
LW  Rh, h                                                                     IF    ID    EX    M     WB
Sub Re, Rg, Rh                                                                      IF    ID    stall EX    M     WB
SW  Re, e                                                                                 IF    stall ID    EX    M     WB

Running this code segment will need some forwarding. But the LW and ALU (Add or Sub) instructions, when put in sequence, generate hazards for the pipeline that cannot be resolved by forwarding, so the pipeline will stall. Observe that in time steps 4, 5, and 6, there are two forwards from the Data memory unit to the ALU in the EX stage of the Add instruction. The same happens in time steps 13, 14, and 15. The hardware to implement this forwarding will need two Load Memory Data registers to store the output of data memory. Note that for the SW instructions, the register value is needed at the input of Data memory. The better solution with compiler assist is given below.

Rather than just allow the pipeline to stall, the compiler could try to schedule the pipeline to avoid these stalls by rearranging the code sequence to eliminate the hazards.

The suggested version is (the problem actually has more than one solution):

Instruction       1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
LW  Rb, b         IF    ID    EX    M     WB
LW  Rc, c               IF    ID    EX    M     WB
LW  Rf, f                     IF    ID    EX    M     WB
Add Ra, Rb, Rc                      IF    ID    EX    M     WB
SW  Ra, a                                 IF    ID    EX    M     WB
Sub Rd, Ra, Rf                                  IF    ID    EX    M     WB
LW  Rg, g                                             IF    ID    EX    M     WB
LW  Rh, h                                                   IF    ID    EX    M     WB
SW  Rd, d                                                         IF    ID    EX    M     WB
Sub Re, Rg, Rh                                                          IF    ID    EX    M     WB
SW  Re, e                                                                     IF    ID    EX    M     WB

Explanation:

Add Ra, Rb, Rc: Rb read in second half of ID; Rc forwarded
SW  Ra, a: Ra forwarded
Sub Rd, Ra, Rf: Rf read in second half of ID; Ra forwarded
SW  Rd, d: Rd read in second half of ID
Sub Re, Rg, Rh: Rg read in second half of ID; Rh forwarded
SW  Re, e: Re forwarded

The same color is used to outline the source and destination of forwarding.

The blue color is used to indicate the technique to perform the register file reads in the second half of a cycle, and the writes in the first half.



Note: Notice that the use of different registers for the first, second and third statements

was critical for this schedule to be legal! In general, pipeline scheduling can increase the

register count required.
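The cycle counts of the two versions (18 and 15) can be reproduced with a small Python sketch. It rests on assumptions chosen to match this exercise rather than on a general pipeline model: full forwarding is available, the only remaining stall is a load immediately followed by an instruction that needs the loaded register in EX, and the value stored by SW is needed only at the data memory input, so it is not listed as an EX source. The tuple layout and the name count_cycles are hypothetical.

```python
def count_cycles(program, stages=5):
    """program: list of (op, dest, ex_sources) in program order.
    Returns (stalls, total_cycles) for a simple in-order pipeline."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, ex_sources = current
        if prev_op == "LW" and prev_dest in ex_sources:
            stalls += 1          # load-use hazard: one bubble
    return stalls, len(program) + (stages - 1) + stalls

unscheduled = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ()),
    ("LW", "Rf", ()), ("SUB", "Rd", ("Ra", "Rf")), ("SW", None, ()),
    ("LW", "Rg", ()), ("LW", "Rh", ()), ("SUB", "Re", ("Rg", "Rh")), ("SW", None, ()),
]
scheduled = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Rf", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ()), ("SUB", "Rd", ("Ra", "Rf")),
    ("LW", "Rg", ()), ("LW", "Rh", ()), ("SW", None, ()),
    ("SUB", "Re", ("Rg", "Rh")), ("SW", None, ()),
]

print(count_cycles(unscheduled))  # (3, 18)
print(count_cycles(scheduled))    # (0, 15)
```

Both versions execute the same eleven instructions; the scheduled one simply never places an ALU instruction directly behind the load that produces one of its operands.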

Control hazards can cause a greater performance loss for the DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken.

If instruction i is a taken branch, then the PC is normally not changed until the end of

MEM stage, after the completion of the address calculation and comparison (see

diagram).

The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the MEM stage, which determines the new PC. The

pipeline behavior looks like :

Branch              IF        ID        EX        MEM       WB
Branch successor              IF(stall) stall     stall     IF        ID        EX        MEM       WB
Branch successor+1                                                    IF        ID        EX        MEM       WB

The stall does not occur until after the ID stage (where we know that the instruction is a branch).

This control hazard stall must be implemented differently from a data hazard stall, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it never performs useful work), which brings the total to 3 stalls.

Three clock cycles wasted for every branch is a significant loss. With a 30% branch

frequency and an ideal CPI of 1, the machine with branch stalls achieves only half the

ideal speedup from pipelining!
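The "only half" figure follows from a one-line CPI calculation, sketched here in Python. The function name cpi_with_branch_stalls is made up for the illustration, and the calculation assumes that branches are the only source of stalls.

```python
def cpi_with_branch_stalls(ideal_cpi, branch_frequency, stall_cycles):
    """Effective CPI when every branch costs `stall_cycles` extra cycles."""
    return ideal_cpi + branch_frequency * stall_cycles

cpi = cpi_with_branch_stalls(1.0, 0.30, 3)
fraction_of_ideal = 1.0 / cpi   # speedup relative to the stall-free pipeline

print(f"CPI = {cpi:.2f}, fraction of ideal speedup = {fraction_of_ideal:.2f}")
# CPI = 1.90, fraction of ideal speedup = 0.53
```

A CPI of 1.9 instead of 1 means the machine delivers roughly 53% of the ideal pipelined performance, which is the "half" quoted above.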

The number of clock cycles in a branch stall can be reduced by two steps:

1. Find out whether the branch is taken or not taken earlier in the pipeline;

2. Compute the taken PC (i.e., the address of the branch target) earlier.

Both steps should be taken as early in the pipeline as possible.

By moving the zero test into the ID stage, it is possible to know if the branch is taken at the end of the ID cycle. Computing the branch target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX.







The revised datapath :

Data Hazards

A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

Consider the pipelined execution of these instructions:

                  1      2      3      4      5      6      7      8      9
ADD R1, R2, R3    IF     ID     EX     MEM    WB
SUB R4, R5, R1           IF     IDsub  EX     MEM    WB
AND R6, R1, R7                  IF     IDand  EX     MEM    WB
OR  R8, R1, R9                         IF     IDor   EX     MEM    WB
XOR R10,R1,R11                                IF     IDxor  EX     MEM    WB

All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during the ID stage (IDsub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.

The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5 (shown black). Thus, the AND instruction, which reads the registers during cycle 4 (IDand), will receive the wrong result.

The OR instruction can be made to operate without incurring a hazard by a simple implementation technique. The technique is to perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and IDor for OR are performed in cycle 5, the write to the register file by ADD is performed in the first half of the cycle, and the read of registers by OR is performed in the second half of the cycle.

The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
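The cycle arithmetic behind these four cases can be written as a short Python sketch. It assumes an ideal five-stage pipeline with no forwarding and no stalls, in which the producer enters IF in cycle 1; the function name reads_stale_value and its parameters are hypothetical.

```python
def reads_stale_value(distance, stages=5, split_cycle_regfile=True):
    """distance: how many instructions after the producer the consumer is
    issued (1 = the very next instruction)."""
    producer_wb = stages            # producer is in IF in cycle 1, WB in cycle 5
    consumer_id = distance + 2      # IF in cycle 1 + distance, ID one cycle later
    if split_cycle_regfile:
        # Write in the first half, read in the second half of the same cycle.
        return consumer_id < producer_wb
    return consumer_id <= producer_wb

for name, distance in [("SUB", 1), ("AND", 2), ("OR", 3), ("XOR", 4)]:
    print(name, reads_stale_value(distance))
# SUB True, AND True, OR False, XOR False
```

Passing split_cycle_regfile=False shows why the split-cycle register file matters: without it the OR instruction would also read the old value of R1.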



The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions.

We will also classify the data hazards and consider the cases when stalls cannot be eliminated. We will see what the compiler can do to schedule the pipeline to avoid stalls.

Hazard (computer architecture)

From Wikipedia, the free encyclopedia


Hazards are problems with the instruction pipeline in central processing unit (CPU)

microarchitectures that potentially result in incorrect computation. There are typically

three types of hazards:

• data hazards

• structural hazards

• control hazards (branching hazards)

There are several methods used to deal with hazards, including pipeline stalls (pipeline bubbling), register forwarding, and in the case of out-of-order execution, the scoreboarding method and the Tomasulo algorithm.










Background

Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being processed in the various stages of the pipeline, such as fetch and execute. There are many different instruction pipeline microarchitectures, and instructions may be executed out-of-order. A hazard occurs when two or more of these simultaneous (possibly out of order) instructions conflict.

Types

Data hazards

Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (sometimes known as race hazards). There are three situations in which a data hazard can occur:

1. read after write (RAW), a true dependency




2. write after read (WAR)

3. write after write (WAW)

Consider two instructions i and j, with i occurring before j in program order.

Read After Write (RAW)

(j tries to read a source before i writes to it) A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. This can occur because even though an instruction is executed after a previous instruction, the previous instruction has not been completely processed through the pipeline.

Example

For example:

i1. R2 <- R1 + R3
i2. R4 <- R2 + R3

The first instruction is calculating a value to be saved in register 2, and the second is going to use this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the 2nd operation, the results from the first will not yet have been saved, and hence we have a data dependency.

We say that there is a data dependency with instruction 2, as it is dependent on the completion of instruction 1.

Write After Read (WAR)

(j tries to write a destination before it is read by i) A write after read (WAR) data hazard represents a problem with concurrent execution.

Example

For example:

i1. R4 <- R1 + R3
i2. R3 <- R1 + R2

If there is a chance that i2 may be completed before i1 (i.e. with concurrent execution), we must ensure that we do not store the result in register 3 before i1 has had a chance to fetch its operands.













Write After Write (WAW)

(j tries to write an operand before it is written by i) A write after write (WAW) data hazard may occur in a concurrent execution environment.

Example

For example:

i1. R2 <- R1 + R2
i2. R2 <- R4 + R7

We must delay the WB (Write Back) of i2 until i1 has finished executing.
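The three definitions can be turned into a small dependence check, sketched here in Python. The representation of an instruction as a (destination, sources) pair and the name classify_hazards are assumptions made for this illustration. The function reports which orderings must be preserved between i and j; whether a given pipeline can actually violate them depends on its structure.

```python
def classify_hazards(i, j):
    """i, j: (dest, sources) pairs, with i earlier in program order.
    Returns the set of orderings that must be preserved between them."""
    i_dest, i_sources = i
    j_dest, j_sources = j
    hazards = set()
    if i_dest in j_sources:
        hazards.add("RAW")   # j reads what i writes
    if j_dest in i_sources:
        hazards.add("WAR")   # j writes what i reads
    if i_dest == j_dest:
        hazards.add("WAW")   # both write the same register
    return hazards

# The three example pairs from the text.
print(classify_hazards(("R2", ("R1", "R3")), ("R4", ("R2", "R3"))))  # {'RAW'}
print(classify_hazards(("R4", ("R1", "R3")), ("R3", ("R1", "R2"))))  # {'WAR'}
# i1 also reads R2, so the WAW example carries a WAR dependence as well.
print(sorted(classify_hazards(("R2", ("R1", "R2")), ("R2", ("R4", "R7")))))  # ['WAR', 'WAW']
```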

Structural hazards

A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.[1] Structural hazards can often be resolved by separating the component into orthogonal units (such as separate caches) or by bubbling the pipeline.

Control hazards (branch hazards)

Branching hazards (also known as control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (normally the fetch stage).

Eliminating hazards

Generic

Pipeline bubbling

Bubbling the pipeline, also known as a pipeline break or a pipeline stall, is a method for preventing data, structural, and branch hazards from occurring. As instructions are fetched, control logic determines whether a hazard could/will occur. If this is true, then the control logic inserts NOPs into the pipeline. Thus, before the next instruction (which would cause the hazard) is executed, the previous one will have had sufficient time to complete and prevent the hazard. If the number of NOPs is equal to the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards. This is called flushing the pipeline. All forms of stalling introduce a delay before the processor can resume execution.





Data hazards

There are several main solutions and algorithms used to resolve data hazards:

• insert a pipeline bubble whenever a read after write (RAW) dependency is encountered, guaranteed to increase latency, or

• utilize out-of-order execution to potentially prevent the need for pipeline bubbles

• utilize register forwarding to use data from later stages in the pipeline

In the case of out-of-order execution, the algorithm used can be:

• scoreboarding, in which case a pipeline bubble will only be needed when there is no functional unit available

• the Tomasulo algorithm, which utilizes register renaming allowing the continual issuing of instructions

We can delegate the task of removing data dependencies to the compiler, which can fill in an appropriate number of NOP instructions between dependent instructions to ensure correct operation, or re-order instructions where possible.
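The NOP-filling option can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of any particular compiler: instructions are (destination, sources) pairs, the function name insert_nops is hypothetical, and the default min_distance of 3 assumes a five-stage pipeline without forwarding whose register file is written in the first half of a cycle and read in the second half.

```python
NOP = (None, ())

def insert_nops(program, min_distance=3):
    """Pad `program` with NOPs so that every consumer sits at least
    `min_distance` slots after the most recent writer of each source."""
    scheduled = []
    last_write = {}   # register -> index in `scheduled` of its latest writer
    for dest, sources in program:
        for register in sources:
            if register in last_write:
                while len(scheduled) - last_write[register] < min_distance:
                    scheduled.append(NOP)
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append((dest, sources))
    return scheduled

dependent = [("R2", ("R1", "R3")), ("R4", ("R2", "R3"))]
padded = insert_nops(dependent)
print(len(padded), padded.count(NOP))  # 4 2
```

Two NOPs are needed between the back-to-back dependent instructions; a reordering compiler would try to fill those slots with independent instructions instead.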

Register forwarding

Forwarding is implemented by feeding back the output of an instruction into the previous stage(s) of the pipeline as soon as the output of that instruction is available.

Example

NOTE: In the following examples, computed values are in bold, while Register numbers are not.

For instance, let's say we want to write the value 3 to register 1 (which already contains a 6), and then add 7 to register 1 and store the result in register 2, i.e.:

Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 10

Following execution, register 2 should contain the value 10. However, if Instruction 1 (write 3 to register 1) does not completely exit the pipeline before Instruction 2 starts execution, Register 1 does not contain the value 3 when Instruction 2 performs its addition. In such an event, Instruction 2 adds 7 to the old value of register 1 (6), and so register 2 would contain 13 instead, i.e.:

Instruction 0: Register 1 = 6
Instruction 2: Register 2 = Register 1 + 7 = 13
Instruction 1: Register 1 = 3






















This error occurs because Instruction 2 reads Register 1 before Instruction 1 has committed/stored the result of its write operation to Register 1. So when Instruction 2 is reading the contents of Register 1, register 1 still contains 6, not 3.

Forwarding helps correct such errors by depending on the fact that the output of Instruction 1 (which is 3) can be used by subsequent instructions before the value 3 is committed to/stored in Register 1.

Forwarding applied to our example means that we do not wait to commit/store the output of Instruction 1 in Register 1 (in this example, the output is 3) before making that output available to the subsequent instruction (in this case, Instruction 2). The effect is that Instruction 2 uses the correct (the more recent) value of Register 1, as if the commit/store had been made immediately rather than pipelined.

With forwarding enabled, the ID/EX or Instruction Decode/Execution stage of the pipeline now has two inputs: the value read from the register specified (in this example, the value 6 from Register 1), and the new value of Register 1 (in this example, this value is 3), which is sent from the next stage (EX/MEM) or Instruction Execute/Memory Access. Additional control logic is used to determine which input to use.
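The selection between the two inputs can be sketched in Python using the numbers from the example. The function select_operand and its parameters are hypothetical names chosen for this illustration; real forwarding logic compares register numbers held in pipeline registers, which is only approximated here.

```python
def select_operand(register, register_file, forward_register, forward_value):
    """Forwarding multiplexer: prefer the in-flight result when it targets
    the register being read, otherwise use the register file."""
    if forward_register == register:
        return forward_value
    return register_file[register]

register_file = {1: 6, 2: 0}   # Register 1 still holds the old value 6

# Instruction 1 (Register 1 = 3) has produced 3 but has not written it back yet.
without_forwarding = register_file[1] + 7
with_forwarding = select_operand(1, register_file, forward_register=1, forward_value=3) + 7

print(without_forwarding, with_forwarding)  # 13 10
```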

Control hazards (branch hazards)

To avoid control hazards microarchitectures can:

• insert a pipeline bubble (discussed above), guaranteed to increase latency, or

• use branch prediction and essentially guesstimate which instructions to insert, in which case a pipeline bubble will only be needed in the case of an incorrect prediction

In the event that a branch causes a pipeline bubble after incorrect instructions have entered the pipeline, care must be taken to prevent any of the wrongly-loaded instructions from having any effect on the processor state, excluding energy wasted processing them before they were discovered to be loaded incorrectly.

Tomasulo algorithm

The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially (out-of-order execution). It was first implemented for the IBM System/360 Model 91’s floating point unit.










This algorithm differs from scoreboarding in that it utilizes register renaming. Where scoreboarding resolves Write-after-Write (WAW) and Write-after-Read (WAR) hazards by stalling, register renaming allows the continual issuing of instructions. The Tomasulo algorithm also uses a common data bus (CDB) on which computed values are broadcast to all the reservation stations that may need them. This allows for improved parallel execution of instructions which may otherwise stall under the use of scoreboarding.

Robert Tomasulo received the Eckert-Mauchly Award in 1997 for this algorithm.


Implementation concepts

The following are the concepts necessary to the implementation of Tomasulo's Algorithm.

• Instructions are issued sequentially so that the effects of a sequence of instructions, such as exceptions raised by these instructions, occur in the same order as they would in a non-pipelined processor, regardless of the fact that they are being executed non-sequentially.

• All general-purpose and reservation station registers hold either real or virtual values. If a real value is unavailable to a destination register during the issue stage, a virtual value is initially used. The functional unit that is computing the real value is assigned as the virtual value. The virtual register values are converted to real values as soon as the designated functional unit completes its computation.

• Functional units use reservation stations with multiple slots. Each slot holds information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.











[ edit ] Instruction lifecycle

The three stages listed below are the stages through which each instruction passes from

the time it is issued to the time its execution is complete.

[edit] Stage 1: issue

In the issue stage, instructions are issued for execution if all operands and reservation

stations are ready or else they are stalled. Registers are renamed in this step, eliminating

WAR and WAW hazards.

• Retrieve the next instruction from the head of the instruction queue. If the instruction operands are currently in the registers

o If there is a matching empty reservation station (i.e., functional unit is available) then: issue the instruction

o Else, there is not a matching empty reservation station (i.e., functional unit is not available) then: stall the instruction until a station or buffer is free

• Else, the operands are not in the registers, then: use virtual values, the functional unit calculating the real value, to keep track of the functional units that will produce the operand

Stage 2: execute

In the execute stage, the instruction operations are carried out. Instructions are delayed in this step until all of their operands are available, eliminating RAW hazards. Program

correctness is maintained through effective address calculation to prevent hazards

through memory.

1. If one or more of the operands is not yet available then: wait for operand to become available on the CDB.

2. When all operands are available, then: if the instruction is a load or store

o Compute the effective address when the base register is available, and place it in the load/store buffer

o If the instruction is a load then: execute as soon as the memory unit is available

o Else, if the instruction is a store then: wait for the value to be stored before sending it to the memory unit

Else, the instruction is an ALU operation then: execute the instruction at the corresponding functional unit

Stage 3: write result

In the write result stage, the results of ALU operations are written back to registers and store operations are written back to memory.











• If the instruction was an ALU operation

o If the result is available, then: write it on the CDB and from there into the registers and any reservation stations waiting for this result

• Else, if the instruction was a store then: write the data to memory during this step
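The issue and write-result stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Model 91's implementation: the `ReservationStation` class, its `vj`/`vk` (real operand values) and `qj`/`qk` (tags of the stations that will produce a value) fields, and the `issue` and `broadcast` helpers are names chosen here for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                  # tag of this station, e.g. "Add1"
    op: str                    # operation to perform
    vj: Optional[int] = None   # first operand value, once it is real
    vk: Optional[int] = None   # second operand value, once it is real
    qj: Optional[str] = None   # tag of the station producing the first operand
    qk: Optional[str] = None   # tag of the station producing the second operand

    def ready(self) -> bool:
        # Execution may begin only when no operand is still virtual.
        return self.qj is None and self.qk is None

def issue(name, op, dest, src1, src2, regs, reg_status):
    """Stage 1: copy real operands, or record the producer's tag (virtual value)."""
    rs = ReservationStation(name, op)
    for src, v, q in ((src1, "vj", "qj"), (src2, "vk", "qk")):
        if src in reg_status:              # value is still being computed
            setattr(rs, q, reg_status[src])
        else:                              # value is in the register file
            setattr(rs, v, regs[src])
    reg_status[dest] = name                # dest will be written by this station
    return rs

def broadcast(tag, value, stations, regs, reg_status):
    """Stage 3: place (tag, value) on the CDB; waiting stations and registers capture it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]

regs = {"F0": 0, "F2": 3, "F4": 4, "F6": 0}
reg_status = {}
add1 = issue("Add1", "+", "F0", "F2", "F4", regs, reg_status)   # F0 = F2 + F4
mul1 = issue("Mul1", "*", "F6", "F0", "F2", regs, reg_status)   # F6 = F0 * F2
mul1_was_waiting = not mul1.ready()    # F0 is still a virtual value tagged "Add1"
broadcast("Add1", add1.vj + add1.vk, [add1, mul1], regs, reg_status)
```

After the broadcast, `Mul1` holds both operands as real values and can begin execution. Load/store buffers and structural stalls are left out of this sketch.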


Register renaming



In computer architecture, register renaming refers to a technique used to avoid

unnecessary serialization of program operations imposed by the reuse of registers by

those operations.

Contents

• 1 Problem definition

• 2 Data hazards

• 3 Architectural vs physical registers

• 4 Details: tag-indexed register file

• 5 Details: reservation stations

• 6 Comparison between the schemes

• 7 History

• 8 References

Problem definition

Programs are composed of instructions which operate on values. The instructions must

name these values in order to distinguish them from one another. A typical instruction

might say, add X and Y and put the result in Z. In this instruction, X, Y, and Z are the names of storage locations.

In order to have a compact instruction encoding, most processor instruction sets have a

small set of special locations which can be directly named. For example, the x86

instruction set architecture has 8 integer registers, x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors, the names of these locations correspond

directly to elements of a register file.







Different instructions may take different amounts of time (e.g., CISC architecture). For

instance, a processor may be able to execute hundreds of instructions while a single load

from main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, thus the instructions are finishing out of the original program

order. Out-of-order execution has been used in most recent high-performance CPUs to

achieve some of their speed gains.

Consider this piece of code running on an out-of-order CPU:

1. R1=M[1024]

2. R1=R1+2

3. M[1032]=R1

4. R1=M[2048]

5. R1=R1+4

6. M[2056]=R1

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor

cannot finish 4 until 3 is done, because 3 would then write the wrong value.

We can eliminate this restriction by changing the names of some of the registers:

1. R1=M[1024] 4. R2=M[2048]

2. R1=R1+2 5. R2=R2+4

3. M[1032]=R1 6. M[2056]=R2

Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so that the program can be executed faster.

When possible, the compiler performs this renaming. The compiler is constrained in

many ways, primarily by the finite number of register names in the instruction set. Many

high performance CPUs have more physical registers than may be named directly in the instruction set, so they rename registers in hardware to achieve additional parallelism.
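The renaming of the six-instruction example can be sketched in Python. This is an illustration only: the `(dest, sources)` tuple encoding, the `rename` function, and the `P1`, `P2`, ... names are assumptions made for the example. Unlike the hand-renamed listing above, which only changes R1 to R2 in instructions 4 to 6, the sketch gives every register write a fresh name, which is closer to what hardware renaming does.

```python
from itertools import count

def rename(program):
    """Give every register write a fresh name; reads use the latest name."""
    fresh = (f"P{n}" for n in count(1))   # hypothetical physical register names
    current = {}                          # architectural name -> current physical name
    renamed = []
    for dest, srcs in program:
        # Sources are looked up before the destination is remapped,
        # so "R1 = R1 + 2" reads the old name and writes a new one.
        new_srcs = [current.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("R1", []),        # 1. R1 = M[1024]
    ("R1", ["R1"]),    # 2. R1 = R1 + 2
    (None, ["R1"]),    # 3. M[1032] = R1
    ("R1", []),        # 4. R1 = M[2048]
    ("R1", ["R1"]),    # 5. R1 = R1 + 4
    (None, ["R1"]),    # 6. M[2056] = R1
]
renamed = rename(program)

def registers_used(instrs):
    """Collect every register name read or written by a list of instructions."""
    used = set()
    for dest, srcs in instrs:
        used.update(srcs)
        if dest is not None:
            used.add(dest)
    return used

first_group = registers_used(renamed[:3])
second_group = registers_used(renamed[3:])
```

After renaming, the two groups of three instructions share no register names, so only the true dependences inside each group remain.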

Data hazards

When more than one instruction references a particular location for an operand, either

reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of data hazards:

Read-after-write (RAW)

A read from a register or memory location must return the value placed there by

the last write in program order, not some other write. This is referred to as a true












dependency or flow dependency, and requires the instructions to execute in

program order.

Write-after-write (WAW)

Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.

Write-after-read (WAR)

A read from a register or memory location must return the last prior value written to that

location, and not one written programmatically after the read. This is the sort of false

dependency that can be resolved by renaming. WAR dependencies are also known as

anti-dependencies.

Instead of delaying the write until all reads are completed, two copies of the location can

be maintained, the old value and the new value. Reads that precede, in program order, the

write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and

additional opportunities for out-of-order execution are created. When all reads needing the old value have been satisfied, it can be discarded. This is the essential concept behind

register renaming.

Anything that is read and written can be renamed. While the general-purpose and

floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.

Memory locations can also be renamed, although it is not commonly done to the extent

practised in register renaming. The Transmeta Crusoe processor's gated store buffer is a form of memory renaming.

If programs refrained from reusing registers immediately, there would be no need for

register renaming. Some instruction sets (e.g., IA-64) specify very large numbers of

registers for specifically this reason. There are limitations to this approach:

• It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code in a process called loop unrolling (but see register rotation).

• Large numbers of registers require lots of bits to specify those registers, making the code size increase.

• Many instruction sets historically specified smaller numbers of registers and cannot be changed now.

Code size increases are important because when the program code is larger, the

instruction cache misses more often and the processor stalls waiting for new instructions.




Architectural vs physical registers

Machine language programs specify reads and writes to a limited set of registers

specified by the instruction set architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide.

These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a

programmer stops the program in a debugger, she or he can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.

One particular processor which implements this ISA, the Alpha 21264, has 80 integer and

72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations

which can store the results of floating point operations. (In fact, there are even more

locations than that, but those extra locations are not germane to the register renaming

operation.)

Below are described two styles of register renaming, distinguished by the circuit which

holds data ready for an execution unit.

In all renaming schemes, the machine converts the architectural registers referenced in

the instruction stream into tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually a 6 to 8 bit number. The rename file must have a read port

for every input of every instruction renamed every cycle, and a write port for every

output of every instruction renamed every cycle. Because the size of a register file generally grows as the square of the number of ports, the rename file is usually physically

large and consumes significant power.

In the tag-indexed register file style, there is one large register file for data values,

containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are

unused.
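The tag-width arithmetic in this example can be checked with a short Python sketch. The helper names `tag_bits` and `unused_tags` are hypothetical and exist only for this illustration.

```python
def tag_bits(physical_registers: int) -> int:
    """Smallest number of bits that can name every physical register."""
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, computed without floats.
    return max(1, (physical_registers - 1).bit_length())

def unused_tags(physical_registers: int) -> int:
    """Tag values that exist in the encoding but name no register."""
    return 2 ** tag_bits(physical_registers) - physical_registers

bits_for_80 = tag_bits(80)        # 7 bits give 128 possible tag values
unused_for_80 = unused_tags(80)   # 128 - 80 = 48
```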

In this style, when an instruction is issued to an execution unit, the tags of the source

registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.

In the reservation station style, there are many small associative register files, usually

one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.

In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.

Architectural Register File or Retirement Register File (RRF)






The committed register state of the machine. RAM indexed by logical register number. Typically written into as results are retired or committed out of a reorder buffer.

Future File

The most speculative register state of the machine. RAM indexed by logical register number.

Active Register File

The Intel P6 group's term for Future File.

History Buffer

Typically used in combination with a future file. Contains the "old" values of registers that have been overwritten. If the producer is still in flight it may be RAM indexed by history buffer number. After a branch misprediction, the machine must use results from the history buffer: either they are copied, or the future file lookup is disabled and the history buffer is CAM indexed by logical register number.

Reorder Buffer (ROB)

Pretty much any structure that is sequentially (circularly) indexed on a per-operation basis, for instructions in flight. It differs from a history buffer in that the reorder buffer typically comes after the future file (if it exists) and before the architectural

register file.

Reorder buffers come in data-less and data-ful versions.

In Willamette's ROB, the ROB entries point to registers in the physical register file (PRF), and also contain other bookkeeping. This was also the first OOO design done by

Andy Glew, at Illinois with HaRRM.

In P6's ROB, the ROB entries contain data; there is no separate PRF. Data values are copied from the ROB to the RRF at retirement.

One small detail: if there is temporal locality in ROB entries (i.e., if instructions close together in the von Neumann instruction sequence write back close together in time), it may be possible to perform write combining on ROB entries and so have fewer ports than a separate ROB/PRF would. It's not clear if it makes a difference, since a PRF should be banked.

ROBs usually don't have associative logic, and certainly none of the ROBs designed by

Andy Glew have CAMs. Keith Diefendorff insisted that ROBs have complex associative

logic for many years. The first ROB proposal may have had CAMs.

Details: tag-indexed register file

This is the renaming style used in the MIPS R10000, the Alpha 21264, and in the FP

section of the AMD Athlon.






In the renaming stage, every architectural register referenced (for read or write) is looked

up in an architecturally-indexed remap file. This file returns a tag and a ready bit. The

tag is non-ready if there is a queued instruction which will write to it that has not yet executed. For read operands, this tag takes the place of the architectural register in the

instruction. For every register write, a new tag is pulled from a free tag FIFO, and a new

mapping is written into the remap file, so that future instructions reading the architectural register will refer to this new tag. The tag is marked as unready, because the instruction

has not yet executed. The previous physical register allocated for that architectural

register is saved with the instruction in the reorder buffer, which is a FIFO that holds the instructions in program order between the decode and graduation stages.

The instructions are then placed in various issue queues.

As instructions are executed, the tags for their results are broadcast, and the issue queues

match these tags against the tags of their non-ready source operands. A match means that

the operand is ready. The remap file also matches these tags, so that it can mark the

corresponding physical registers as ready.

When all the operands to an instruction in an issue queue are ready, that instruction is

ready to issue. The issue queues pick ready instructions to send to the various functional

units each cycle. Non-ready instructions stay in the issue queues. This unordered removal of instructions from the issue queues is one of the things that makes them large and use

lots of power.

Issued instructions read from a tag-indexed physical register file (bypassing just-broadcast operands), then execute.

Execution results are written to the tag-indexed physical register file, as well as broadcast to the bypass network preceding each functional unit.

Graduation puts the previous tag for the written architectural register into the free queue

so that it can be reused for a newly decoded instruction.

An exception or branch misprediction causes the remap file to back up to the remap state at the last valid instruction via a combination of state snapshots and cycling through the

previous tags in the in-order pre-graduation queue. Since this mechanism is required, and

since it can recover any remap state (not just the state before the instruction currently being graduated), branch mispredictions can be handled before the branch reaches

graduation, potentially hiding the branch misprediction latency.
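The rename, broadcast, and graduation steps of the tag-indexed scheme can be sketched as follows. This is a simplified Python model under stated assumptions, not the R10000 or 21264 design: the `Renamer` class and its method names are hypothetical, issue queues are omitted, and misprediction recovery (snapshots and walking back through previous tags) is not modelled.

```python
from collections import deque

class Renamer:
    def __init__(self, arch_regs, num_physical):
        # Assumption: architectural register i starts mapped to tag i.
        self.remap = {r: t for t, r in enumerate(arch_regs)}
        self.ready = {t: True for t in self.remap.values()}
        self.free = deque(range(len(arch_regs), num_physical))  # free tag FIFO
        self.rob = deque()   # in program order: (arch dest, new tag, previous tag)

    def rename(self, dest, srcs):
        # Read operands: the tag and its ready bit replace the architectural name.
        src_tags = [(self.remap[s], self.ready[self.remap[s]]) for s in srcs]
        new_tag = self.free.popleft()   # IndexError here stands in for a decode stall
        prev = self.remap[dest]
        self.remap[dest] = new_tag      # later readers of dest see the new tag
        self.ready[new_tag] = False     # the producing instruction has not executed
        self.rob.append((dest, new_tag, prev))
        return new_tag, src_tags

    def complete(self, tag):
        # Result tag broadcast: mark the physical register as ready.
        self.ready[tag] = True

    def graduate(self):
        # The oldest instruction retires; the previous tag for its destination is freed.
        dest, tag, prev = self.rob.popleft()
        self.free.append(prev)
        return prev

r = Renamer(["R1", "R2"], num_physical=4)
t1, s1 = r.rename("R1", ["R2"])   # R1 = f(R2)
t2, s2 = r.rename("R1", ["R1"])   # R1 = g(R1), reads the not-yet-ready tag t1
r.complete(t1)
freed = r.graduate()              # first instruction graduates, old R1 tag is freed
```

Note that the tag freed at graduation is the previous mapping of the destination, not the tag the graduating instruction wrote; the newly written tag stays live until a later writer of the same architectural register graduates.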

Details: reservation stations

This is the style

used in the integer section of the AMD K7 and K8 designs.






In the renaming stage, every architectural register referenced for reads is looked up in

both the architecturally-indexed future file and the rename file. The future file read gives

the value of that register, if there is no outstanding instruction yet to write to it (i.e., it's ready). When the instruction is placed in an issue queue, the values read from the future

file are written into the corresponding entries in the reservation stations. Register writes

in the instruction cause a new, non-ready tag to be written into the rename file. The tag number is usually serially allocated in instruction order; no free tag FIFO is necessary.

Just as with the tag-indexed scheme, the issue queues wait for non-ready operands to see

matching tag broadcasts. Unlike the tag-indexed scheme, matching tags cause the

corresponding broadcast value to be written into the issue queue entry's reservation station.

Issued instructions read their arguments from the reservation station, bypass just-broadcast operands, and then execute. As mentioned earlier, the reservation station

register files are usually small, with perhaps eight entries.

Execution results are written to the reorder buffer, to the reservation stations (if the issue queue entry has a matching tag), and to the future file if this is the last instruction to target that architectural register (in which case the register is marked ready).

Graduation copies the value from the reorder buffer into the architectural register file. The sole use of the architectural register file is to recover from exceptions and branch

mispredictions.

Exceptions and branch mispredictions, recognised at graduation, cause the architectural

file to be copied to the future file, and all registers marked as ready in the rename file.

There is usually no way to reconstruct the state of the future file for some instruction intermediate between decode and graduation, so there is usually no way to do early

recovery from branch mispredictions.

Comparison between the schemes

In both schemes, instructions are inserted in-order into the issue queues, but are removed

out-of-order. If the queues do not collapse empty slots, then they will either have many

unused entries, or require some sort of variable priority encoding for when multiple

instructions are simultaneously ready to go. Queues that collapse holes have simpler priority encoding, but require simple but large circuitry to advance instructions through

the queue.

Reservation stations have better latency from rename to execute, because the rename

stage finds the register values directly, rather than finding the physical register number, and then using that to find the value. This latency shows up as a component of the branch

mispredict latency.








Reservation stations also have better latency from instruction issue to execution, because

each local register file is smaller than the large central file of the tag-indexed scheme.

Tag generation and exception processing are also simpler in the reservation station scheme, as discussed below.

The physical register files used by reservation stations usually collapse unused entries in parallel with the issue queue they serve, which makes these register files larger in aggregate, more power-hungry, and more complicated than the simpler register files used in a tag-indexed scheme. Worse yet, every entry in each reservation station can be written by every result bus, so that a reservation-station machine with, e.g., 8 issue queue entries per functional unit will typically have 9 times as many bypass networks as an equivalent tag-indexed machine. Result forwarding thus takes much more power and area

than in a tag-indexed design.

Furthermore, the reservation station scheme has four places (Future File, Reservation

Station, Reorder Buffer and Architectural File) where a result value can be stored, where

the tag-indexed scheme has just one (the physical register file). Because the results from the functional units, broadcast to all these storage locations, must reach a much larger

number of locations in the machine than in the tag-indexed scheme, this function consumes more power, area, and time. Still, in machines equipped with very accurate

branch prediction schemes and if execute latencies are a major concern, reservation

stations can work remarkably well.

History

The IBM System/360 Model 91 was an early machine that supported out-of-order

execution of instructions; it used the Tomasulo algorithm, which uses register renaming.

The POWER1, introduced in 1990, was the first microprocessor to use register renaming and out-of-order execution.

The original R10000 design had neither collapsing issue queues nor variable priority encoding, and suffered starvation problems as a result: the oldest instruction in the

queue would sometimes not be issued until both instruction decode stopped completely

for lack of rename registers, and every other instruction had been issued. Later revisions of the design starting with the R12000 used a partially variable priority encoder to

mitigate this problem.

Early out-of-order machines did not separate the renaming and ROB/PRF storage functions. For that matter, some of the earliest, such as Sohi's RUU or the Metaflow DCAF, combined scheduling, renaming, and storage all in the same structure.

Most modern machines do renaming by RAM indexing a map table with the logical

register number. E.g., P6 did this; future files do this, and have data storage in the same

structure.







However, earlier machines used content-addressable memory (a type of hardware which

provides the functionality of an associative array) in the renamer. E.g., the HPSM RAT,

or Register Alias Table, essentially used a CAM on the logical register number in combination with different versions of the register.

In many ways, the story of out-of-order microarchitecture has been how these CAMs have been progressively eliminated. Small CAMs are useful; large CAMs are impractical.

The P6 microarchitecture was the first Intel microarchitecture to implement both out-of-order execution and register renaming. It was used in the Pentium Pro, Pentium II, Pentium III, Pentium M, Core, and Core 2 microprocessors.

Basic concept

In-order processors

In earlier processors, the processing of instructions is normally done in these steps:

1. Instruction fetch.

2. If input operands are available (in registers for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.

3. The instruction is executed by the appropriate functional unit.

4. The functional unit writes the results back to the register file.

Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:

1. Instruction fetch.

2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).

3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.

4. The instruction is issued to the appropriate functional unit and executed by that unit.

5. The results are queued.

6. Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
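The difference between the two step lists can be sketched with a toy timing model in Python. Everything here is an assumption made for illustration: one instruction issues per cycle, execution takes one cycle, every instruction is already in the queue at cycle 0, any number of results may retire in the same cycle, and `ready_at[i]` is simply the cycle at which instruction i's operands arrive (for example from memory), independent of the other instructions.

```python
def in_order_issue(ready_at):
    """Issue cycle of each instruction when a stalled instruction blocks all later ones."""
    cycle, issued = 0, []
    for r in ready_at:
        cycle = max(cycle, r)      # stall until this instruction's operands are available
        issued.append(cycle)
        cycle += 1
    return issued

def out_of_order_issue(ready_at):
    """Issue cycle of each instruction when any ready instruction may leave the queue."""
    issued = [None] * len(ready_at)
    waiting = set(range(len(ready_at)))
    cycle = 0
    while waiting:
        ready = [i for i in waiting if ready_at[i] <= cycle]
        if ready:
            i = min(ready)         # pick the oldest instruction that is ready
            issued[i] = cycle
            waiting.remove(i)
        cycle += 1
    return issued

def retire_times(issued):
    """Results are written back in program order, after all older instructions."""
    t, out = 0, []
    for c in issued:
        t = max(t, c + 1)          # cannot retire before executing, nor before older ones
        out.append(t)
    return out

ready_at = [5, 0, 1, 0]            # instruction 0 waits five cycles for a load
in_order = in_order_issue(ready_at)
out_of_order = out_of_order_issue(ready_at)
in_order_retire = retire_times(in_order)
out_of_order_retire = retire_times(out_of_order)
```

In this toy run the out-of-order machine executes the three younger instructions while the first one waits, yet all four still retire in program order.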







The key concept of OoO processing is to allow the processor to avoid a class of stalls that

occur when the data needed to perform an operation are unavailable. In the outline above,

the OoO processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoO processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as

normal. The way the instructions are ordered in the original computer code is known as program order; in the processor they are handled in data order, the order in which the

data, operands, become available in the processor's registers. Fairly complex circuitry is

needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline deepens and the speed

difference between main memory (or cache memory) and the processor widens. On

modern machines, the processor runs many times faster than the memory, so during the

time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues which allow the dispatch step to be decoupled from the issue step and the graduation stage to

be decoupled from the execute stage. An early name for the paradigm was decoupled architecture. In the earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion.

To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called register renaming is used. In

this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register

can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception,

which requires the instructions to be completed in program order. The queue allows

results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions.

The ability to issue instructions past branches which have yet to resolve is known as

speculative execution.






Micro-architectural choices

• Are the instructions dispatched to a centralized queue or to multiple distributed

queues?

IBM PowerPC processors use queues which are distributed among the different functional units while other Out-of-Order processors use a centralized queue.

IBM uses the term reservation stations for their distributed queues.

• Is there an actual results queue or are the results written directly into a register

file? For the latter, the queueing function is handled by register maps which hold the register renaming information for each instruction in flight.

Early Intel out-of-order processors use a results queue called a re-order buffer,

while most later Out-of-Order processors use register maps.

More precisely: Intel P6 family microprocessors have both a ROB re-order buffer

and a RAT register map mechanism. The ROB was motivated mainly by branch misprediction recovery.

The Intel P6 family was among the earliest OoO processors. It was supplanted by the Intel Pentium 4 Willamette microarchitecture, but returned after the right hand turn and, at the time of writing (2009), is still Intel's flagship microprocessor family.

Data dependency

A data dependency in computer science is a situation in which a program statement

(instruction) refers to the data of a preceding statement. In compiler theory, the technique

used to discover data dependencies among statements (or instructions) is called dependence analysis.

There are two types of dependencies: data and control.







Contents

• 1 Data dependencies

o 1.1 True dependency

o 1.2 Anti-dependency

o 1.3 Output dependency

• 2 Control Dependency

• 3 Implications

• 4 References

Data dependencies

Assuming statements S1 and S2, S2 depends on S1 if:

[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ Ø

where:

• I(Si) is the set of memory locations read by Si,

• O(Sj) is the set of memory locations written by Sj,

• and there is a feasible run-time execution path from S1 to S2.

This condition is called the Bernstein condition, named after A. J. Bernstein.

Three cases exist:

• True (data) dependence: O(S1) ∩ I(S2) ≠ Ø, S1 → S2, and S1 writes something read by S2.

• Anti-dependence: I(S1) ∩ O(S2) ≠ Ø, the mirror relationship of true dependence: S1 reads something that S2 later overwrites.

• Output dependence: O(S1) ∩ O(S2) ≠ Ø, S1 → S2, and both write the same memory location.
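The three cases can be checked mechanically with set intersections. The following is a minimal Python sketch, not taken from the source: the `Stmt` type, the `dependences` function, and the sample statements are hypothetical names chosen for illustration, and the sketch assumes that a feasible execution path from the first statement to the second already exists.

```python
from typing import FrozenSet, List, NamedTuple


class Stmt(NamedTuple):
    """A statement described only by the locations it reads and writes."""
    name: str
    reads: FrozenSet[str]    # I(S): locations read by the statement
    writes: FrozenSet[str]   # O(S): locations written by the statement


def dependences(s1: Stmt, s2: Stmt) -> List[str]:
    """Classify how s2 depends on s1, assuming s1 executes before s2."""
    kinds = []
    if s1.writes & s2.reads:     # O(S1) ∩ I(S2) is non-empty
        kinds.append("true")
    if s1.reads & s2.writes:     # I(S1) ∩ O(S2) is non-empty
        kinds.append("anti")
    if s1.writes & s2.writes:    # O(S1) ∩ O(S2) is non-empty
        kinds.append("output")
    return kinds


# Hypothetical three-statement program used only for illustration.
s1 = Stmt("S1", frozenset({"x"}), frozenset({"a"}))   # a = 2 * x
s2 = Stmt("S2", frozenset({"a"}), frozenset({"b"}))   # b = a / 3
s3 = Stmt("S3", frozenset({"y"}), frozenset({"a"}))   # a = 9 * y

print(dependences(s1, s2))  # ['true']
print(dependences(s2, s3))  # ['anti']
print(dependences(s1, s3))  # ['output']
```

An empty result means the Bernstein condition does not hold for the pair, so the two statements could be reordered or run in parallel as far as data dependences are concerned.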

True dependency

A true dependency, also known as a data dependency, occurs when an instruction

depends on the result of a previous instruction:

1. A = 3
2. B = A
3. C = B

Instruction 3 is truly dependent on instruction 2, as the final value of C depends on the

instruction updating B. Instruction 2 is truly dependent on instruction 1, as the final value







of B depends on the instruction updating A. Since instruction 3 is truly dependent upon

instruction 2 and instruction 2 is truly dependent on instruction 1, instruction 3 is also

truly dependent on instruction 1. Instruction level parallelism is therefore not an option in this example. [1]

Anti-dependency

An anti-dependency occurs when an instruction requires a value that is later updated. In the following example, instruction 3 anti-depends on instruction 2: the ordering of these instructions cannot be changed, nor can they be executed in parallel (possibly changing the instruction ordering), as this would affect the final value of A.

1. B = 3
2. A = B + 1
3. B = 7

An anti-dependency is an example of a name dependency. That is, renaming of variables could remove the dependency, as in the next example:

1. B = 3
N. B2 = B
2. A = B2 + 1
3. B = 7

A new variable, B2, has been declared as a copy of B in a new instruction, instruction N. The anti-dependency between 2 and 3 has been removed, meaning that these instructions may now be executed in parallel. However, the modification has introduced a new dependency: instruction 2 is now truly dependent on instruction N, which is truly dependent upon instruction 1. As true dependencies, these new dependencies are impossible to safely remove. [1]

Output dependency

An output dependency occurs when the ordering of instructions will affect the final output value of a variable. In the example below, there is an output dependency between instructions 3 and 1: changing the ordering of these two instructions will change the final value of A, thus these instructions cannot be executed in parallel.

1. A = 2 * X
2. B = A / 3
3. A = 9 * Y

As with anti-dependencies, output dependencies are name dependencies. That is, they

may be removed through renaming of variables, as in the below modification of the

above example:

1. A2 = 2 * X
2. B = A2 / 3
3. A = 9 * Y

A commonly used naming convention for data dependencies is the following: Read-After-Write (true dependency), Write-After-Write (output dependency), and Write-After-Read (anti-dependency). [1]
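Renaming can be applied systematically: if every write gets a fresh name, only Read-After-Write dependencies remain. The sketch below is an illustration under assumptions, not a description of any particular compiler or processor. The `rename` function and the tuple encoding of instructions are hypothetical, and the scheme assumes the generated names (such as A1) do not already occur in the program.

```python
from typing import Dict, List, Tuple

# An instruction is modelled as (destination, [sources]); operators are omitted
# because only the names matter for dependence analysis.
Instr = Tuple[str, List[str]]


def rename(program: List[Instr]) -> List[Instr]:
    """Give every write a fresh name so that no name is written twice."""
    version: Dict[str, int] = {}   # how many times each original name was written
    current: Dict[str, str] = {}   # original name -> fresh name holding its latest value
    renamed = []
    for dst, srcs in program:
        # Sources read the most recent version; never-written names stay as they are.
        new_srcs = [current.get(s, s) for s in srcs]
        version[dst] = version.get(dst, 0) + 1
        new_dst = f"{dst}{version[dst]}"
        current[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed


# The output-dependency example: A = 2 * X, B = A / 3, A = 9 * Y
program = [("A", ["X"]), ("B", ["A"]), ("A", ["Y"])]
print(rename(program))  # [('A1', ['X']), ('B1', ['A1']), ('A2', ['Y'])]
```

After renaming, instructions 1 and 3 write different names (A1 and A2), so the Write-After-Write and Write-After-Read dependencies on A disappear, while the Read-After-Write dependency of instruction 2 on instruction 1 is preserved.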

Control Dependency

An instruction B is control dependent on a preceding instruction A if the latter determines

whether B should execute or not. In the following example, instruction 2 is control

dependent on instruction 1.

1. if a == b goto AFTER
2. A = 2 * X
3. AFTER:

Intuitively, there is control dependence between two statements S1 and S2 if

• S1 could possibly be executed before S2

• The outcome of executing S1 determines whether S2 will be executed.

A typical example is the control dependence between the condition part of an if statement and the statements in the corresponding true/false bodies.

A formal definition of control dependence can be presented as follows:

A statement S2 is said to be control dependent on another statement S1 iff

• there exists a path P from S1 to S2 such that every statement Si ≠ S1 within P will be followed by S2 in each possible path to the end of the program and

• S1 will not necessarily be followed by S2, i.e. there is an execution path from S1 to the end of the program that does not go through S2.

Expressed with the help of (post-)dominance the two conditions are equivalent to

• S2 post-dominates all Si

• S2 does not post-dominate S1
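The post-dominance formulation can be tried out on a small control-flow graph. The Python sketch below is a simplified illustration under stated assumptions: the graph encoding, the function names `post_dominators` and `control_dependent`, and the single artificial "exit" node are all hypothetical, every non-exit node is assumed to have at least one successor, and S1 and S2 are assumed to be distinct. The test "S2 post-dominates some successor of S1" is used as a stand-in for the path condition above.

```python
from typing import Dict, List, Set


def post_dominators(succ: Dict[str, List[str]], exit_node: str) -> Dict[str, Set[str]]:
    """Iteratively compute, for each node, the set of nodes that post-dominate it."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}   # start from "everything" and shrink
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A node is post-dominated by itself and by whatever post-dominates
            # all of its successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependent(s2: str, s1: str, succ: Dict[str, List[str]],
                      pdom: Dict[str, Set[str]]) -> bool:
    """True if s2 is control dependent on s1 (s1 and s2 assumed distinct)."""
    return s2 not in pdom[s1] and any(s2 in pdom[s] for s in succ[s1])


# Control-flow graph of the example: 1 is the branch, 2 the assignment, 3 is AFTER.
cfg = {"1": ["2", "3"], "2": ["3"], "3": ["exit"], "exit": []}
pdom = post_dominators(cfg, "exit")
print(control_dependent("2", "1", cfg, pdom))  # True
print(control_dependent("3", "1", cfg, pdom))  # False
```

Instruction 2 is reported as control dependent on the branch, while the label AFTER is not, because it is reached on every path from the branch to the end of the program.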

Implications

Conventional programs are written assuming the sequential execution model. Under this

model, instructions execute one after the other, atomically (i.e., at any given point of time

only one instruction is executed) and in the order specified by the program.

However, dependencies among statements or instructions may hinder parallelism, that is, the parallel execution of multiple instructions, either by a parallelizing compiler or by a processor exploiting instruction level parallelism. Recklessly executing multiple instructions without considering the related dependences risks producing wrong results, namely hazards.

Reservation Stations are decentralized features of the microarchitecture of a CPU that allow for register renaming, and are used by the Tomasulo algorithm for dynamic instruction scheduling.

Reservation stations permit the CPU to fetch and re-use a data value as soon as it has been computed, rather than waiting for it to be stored in a register and re-read. When instructions are issued, they can designate the reservation station from which they want their input to be read. When multiple instructions need to write to the same register, all can proceed and only the (logically) last one need actually be written. A reservation station checks whether the operands are available (RAW) and whether the execution unit is free (structural hazard) before starting execution.

Instructions are stored with their available parameters and executed when ready. Results are identified by the unit that will execute the corresponding instruction. Implicit register renaming solves WAR and WAW hazards. Since this is a fully-associative structure, it has a very high cost in comparators (all results returned from the processing units must be compared with all stored addresses).

In Tomasulo's algorithm, instructions are issued in sequence to Reservation Stations, which buffer the instruction as well as the operands of the instruction. If an operand is not available, the Reservation Station listens on a Common Data Bus for the operand to become available. When the operand becomes available, the Reservation Station buffers it, and the execution of the instruction can begin.

Functional Units (such as an adder or a multiplier) each have their own corresponding Reservation Station. The output of the Functional Unit connects to the Common Data Bus, where Reservation Stations are listening for the operands they need.
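The listening behaviour described above can be sketched in a few lines of Python. This is a toy model under assumptions, not an implementation of any real processor: the `ReservationStation` class, its field names (`vj`, `vk` for operand values, `qj`, `qk` for the producing station), the `broadcast` helper, and the station names "Add1" and "Mul1" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReservationStation:
    name: str
    op: Callable[[int, int], int]
    vj: Optional[int] = None   # first operand value, once known
    vk: Optional[int] = None   # second operand value, once known
    qj: Optional[str] = None   # station that will produce the first operand, if pending
    qk: Optional[str] = None   # station that will produce the second operand, if pending

    def ready(self) -> bool:
        """Execution can begin once no operand is still pending."""
        return self.qj is None and self.qk is None

    def snoop(self, tag: str, value: int) -> None:
        """Capture a result broadcast on the common data bus if it is awaited."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def broadcast(stations: List[ReservationStation], tag: str, value: int) -> None:
    """Model the common data bus: every station sees every result."""
    for rs in stations:
        rs.snoop(tag, value)


# Add1 has both operands; Mul1 waits for the result of Add1 as its first operand.
add1 = ReservationStation("Add1", lambda a, b: a + b, vj=2, vk=3)
mul1 = ReservationStation("Mul1", lambda a, b: a * b, qj="Add1", vk=4)
stations = [add1, mul1]

print(add1.ready(), mul1.ready())                # True False
result = add1.op(add1.vj, add1.vk)               # Add1 executes: 2 + 3
broadcast(stations, "Add1", result)              # the result appears on the bus
print(mul1.ready(), mul1.op(mul1.vj, mul1.vk))   # True 20
```

Because Mul1 names the producing station rather than a register, the value reaches it directly from the bus without going through the register file, which is the renaming effect mentioned earlier.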

Scoreboarding



Scoreboarding is a centralized method, used in the CDC 6600 computer, for dynamically scheduling a pipeline so that the instructions can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependencies of every instruction are logged. Instructions are released only when the scoreboard determines that there are no conflicts with previously issued and incomplete instructions. If an instruction is stalled because it is unsafe to continue, the scoreboard monitors the flow of executing instructions until all dependencies have been resolved before the stalled instruction is issued.



















Contents

• 1 Stages

• 2 Data structure

• 3 The algorithm

• 4 Remarks

• 5 External links

• 6 See also

Stages

Instructions are decoded in order and go through the following four stages.

1. Issue: The system checks which registers will be read and written by this instruction. This information is remembered as it will be needed in the following stages. In order to avoid output dependencies (WAW - Write after Write) the instruction is stalled until instructions intending to write to the same register are completed. The instruction is also stalled when required functional units are currently busy.

2. Read operands: After an instruction has been issued and correctly allocated to the required hardware module, the instruction waits until all operands become available. This procedure resolves read dependencies (RAW - Read after Write) because registers which are intended to be written by another instruction are not considered available until they are actually written.

3. Execution: When all operands have been fetched, the functional unit starts its execution. After the result is ready, the scoreboard is notified.

4. Write Result: In this stage the result is about to be written to its destination register. However, this operation is delayed until earlier instructions that intend to read the registers this instruction wants to write to have completed their read operands stage. This way, so-called anti-dependencies (WAR - Write after Read) can be addressed.

Data structure

To control the execution of the instructions, the scoreboard maintains three status tables:

• Instruction Status: Indicates, for each instruction being executed, which of the four stages it is in.

• Functional Unit Status: Indicates the state of each functional unit. Each functional unit maintains 9 fields in the table:

o Busy: Indicates whether the unit is being used or not

o Op: Operation to perform in the unit (e.g. MUL, DIV or MOD)

o Fi: Destination register






o Fj, Fk: Source-register numbers

o Qj, Qk: Functional units that will produce the source registers Fj, Fk

o Rj, Rk: Flags that indicate when Fj, Fk are ready

• Register Status: Indicates, for each register, which functional unit will write results into it.
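One way to picture the three tables is as plain records. The following Python sketch is only an illustration: the class name `FunctionalUnitStatus`, the unit names "Integer", "Mult" and "Add", the register names, and the stage label are assumptions made for the example, not part of the source.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class FunctionalUnitStatus:
    """The nine per-unit fields listed above."""
    busy: bool = False          # Busy: unit in use?
    op: Optional[str] = None    # Op: operation to perform
    fi: Optional[str] = None    # Fi: destination register
    fj: Optional[str] = None    # Fj: first source register
    fk: Optional[str] = None    # Fk: second source register
    qj: Optional[str] = None    # Qj: unit that will produce Fj
    qk: Optional[str] = None    # Qk: unit that will produce Fk
    rj: bool = False            # Rj: Fj ready and not yet read
    rk: bool = False            # Rk: Fk ready and not yet read


# Functional unit status: one record per unit (hypothetical unit names).
fu_status: Dict[str, FunctionalUnitStatus] = {
    name: FunctionalUnitStatus() for name in ("Integer", "Mult", "Add")
}

# Register status: which unit, if any, will write each register.
register_status: Dict[str, Optional[str]] = {f"R{i}": None for i in range(4)}

# Instruction status: which of the four stages each instruction is in.
instruction_status: Dict[int, str] = {0: "issue"}

print(len(fields(FunctionalUnitStatus)))  # 9
```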

The algorithm

The detailed algorithm for the scoreboard control is described below:

function issue(op, dst, src1, src2)
    // FU can be any functional unit that can execute operation op
    wait until (!Busy[FU] AND !Result[dst]);
    Busy[FU] ← Yes;
    Op[FU] ← op;
    Fi[FU] ← dst;
    Fj[FU] ← src1;
    Fk[FU] ← src2;
    Qj[FU] ← Result[src1];
    Qk[FU] ← Result[src2];
    Rj[FU] ← not Qj[FU];
    Rk[FU] ← not Qk[FU];
    Result[dst] ← FU;

function read_operands(FU)
    wait until (Rj[FU] AND Rk[FU]);
    Rj[FU] ← No;
    Rk[FU] ← No;

function execute(FU)
    // Execute whatever FU must do

function write_back(FU)
    wait until (∀f {(Fj[f]≠Fi[FU] OR Rj[f]=No) AND (Fk[f]≠Fi[FU] OR Rk[f]=No)})
    foreach f do
        if Qj[f]=FU then Rj[f] ← Yes;
        if Qk[f]=FU then Rk[f] ← Yes;
    Result[Fi[FU]] ← 0;
    Busy[FU] ← No;
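To experiment with the control logic, the pseudocode can be approximated in Python. The sketch below is a simplified model under several assumptions: the `Scoreboard` class and its method names are hypothetical, each "wait until" is replaced by returning False so the caller can retry later, the caller chooses the functional unit and is responsible for calling the stages in order, execution itself is not modelled, and a finished unit is reset completely (including clearing Qj/Qk in waiting units) rather than only marked not busy.

```python
from typing import Dict, List, Optional


def new_unit() -> dict:
    """An idle functional unit with the nine status fields."""
    return dict(busy=False, op=None, fi=None, fj=None, fk=None,
                qj=None, qk=None, rj=False, rk=False)


class Scoreboard:
    def __init__(self, units: List[str]) -> None:
        self.fu: Dict[str, dict] = {u: new_unit() for u in units}
        self.result: Dict[str, Optional[str]] = {}   # register -> unit that will write it

    def issue(self, u: str, op: str, dst: str, src1: str, src2: str) -> bool:
        f = self.fu[u]
        if f["busy"] or self.result.get(dst):        # structural or WAW hazard
            return False
        qj, qk = self.result.get(src1), self.result.get(src2)
        f.update(busy=True, op=op, fi=dst, fj=src1, fk=src2,
                 qj=qj, qk=qk, rj=qj is None, rk=qk is None)
        self.result[dst] = u
        return True

    def read_operands(self, u: str) -> bool:
        f = self.fu[u]
        if not (f["rj"] and f["rk"]):                # RAW hazard: an operand is pending
            return False
        f["rj"] = f["rk"] = False                    # operands have now been read
        return True

    def write_back(self, u: str) -> bool:
        fi = self.fu[u]["fi"]
        # WAR hazard: some unit still has to read the old value of fi.
        if any((f["fj"] == fi and f["rj"]) or (f["fk"] == fi and f["rk"])
               for f in self.fu.values()):
            return False
        for f in self.fu.values():                   # wake up units waiting on u
            if f["qj"] == u:
                f["qj"], f["rj"] = None, True
            if f["qk"] == u:
                f["qk"], f["rk"] = None, True
        self.result[fi] = None
        self.fu[u].update(new_unit())
        return True


sb = Scoreboard(["MUL", "ADD", "DIV"])
print(sb.issue("MUL", "mul", "R1", "R2", "R3"))   # True
print(sb.issue("ADD", "add", "R4", "R1", "R5"))   # True
print(sb.issue("DIV", "div", "R1", "R6", "R7"))   # False (WAW on R1)
print(sb.read_operands("ADD"))                    # False (RAW on R1)
print(sb.read_operands("MUL"))                    # True
print(sb.write_back("MUL"))                       # True
print(sb.read_operands("ADD"))                    # True
```

The trace shows the ADD instruction stalling in the read-operands stage until the MUL unit writes R1, and the DIV instruction being refused at issue because R1 already has a pending writer.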

Remarks

The scoreboarding method must stall the issue stage when there is no functional unit available. In this case, future instructions that could potentially be executed will wait until the structural hazard is resolved. Some other techniques, like the Tomasulo algorithm, can avoid the structural hazard and also resolve WAR and WAW dependencies with register renaming.









