03 dynamic sched

Upload: rekha-mudalagiri

Post on 04-Apr-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/30/2019 03 Dynamic Sched

    1/84

    Computer Architecture

    Instruction Level Parallelism

    S. G. Shivaprasad Yadav


    8/21/2011 2

    Review from Last Time

4 papers: all about where to draw the line between HW and SW

IBM set the foundations for ISAs since the 1960s:

8-bit byte

Byte-addressable memory (as opposed to word-addressable memory)

32-bit words

Two's complement arithmetic (but not the first processor)

32-bit (SP) / 64-bit (DP) floating-point format and registers

Commercial use of microcoded CPUs; binary compatibility / computer family

    B5000 very different model: HLL only, stack, Segmented VM

IBM paper made the case for ISAs good for microcoded processors, leading to CISC; Berkeley paper made the case for ISAs for pipelined + cached microprocessors (VLSI), leading to RISC

Who won, RISC vs. CISC? VAX is dead. Intel 80x86 on the desktop, RISC in embedded, servers both x86 and RISC


    Outline

    ILP

    Compiler techniques to increase ILP

Loop Unrolling

Static Branch Prediction

    Dynamic Branch Prediction

Overcoming Data Hazards with Dynamic Scheduling

    (Start) Tomasulo Algorithm

    Conclusion


    Recall from Pipelining Review

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

Ideal pipeline CPI: measure of the maximum performance attainable by the implementation

Structural hazards: HW cannot support this combination of instructions

Data hazards: instruction depends on the result of a prior instruction still in the pipeline

Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
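As a quick worked example, the CPI equation above can be evaluated directly; the stall counts per hazard class below are assumed for illustration, not taken from the slides:

```python
# Pipeline CPI = Ideal pipeline CPI + stalls per instruction from each
# hazard class.  All stall counts here are assumed example values.
ideal_cpi = 1.0            # one instruction per cycle when nothing stalls
structural_stalls = 0.05   # assumed average structural stalls per instruction
data_hazard_stalls = 0.20  # assumed
control_stalls = 0.15      # assumed

pipeline_cpi = ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls
print(pipeline_cpi)              # 1.4 cycles per instruction
print(ideal_cpi / pipeline_cpi)  # fraction of ideal throughput achieved
```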


    Instruction Level Parallelism

Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance

2 approaches to exploit ILP:
1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)

Next 4 lectures on this topic


    Instruction-Level Parallelism (ILP)

Basic Block (BB) ILP is quite small. BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

For typical MIPS programs, the average dynamic branch frequency is 15% to 25% => 4 to 7 instructions execute between a pair of branches. Since the instructions in a BB are likely to depend on each other, the amount of overlap we can exploit within a basic block is likely to be less than the average BB size.

To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.

Simplest: loop-level parallelism, to exploit parallelism among iterations of a loop. E.g., for (i=1; i ...


    Loop-Level Parallelism

Exploit loop-level parallelism by unrolling the loop, either:

1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler
(Another way is vectors, to be covered later)

Determining instruction dependence is critical to loop-level parallelism.

To exploit ILP we must determine which instructions can be executed in parallel. If 2 instructions are:

parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)

dependent, they are not parallel and must be executed in order, although they may often be partially overlapped


Three different types of dependences: data dependences, name dependences, and control dependences.

Data Dependence and Hazards

Instr J is data dependent (aka true dependence) on Instr I if either of the following holds:
1. Instr I produces a result that may be used by Instr J, or
2. Instr J is data dependent on Instr K, and Instr K is data dependent on Instr I

I: add r1,r2,r3
J: sub r4,r1,r3


    Data Dependence and Hazards

Example: Consider the following MIPS code sequence that increments a vector of values in memory (starting at 0(R1), and with the last element at 8(R2)) by a scalar in register F2.


    Data Dependence and Hazards

Both of the above dependent sequences, as shown by the arrows, have each instruction depending on the previous one.

The arrows here show the order that must be preserved for correct execution. An arrow points from an instruction that must precede the instruction that the arrowhead points to.

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks (and a pipeline depth longer than the distance between the instructions in cycles) to detect a hazard and stall, thereby reducing or eliminating the overlap.

Data dependence in an instruction sequence reflects data dependence in the source code; the effect of the original data dependence must be preserved. If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.


ILP and Data Dependences, Hazards

HW/SW must preserve program order: the order instructions would execute in if executed sequentially, as determined by the original source program.

Dependences are a property of programs. The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.

Importance of the data dependences:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited

HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program


Data Dependences, Hazards

A dependence can be overcome in two different ways: maintaining the dependence but avoiding a hazard, and eliminating a dependence by transforming the code.

Scheduling the code is the primary method used to avoid a hazard without altering a dependence, and such scheduling can be done both by the compiler and by the hardware.


Name Dependence #1: Anti-dependence

Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence.

Instr J writes an operand before Instr I reads it:

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an anti-dependence by compiler writers. This results from reuse of the name r1.

If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.


Name Dependence #2: Output dependence

Instr J writes an operand before Instr I writes it:

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an output dependence by compiler writers. This also results from the reuse of the name r1.

If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.

Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict. Register renaming resolves name dependence for registers, either by the compiler or by HW.


    Data Hazards

A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence.

Because of the dependence, we must preserve program order, that is, the order that the instructions would execute in if executed sequentially one at a time as determined by the original source program.

The goal of both software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that the necessary program order is preserved.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. Consider two instructions i and j, with i preceding j in program order. The possible data hazards are RAW (read after write), WAW (write after write), and WAR (write after read).


    Data Hazards

RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i.

WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an anti-dependence. WAR hazards cannot occur in most static issue pipelines (even deeper pipelines or floating-point pipelines) because all reads are early (in ID) and all writes are late (in WB). A WAR hazard occurs either when there are some instructions that write results early in the instruction pipeline and other instructions that read a source late in the pipeline, or when instructions are reordered.
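The three hazard classes reduce to set tests on each instruction's register reads and writes. The helper below is an illustrative sketch (not from the slides), with i the earlier instruction and j the later one:

```python
# Classify the data hazard(s) between instruction i (earlier) and
# instruction j (later), given the registers each one reads and writes.
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:      # j reads what i writes -> true dependence
        hazards.append("RAW")
    if i_writes & j_writes:     # both write the same name -> output dependence
        hazards.append("WAW")
    if i_reads & j_writes:      # j writes what i reads -> anti-dependence
        hazards.append("WAR")
    return hazards

# I: sub r4,r1,r3   J: add r1,r2,r3  (J writes r1, which I reads)
print(classify_hazards({"r1", "r3"}, {"r4"}, {"r2", "r3"}, {"r1"}))  # ['WAR']
```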


    Control Dependencies

A control dependence determines the ordering of an instruction, i, with respect to a branch instruction, so that instruction i is executed in correct program order and only when it should be.

Every instruction is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order.

One of the simplest examples of a control dependence is the dependence of the statements in the "then" part of an if statement on the branch. For example, in the code segment

if p1 {
  S1;
};
if p2 {
  S2;
}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.


    Control Dependencies

In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.


    Control Dependence Ignored

Control dependence need not be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.

Instead, 2 properties critical to program correctness are:

1) exception behavior, and

2) data flow


    Exception Behavior

Preserving exception behavior: any changes in instruction execution order must not change how exceptions are raised in the program (no new exceptions).

Example:

DADDU R2,R3,R4
BEQZ  R2,L1
LW    R1,0(R2)
L1:

(Assume branches are not delayed.)

Problem with moving LW before BEQZ? If R2 is zero, the hoisted load executes when it should not, and could raise a memory protection exception that the original program would never raise.


    Data Flow

Data flow: the actual flow of data values among instructions that produce results and those that consume them. Branches make the flow dynamic, determining which instruction is the supplier of data.

Example:

DADDU R1,R2,R3
BEQZ  R4,L
DSUBU R1,R5,R6
L: OR  R7,R1,R8

Does OR depend on DADDU or DSUBU? We must preserve data flow on execution.


    Outline

    ILP

    Compiler techniques to increase ILP

    Loop Unrolling

    Static Branch Prediction

    Dynamic Branch Prediction

Overcoming Data Hazards with Dynamic Scheduling

    (Start) Tomasulo Algorithm

    Conclusion



Basic Pipeline Scheduling and Loop Unrolling

To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.

To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.

A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.

We assume the standard five-stage integer pipeline, so that branches have a delay of 1 clock cycle. We assume that the functional units are fully pipelined or replicated (as many times as the pipeline depth), so that an operation of any type can be issued on every clock cycle and there are no structural hazards.

Now we look at how the compiler can increase the amount of available ILP by transforming loops.


Software Techniques - Example

Example: This code adds a scalar to a vector:

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Let's look at the performance of this loop, showing how we can use the parallelism to improve its performance for a MIPS pipeline with the latencies shown below. Ignore delayed branches in these examples.

Instruction producing result | Instruction using result | Latency in cycles | Stalls in cycles
FP ALU op                    | Another FP ALU op        | 4                 | 3
FP ALU op                    | Store double             | 3                 | 2
Load double                  | FP ALU op                | 1                 | 1
Load double                  | Store double             | 1                 | 0
Integer op                   | Integer op               | 1                 | 0


FP Loop: Where are the Hazards?

First translate into MIPS code (to simplify, assume 8 is the lowest address):

Loop: L.D    F0,0(R1)   ;F0=vector element
      ADD.D  F4,F0,F2   ;add scalar from F2
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
      BNEZ   R1,Loop    ;branch R1!=zero


FP Loop Showing Stalls

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

1 Loop: L.D    F0,0(R1)   ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2   ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4   ;store result
7       DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
8       stall             ;assumes can't forward to branch
9       BNEZ   R1,Loop    ;branch R1!=zero

9 clock cycles: rewrite the code to minimize stalls?
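The 9-cycle count follows mechanically from the latency table. The sketch below is an assumed encoding of that table (not from the slides); it issues one instruction per cycle in order and charges the required stall cycles between each producer and consumer:

```python
# Stall cycles required between producer and consumer instruction kinds,
# taken from the latency table above; (int -> branch) encodes the
# "can't forward to branch" assumption.
STALLS = {
    ("load",   "fp_alu"): 1,   # Load double -> FP ALU op
    ("fp_alu", "store"):  2,   # FP ALU op   -> Store double
    ("fp_alu", "fp_alu"): 3,   # FP ALU op   -> another FP ALU op
    ("int",    "branch"): 1,   # integer result not forwarded to the branch
}

def count_cycles(instrs):
    """instrs: list of (kind, dest_reg, source_regs); returns total cycles
    for one pass, assuming in-order single issue."""
    cycle = 0
    producer = {}                      # register -> (kind, issue cycle)
    for kind, dest, srcs in instrs:
        cycle += 1                     # next issue slot
        for reg in srcs:
            if reg in producer:
                pkind, pcycle = producer[reg]
                # wait out the producer's stall cycles if issued too soon
                cycle = max(cycle, pcycle + 1 + STALLS.get((pkind, kind), 0))
        if dest is not None:
            producer[dest] = (kind, cycle)
    return cycle

loop = [
    ("load",   "F0", []),        # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0"]),    # ADD.D  F4,F0,F2
    ("store",  None, ["F4"]),    # S.D    0(R1),F4
    ("int",    "R1", ["R1"]),    # DADDUI R1,R1,#-8
    ("branch", None, ["R1"]),    # BNEZ   R1,Loop
]
print(count_cycles(loop))  # 9
```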


Revised FP Loop Minimizing Stalls

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

Swap DADDUI and S.D by changing the address of S.D:

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4   ;altered offset since DADDUI was moved up
7       BNEZ   R1,Loop

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?


Loop Unrolling

A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.

Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body.

If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required number of registers.


Unroll Loop Four Times (straightforward way)

1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2      ;1 cycle stall after each L.D
6        S.D    0(R1),F4      ;2 cycle stall after each ADD.D; drop DADDUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8     ;drop DADDUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12   ;drop DADDUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32    ;alter to 4*8
26       BNEZ   R1,LOOP

27 clock cycles, or 6.75 per iteration (assumes R1 is a multiple of 4)

Rewrite the loop to minimize stalls?


    Unrolled Loop Detail

We do not usually know the upper bound of the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body.

Instead of a single unrolled loop, we generate a pair of consecutive loops:

the 1st executes (n mod k) times and has a body that is the original loop;

the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.

For large values of n, most of the execution time will be spent in the unrolled loop.
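For the running x[i] = x[i] + s example, this split can be sketched in Python; add_scalar_unrolled4 is an illustrative helper (not from the slides) with k fixed at 4:

```python
def add_scalar_unrolled4(x, s):
    """Add s to every element of x with the body unrolled 4 times:
    a prologue of (n mod 4) original iterations handles the leftover
    elements, then the main loop runs n // 4 times with 4 copies of
    the body, so far fewer branches execute per element."""
    n = len(x)
    for i in range(n % 4):          # prologue: n mod k original iterations
        x[i] += s
    for i in range(n % 4, n, 4):    # unrolled loop: n // k iterations
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s               # 4 copies of the body per branch
    return x

print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6], 10))  # [11, 12, 13, 14, 15, 16]
```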


Unrolled Loop That Minimizes Stalls

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DADDUI R1,R1,#-32
13       S.D    8(R1),F16   ;8-32 = -24
14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration
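Putting the versions side by side, the per-iteration figures quoted in the slides are simple division:

```python
# Cycles per element for each version of the loop (counts from the slides).
versions = {
    "original, with stalls":    (9, 1),    # 9 cycles for 1 element
    "scheduled":                (7, 1),
    "unrolled x4, unscheduled": (27, 4),   # 27 cycles cover 4 elements
    "unrolled x4, scheduled":   (14, 4),
}
for name, (cycles, iters) in versions.items():
    print(f"{name}: {cycles / iters:.2f} cycles per element")
```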


5 Loop Unrolling Decisions

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:

1. Determine that loop unrolling is useful by finding that loop iterations are independent (except for maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent; this transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code



3 Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling (Amdahl's Law)
2. Growth in code size: for larger loops, the concern is that it increases the instruction cache miss rate
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling; if it is not possible to allocate all live values to registers, the code may lose some or all of its advantage

Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.



Static Branch Prediction

[Figure: misprediction rate of the predict-taken scheme for SPEC benchmarks (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), ranging from 4% to 22%; integer programs on the left, floating-point programs on the right.]

To reorder code around branches, we need to predict the branch statically at compile time.

The simplest scheme is to predict a branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency, which for the SPEC programs is 34%. Unfortunately, the misprediction rate for the SPEC programs ranges from not very accurate (59%) to highly accurate (9%).

A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run.


Static Branch Prediction

The key observation that makes this worthwhile is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken.

The same input data were used for the runs and for collecting the profile; other studies have shown that changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

The effectiveness of any branch prediction scheme depends both on the accuracy of the scheme and the frequency of conditional branches, which varies in SPEC from 3% to 24%.

The fact that the misprediction rate for the integer programs is higher, and that such programs typically have a higher branch frequency, is a major limitation for static branch prediction.



Dynamic Branch Prediction

Why does prediction work?

The underlying algorithm has regularities.

The data being operated on has regularities.

The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.

Is dynamic branch prediction better than static branch prediction? It seems to be: there are a small number of important branches in programs which have dynamic behavior.



Dynamic Branch Prediction

Performance = f(accuracy, cost of misprediction)

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.

This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs.

Problem: in a loop, a 1-bit BHT will cause two mispredictions (the average is 9 iterations before exit):

at the end of the loop, when it exits instead of looping as before, and

the first time through the loop on the next pass through the code, when it predicts exit instead of looping.



Dynamic Branch Prediction

Solution: a 2-bit scheme where we change the prediction only if we mispredict twice.

Red: stop, not taken. Green: go, taken. This adds hysteresis to the decision-making process.

State diagram (2-bit saturating counter): states 11 and 10 predict taken; states 01 and 00 predict not taken. A taken branch (T) moves the counter toward 11 and a not-taken branch (NT) moves it toward 00, so the prediction flips only after two consecutive mispredictions.
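A minimal sketch of this 2-bit counter in Python (the class and driver loop are illustrative, not from the slides); a loop branch taken 4 times and then not taken mispredicts only at the exits:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 3,2 predict taken; 1,0 predict
    not taken.  The prediction changes only after two mispredictions."""
    def __init__(self):
        self.state = 3                     # start strongly taken (11)

    def predict(self):
        return self.state >= 2             # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 4 times, not taken once at exit, repeated twice.
p = TwoBitPredictor()
wrong = 0
for outcome in [True] * 4 + [False] + [True] * 4 + [False]:
    if p.predict() != outcome:
        wrong += 1
    p.update(outcome)
print(wrong)  # 2 -> only the two loop exits are mispredicted
```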



BHT Accuracy

Mispredict because either:

wrong guess for that branch, or

got the branch history of the wrong branch when indexing the table.

4096-entry table: [misprediction-rate figure omitted]



Correlated Branch Prediction

Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors. Existing correlating predictors add information about the behavior of the most recent branches to decide how to predict a given branch.

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.

In general, an (m,n) predictor means we record the last m branches to select between 2^m history tables, each with n-bit counters. Thus, the old 2-bit BHT is a (0,2) predictor.

Global Branch History: an m-bit shift register keeping the T/NT status of the last m branches. Each entry in the table has 2^m n-bit predictors.
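A sketch of a (2,2) predictor along these lines (the class name, table size, and initial counter values are assumptions for illustration): a 2-bit global history selects one of four 2-bit counters in each branch's table entry.

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history select one of 2**2 = 4
    two-bit saturating counters in each branch's table entry."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.history = 0                          # 2-bit global history
        # table[branch][history] = 2-bit counter, init weakly not taken
        self.table = [[1] * 4 for _ in range(entries)]

    def predict(self, pc):
        return self.table[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[pc % self.entries][self.history]
        self.table[pc % self.entries][self.history] = (
            min(3, ctr + 1) if taken else max(0, ctr - 1))
        # shift the outcome into the 2-bit global history register
        self.history = ((self.history << 1) | int(taken)) & 0b11

# A strictly alternating branch defeats a plain 2-bit counter but is
# captured here, because each history pattern gets its own counter.
p = CorrelatingPredictor()
misses = 0
for i in range(40):
    taken = (i % 2 == 0)
    if i >= 8 and p.predict(0x40) != taken:   # count after warm-up
        misses += 1
    p.update(0x40, taken)
print(misses)  # 0
```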



Correlating Branches

(2,2) predictor: the behavior of the 2 most recent branches selects between four predictions of the next branch, updating just that prediction.

[Figure: a 4-bit branch address indexes a table of 2-bit per-branch predictors; the 2-bit global branch history selects which of the four predictors in the indexed entry supplies the prediction.]



Accuracy of Different Schemes

[Figure: frequency of mispredictions for SPEC89 benchmarks (nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li) under three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries of a (2,2) predictor; misprediction rates range from 0% to 11%.]



Tournament Predictors

Multilevel branch predictor.

Uses n-bit saturating counters to choose between predictors; the usual choice is between global and local predictors.



Tournament Predictors

Tournament predictor using, say, 4K 2-bit counters indexed by local branch address. Chooses between:

Global predictor:

4K entries indexed by the history of the last 12 branches (2^12 = 4K); each entry is a standard 2-bit predictor.

Local predictor:

Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address; the pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters.



Comparing Predictors (Fig. 2.8)

The advantage of the tournament predictor is its ability to select the right predictor for a particular branch, which is particularly crucial for the integer benchmarks. A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks.



Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

[Figure: branch mispredictions per 1000 instructions on the Pentium 4 for SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) and SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa); roughly 7 to 13 per 1000 instructions for the integer benchmarks and 0 to 5 for the FP benchmarks.]

6% misprediction rate per branch for SPECint (19% of INT instructions are branches)

2% misprediction rate per branch for SPECfp (5% of FP instructions are branches)

    Branch Target Buffers (BTB)


Branch target calculation is costly and stalls the instruction fetch.

A BTB stores PCs the same way as caches:

the PC of a branch is sent to the BTB;

when a match is found, the corresponding Predicted PC is returned;

if the branch was predicted taken, instruction fetch continues at the returned Predicted PC.





Dynamic Branch Prediction Summary

Prediction is becoming an important part of execution.

Branch History Table: 2 bits for loop accuracy.

Correlation: recently executed branches are correlated with the next branch, either different branches (GA) or different executions of the same branch (PA).

Tournament predictors take the insight to the next level by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector. In 2006, tournament predictors using 30K bits were in processors like the Power5 and Pentium 4.

Branch Target Buffer: includes branch address & prediction.


    Advantages of Dynamic Scheduling


Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.

It handles cases where dependences are unknown at compile time; it allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve.

It allows code compiled for one pipeline to run efficiently on a different pipeline.

It simplifies the compiler.

Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture).



HW Schemes: Instruction Parallelism

Key idea: allow instructions behind a stall to proceed:

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

ADDD depends on DIVD's result F0 and must wait, but SUBD does not and can proceed. This enables out-of-order execution and allows out-of-order completion (e.g., SUBD). In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue).

We will distinguish when an instruction begins execution and when it completes execution; between those 2 times, the instruction is in execution.

Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.

    Dynamic Scheduling Step 1


The simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue.

Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:

Issue: decode instructions, check for structural hazards.

Read operands: wait until no data hazards, then read operands.

    A Dynamic Algorithm: Tomasulo's


    For IBM 360/91 (before caches!): long memory latency

    Goal: High Performance without special compilers

    Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations. This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!

    Why Study 1966 Computer?

    The descendants of this have flourished! Alpha 21264, Pentium 4, AMD Opteron, Power 5, …

    Tomasulo Algorithm


    Control & buffers distributed with Function Units (FU). FU buffers called reservation stations; have pending operands

    Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming. Renaming avoids WAR, WAW hazards

    More reservation stations than registers, so can do optimizations compilers can't

    Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs

    Avoids RAW hazards by executing an instruction only when its operands are available

    Loads and Stores treated as FUs with RSs as well

    Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue

    Tomasulo Organization


    [Block diagram: the FP Op Queue and six Load Buffers (Load1-Load6) feed, from memory and the FP Registers, the reservation stations (Add1-Add3 in front of the FP adders, Mult1-Mult2 in front of the FP multipliers) and the Store Buffers (to memory); the Common Data Bus (CDB) broadcasts results to all of them.]


    Reservation Station Components

    Op: Operation to perform in the unit (e.g., + or −)

    Vj, Vk: Value of source operands

        Store buffers have a V field: the result to be stored

    Qj, Qk: Reservation stations producing source registers (value to be written)

        Note: Qj,Qk=0 => ready

        Store buffers only have Qi for the RS producing the result

    Busy: Indicates reservation station or FU is busy

    Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
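The fields above map directly onto a small record type. Here is a minimal Python sketch (the field names follow the slides; the class itself is invented for illustration):

```python
# One reservation station, as described on this slide.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None      # operation to perform (e.g., "+" or "-")
    Vj: Optional[float] = None    # value of first source operand
    Vk: Optional[float] = None    # value of second source operand
    Qj: Optional[str] = None      # RS producing Vj (None => Vj already valid)
    Qk: Optional[str] = None      # RS producing Vk (None => Vk already valid)

    def ready(self) -> bool:
        # Qj,Qk = 0 (here: None) => both operands are available
        return self.busy and self.Qj is None and self.Qk is None

rs = ReservationStation(busy=True, op="+", Vj=1.0, Qk="Mult1")
print(rs.ready())   # False: still waiting on Mult1 to produce Vk
```

When the CDB broadcast tagged "Mult1" arrives, the station would fill Vk, clear Qk, and `ready()` becomes true.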

    Three Stages of Tomasulo Algorithm


    1. Issue: get instruction from FP Op Queue
       If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)

    2. Execute: operate on operands (EX)
       When both operands ready, then execute; if not ready, watch Common Data Bus for result

    3. Write result: finish execution (WB)
       Write on Common Data Bus to all awaiting units; mark reservation station available

    Normal data bus: data + destination (go to bus)
    Common data bus: data + source (come from bus)
        64 bits of data + 4 bits of Functional Unit source address
        Write if matches expected Functional Unit (produces result)
        Does the broadcast

    Example speed: 3 clocks for Fl. pt. +,-; 10 for *; 40 clks for /
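The three stages can be strung together in a much-simplified simulator. This is an illustrative sketch only: loads are replaced by preset register values, one instruction issues per cycle, and the timing is not a cycle-accurate model of the 360/91; the latencies are the example speeds above.

```python
# Sketch of Issue / Execute / Write result with register renaming
# and a CDB broadcast. Not the real hardware; for intuition only.
LATENCY = {"+": 3, "-": 3, "*": 10, "/": 40}
ALU = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def run(program, regvals):
    results = dict(regvals)   # architectural register file
    rename = {}               # register -> tag of RS that will produce it
    stations = {}             # tag -> reservation station
    prog = list(program)
    next_tag = 0
    while prog or stations:
        # 1. Issue (in order): rename sources to tags (Qj/Qk) or read values (Vj/Vk).
        if prog:
            op, dest, s1, s2 = prog.pop(0)
            rs = {"op": op, "dest": dest, "timer": None}
            for V, Q, src in (("Vj", "Qj", s1), ("Vk", "Qk", s2)):
                rs[Q] = rename.get(src)
                rs[V] = None if rs[Q] is not None else results[src]
            stations[next_tag] = rs
            rename[dest] = next_tag          # renaming kills WAR/WAW
            next_tag += 1
        # 2. Execute: start once both operands are present; count down the latency.
        finished = []
        for tag, rs in stations.items():
            if rs["Qj"] is None and rs["Qk"] is None:
                rs["timer"] = LATENCY[rs["op"]] if rs["timer"] is None else rs["timer"]
                rs["timer"] -= 1
                if rs["timer"] == 0:
                    finished.append(tag)
        # 3. Write result: broadcast (tag, value) on the CDB to all waiters.
        for tag in finished:
            rs = stations.pop(tag)
            value = ALU[rs["op"]](rs["Vj"], rs["Vk"])
            if rename.get(rs["dest"]) == tag:    # still the newest writer?
                results[rs["dest"]] = value
                del rename[rs["dest"]]
            for other in stations.values():      # snooping reservation stations
                for V, Q in (("Vj", "Qj"), ("Vk", "Qk")):
                    if other[Q] == tag:
                        other[Q], other[V] = None, value
    return results

# The FP part of the running example, with made-up load results
# M(A1)=3.0 already in F6 and M(A2)=2.0 already in F2:
prog = [("*", "F0", "F2", "F4"),    # MULTD F0,F2,F4
        ("-", "F8", "F6", "F2"),    # SUBD  F8,F6,F2
        ("/", "F10", "F0", "F6"),   # DIVD  F10,F0,F6
        ("+", "F6", "F8", "F2")]    # ADDD  F6,F8,F2
out = run(prog, {"F2": 2.0, "F4": 4.0, "F6": 3.0})
print(out["F0"], out["F8"], out["F6"])   # 8.0 1.0 3.0
```

Note how DIVD captures F6's value (Vk) at issue, so the later ADDD can overwrite F6 with no WAR stall, and how the CDB wakes DIVD only when MULTD's tag is broadcast.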

    Tomasulo Example: Instruction stream


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2
    LD     F2   45+   R3
    MULTD  F0   F2    F4
    SUBD   F8   F6    F2
    DIVD   F10  F0    F6
    ADDD   F6   F8    F2

    Load buffers (3):  Load1 No   Load2 No   Load3 No

    Reservation Stations (3 FP adder R.S., 2 FP mult R.S.):
                                    S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
           Mult2   No

    Register result status (FU that will produce each register):
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      0

    (Clock = clock cycle counter; Time = FU countdown)

    Tomasulo Example Cycle 1


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1
    LD     F2   45+   R3
    MULTD  F0   F2    F4
    SUBD   F8   F6    F2
    DIVD   F10  F0    F6
    ADDD   F6   F8    F2

    Load buffers:  Load1 Yes (34+R2)   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
           Mult2   No

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      1                          Load1

    Tomasulo Example Cycle 2


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1
    LD     F2   45+   R3      2
    MULTD  F0   F2    F4
    SUBD   F8   F6    F2
    DIVD   F10  F0    F6
    ADDD   F6   F8    F2

    Load buffers:  Load1 Yes (34+R2)   Load2 Yes (45+R3)   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
           Mult2   No

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      2            Load2         Load1

    Note: Can have multiple loads outstanding

    Tomasulo Example Cycle 3


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3
    LD     F2   45+   R3      2
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2
    DIVD   F10  F0    F6
    ADDD   F6   F8    F2

    Load buffers:  Load1 Yes (34+R2)   Load2 Yes (45+R3)   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   Yes    MULTD          R(F4)  Load2
           Mult2   No

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      3     Mult1  Load2         Load1

    Note: register names are removed (renamed) in Reservation Stations; MULT issued

    Load1 completing; what is waiting for Load1?

    Tomasulo Example Cycle 4


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4
    DIVD   F10  F0    F6
    ADDD   F6   F8    F2

    Load buffers:  Load1 No   Load2 Yes (45+R3)   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    Yes    SUBD    M(A1)                Load2
           Add2    No
           Add3    No
           Mult1   Yes    MULTD          R(F4)  Load2
           Mult2   No

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      4     Mult1  Load2         M(A1)  Add1

    Load2 completing; what is waiting for Load2?

    Tomasulo Example Cycle 5


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
      2    Add1    Yes    SUBD    M(A1)  M(A2)
           Add2    No
           Add3    No
     10    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      5     Mult1  M(A2)         M(A1)  Add1   Mult2

    Timer starts down for Add1, Mult1

    Tomasulo Example Cycle 6


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
      1    Add1    Yes    SUBD    M(A1)  M(A2)
           Add2    Yes    ADDD           M(A2)  Add1
           Add3    No
      9    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      6     Mult1  M(A2)         Add2   Add1   Mult2

    Issue ADDD here despite name dependency on F6?

    Tomasulo Example Cycle 7


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
      0    Add1    Yes    SUBD    M(A1)  M(A2)
           Add2    Yes    ADDD           M(A2)  Add1
           Add3    No
      8    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      7     Mult1  M(A2)         Add2   Add1   Mult2

    Add1 (SUBD) completing; what is waiting for it?


    Tomasulo Example Cycle 9


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
      1    Add2    Yes    ADDD    (M-M)  M(A2)
           Add3    No
      6    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
      9     Mult1  M(A2)         Add2   (M-M)  Mult2

    Tomasulo Example Cycle 10


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10
    
    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
      0    Add2    Yes    ADDD    (M-M)  M(A2)
           Add3    No
      5    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6     F8     F10    F12  ...  F30
     10     Mult1  M(A2)         Add2   (M-M)  Mult2

    Add2 (ADDD) completing; what is waiting for it?

    Tomasulo Example Cycle 11


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
      4    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     11     Mult1  M(A2)         (M-M+M) (M-M)  Mult2

    Write result of ADDD here?

    All quick instructions complete in this cycle!

    Tomasulo Example Cycle 12


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
      3    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     12     Mult1  M(A2)         (M-M+M) (M-M)  Mult2

    Tomasulo Example Cycle 13


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
      2    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     13     Mult1  M(A2)         (M-M+M) (M-M)  Mult2

    Tomasulo Example Cycle 14


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
      1    Mult1   Yes    MULTD   M(A2)  R(F4)
           Mult2   Yes    DIVD           M(A1)  Mult1

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     14     Mult1  M(A2)         (M-M+M) (M-M)  Mult2


    Tomasulo Example Cycle 16


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3     15     16
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
     40    Mult2   Yes    DIVD    M*F4   M(A1)

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     16     M*F4   M(A2)         (M-M+M) (M-M)  Mult2

    Just waiting for Mult2 (DIVD) to complete


    Faster than light computation (skip a couple of cycles)

    Tomasulo Example Cycle 55


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3     15     16
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
      1    Mult2   Yes    DIVD    M*F4   M(A1)

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     55     M*F4   M(A2)         (M-M+M) (M-M)  Mult2

    Tomasulo Example Cycle 56


    Instruction status:                       Exec   Write
    Instruction           j     k     Issue   Comp   Result
    LD     F6   34+   R2      1      3      4
    LD     F2   45+   R3      2      4      5
    MULTD  F0   F2    F4      3     15     16
    SUBD   F8   F6    F2      4      7      8
    DIVD   F10  F0    F6      5     56
    ADDD   F6   F8    F2      6     10     11

    Load buffers:  Load1 No   Load2 No   Load3 No

    Reservation Stations:               S1     S2     RS     RS
    Time   Name    Busy   Op       Vj     Vk     Qj     Qk
           Add1    No
           Add2    No
           Add3    No
           Mult1   No
      0    Mult2   Yes    DIVD    M*F4   M(A1)

    Register result status:
    Clock   F0     F2     F4     F6      F8     F10    F12  ...  F30
     56     M*F4   M(A2)         (M-M+M) (M-M)  Mult2

    Mult2 (DIVD) is completing; what is waiting for it?


    Why can Tomasulo overlap iterations of loops?


    Register renaming: multiple iterations use different physical destinations for registers (dynamic loop unrolling)

    Reservation stations: permit instruction issue to advance past integer control flow operations

        Also buffer old values of registers, totally avoiding the WAR stall

    Other perspective: Tomasulo building dataflow dependency graph on the fly
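The renaming point can be made concrete with a hypothetical sketch: every write is given a fresh tag, so two iterations that write the same architectural register never collide on its name (function and trace are invented for illustration).

```python
# Each write to an architectural register gets a new tag (as a new
# reservation station would); the register name is just an alias.
def rename_writes(trace):
    tags = {}          # architectural register -> current producing tag
    out = []
    next_tag = 0
    for dest in trace:
        tags[dest] = next_tag
        out.append((dest, next_tag))
        next_tag += 1
    return out

# Two iterations of a loop body that writes F0 then F4: the two F0
# writes map to distinct tags, so iteration 2 needs no WAW stall.
print(rename_writes(["F0", "F4", "F0", "F4"]))
```

This is the "dynamic loop unrolling" above: the hardware manufactures a fresh destination per iteration instead of the compiler duplicating the loop body.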

    Tomasulo's scheme offers 2 major advantages


    1. Distribution of the hazard detection logic: distributed reservation stations and the CDB

        If multiple instructions are waiting on a single result, & each instruction has its other operand, then instructions can be released simultaneously by broadcast on CDB

        If a centralized register file were used, the units would have to read their results from the registers when register buses are available

    2. Elimination of stalls for WAW and WAR hazards
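Advantage 1 can be sketched in a few lines: a single (tag, value) broadcast on the CDB releases every station that was only waiting on that tag. The station names and values below are made up for illustration.

```python
# Each waiting station is (tag_it_waits_on, other_operand_already_present).
def broadcast(stations, tag, value):
    released = []
    for name, (q, other_ready) in stations.items():
        if q == tag and other_ready:     # this broadcast completes its operands
            released.append(name)
    return released

waiting = {"Add1": ("Mult1", True),      # both need Mult1's result...
           "Add2": ("Mult1", True),
           "Mult2": ("Add1", False)}     # ...but Mult2 waits on something else
print(broadcast(waiting, "Mult1", 3.5))  # ['Add1', 'Add2'] released together
```

With a centralized register file instead, Add1 and Add2 would have to take turns on the register read buses.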

    Tomasulo Drawbacks


    Complexity: delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!

    Many associative stores (CDB) at high speed

    Performance limited by Common Data Bus

        Each CDB must go to multiple functional units: high capacitance, high wiring density

        Number of functional units that can complete per cycle limited to one!

        Multiple CDBs => more FU logic for parallel assoc stores

    Non-precise interrupts! We will address this later

    And In Conclusion #1

    Leverage Implicit Parallelism for Performance:


    Instruction Level Parallelism

    Loop unrolling by compiler to increase ILP

    Branch prediction to increase ILP

    Dynamic HW exploiting ILP: works when can't know dependence at compile time

    Can hide L1 cache misses

    Code for one machine runs well on another

    And In Conclusion #2

    Reservation stations: renaming to larger set of registers + buffering source operands

    Prevents registers as bottleneck

    Avoids WAR, WAW hazards

    Allows loop unrolling in HW

    Not limited to basic blocks (integer unit gets ahead, beyond branches)

    Helps cache misses as well

    Lasting Contributions:

    Dynamic scheduling

    Register renaming

    Load/store disambiguation

    360/91 descendants are Intel Pentium 4, IBM Power 5, AMD Athlon/Opteron, …