CS7103 Multi-core Architecture Cycle Test



    DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE

    PERAMBALUR 621 212

    CYCLE TEST I NOVEMBER 2014

    Part A (10 X 2 = 20)

1. Define Amdahl's Law.

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used.

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

2. What is Instruction Level Parallelism?

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.

3. Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
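A worked answer, using the standard dies-per-wafer approximation (the transcript omits the answer; the formula subtracts the dies lost around the wafer's edge):

Dies per wafer = (pi x (Wafer diameter / 2)^2) / Die area - (pi x Wafer diameter) / sqrt(2 x Die area)

For the 1.5 cm die, Die area = 2.25 cm^2:
Dies per wafer = (pi x 15^2) / 2.25 - (pi x 30) / sqrt(4.5) = 706.9 / 2.25 - 94.2 / 2.12 = 314.2 - 44.4 = 270 (rounding down)

For the 1.0 cm die, Die area = 1.0 cm^2:
Dies per wafer = 706.9 / 1.0 - 94.2 / 1.41 = 706.9 - 66.8 = 640 (rounding down)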

4. Define response time and throughput.

Response time

Also called execution time. The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.

Throughput

Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time.


5. List the various classes of computers.

o Supercomputer
o Mainframe computer
o Minicomputer
o Microcomputer

6. What are the various types of dependencies?

There are five types of data dependencies. They are as follows:

(1) Flow dependence
(2) Antidependence
(3) Output dependence
(4) I/O dependence
(5) Unknown dependence
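As a quick hypothetical illustration of the first three types (not part of the original answer):

    /* S1 */ a = b + c;   /* writes a                                    */
    /* S2 */ d = a * 2;   /* flow dependence: S2 reads a written by S1   */
    /* S3 */ b = d + 1;   /* antidependence: S3 writes b, which S1 reads */
    /* S4 */ a = d - 1;   /* output dependence: S4 rewrites a after S1   */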

7. What are the primary components of vector architecture?

o Vector registers
o Vector functional units
o Vector load-store unit
o Set of scalar registers

8. Define strip mining with an example.

When loops are shorter than the maximum vector length, vector architectures use a vector-length register to reduce the length of vector operations. When loops are larger, we add bookkeeping code to iterate full-length vector operations and to handle the leftovers. This latter process is called strip mining.
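A C sketch of strip mining, along the lines of the textbook's DAXPY example (MVL, the maximum vector length, the arrays X and Y, the scalar a, and the array size n are assumed; MVL would be 64 here):

    int low = 0;
    int VL = n % MVL;                 /* find the odd-size piece first */
    for (int j = 0; j <= n / MVL; j = j + 1) {          /* outer loop */
        for (int i = low; i < low + VL; i = i + 1)
            Y[i] = a * X[i] + Y[i];   /* main vector operation         */
        low = low + VL;               /* start of the next vector      */
        VL = MVL;                     /* reset the length to maximum   */
    }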

9. Define gather.

A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector.


10. Consider the following loop:

    for (i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

What are the dependences between S1 and S2 in the loop?

Answer

There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

These two dependences are different and have different effects. To see how they differ, let's assume that only one of these dependences exists at a time. Because the dependence of statement S1 is on an earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations of this loop to execute in series.

The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. We saw this type of dependence in an example in Section 3.2, where unrolling was able to expose the parallelism.

It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-3/section-3-2

    Part B

Answer All the Questions. (5 X 16 = 80)

11. a. Explain the concepts and challenges of Instruction Level Parallelism (ILP).

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.

There are two largely separable approaches to exploiting ILP: an approach that relies on hardware to help discover and exploit the parallelism dynamically, and an approach that relies on software technology to find parallelism statically at compile time. Processors using the dynamic, hardware-based approach, including the Intel Pentium series, dominate in the market; those using the static approach, including the Intel Itanium, have more limited uses in scientific or application-specific environments.

The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap.
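One standard way a compiler exposes this parallelism to the hardware is loop unrolling; a minimal sketch (the unroll factor of 4 is illustrative):

    /* Four independent adds per iteration can issue in parallel;
       1000 is divisible by 4, so no cleanup loop is needed. */
    for (i = 1; i <= 1000; i = i + 4) {
        x[i]   = x[i]   + y[i];
        x[i+1] = x[i+1] + y[i+1];
        x[i+2] = x[i+2] + y[i+2];
        x[i+3] = x[i+3] + y[i+3];
    }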

Data Dependences and Hazards

Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit instruction-level parallelism we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards exist).


WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence.


12. a. What is a multicore processor? Explain how a multicore processor works.

A multicore processor is a processing system composed of two or more independent cores (or CPUs). The cores are typically integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package.

A multicore processor implements multiprocessing in a single physical package. Cores in a multicore device may be coupled together tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar.

All cores are identical in symmetric multicore systems, and they are not identical in asymmetric multicore systems. Just as with single-processor systems, cores in multicore systems may implement architectures such as superscalar, vector processing, or multithreading.

In this design, each core has its own execution pipeline, and each core has the resources required to run without blocking resources needed by the other software threads.
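As a minimal sketch of this idea (hypothetical C code, not from the original answer), two software threads can run on separate cores and communicate their results through shared memory:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static long data[N];
    static long partial[2];   /* results shared between the two threads */

    static void *worker(void *arg) {
        int id = *(int *)arg;   /* 0 or 1 */
        long sum = 0;
        for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            sum += data[i];
        partial[id] = sum;      /* communicated through shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (long i = 0; i < N; i++) data[i] = i % 10;
        for (int k = 0; k < 2; k++)
            pthread_create(&t[k], NULL, worker, &ids[k]);  /* OS may place each thread on its own core */
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);
        printf("total = %ld\n", partial[0] + partial[1]);
        return 0;
    }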

While the example in Figure 2 shows a two-core design, there is no inherent limitation in the number of cores that can be placed on a single chip. Intel has committed to shipping dual-core processors in 2005, but it will add additional cores in the future. Mainframe processors today use more than two cores, so there is precedent for this kind of development.

The multicore design enables two or more cores to run at somewhat slower speeds and at much lower temperatures. The combined throughput of these cores delivers processing power greater than the maximum available today on single-core processors and at a much lower level of power consumption. In this way, Intel increases the capabilities of server platforms as predicted by Moore's Law while the technology no longer pushes the outer limits of physical constraints.


12. b. Discuss Amdahl's Law and how processor speedup is calculated. Explain with an example.

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used. Speedup is the ratio

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

Alternatively,

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible

Speedup tells us how much faster a task will run using the computer with the enhancement as opposed to the original computer.

Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup_enhanced.

The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
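Putting the two example figures above together as a worked illustration: with Fraction_enhanced = 20/60 = 1/3 and Speedup_enhanced = 5/2 = 2.5,

Speedup_overall = 1 / ((1 - 1/3) + (1/3) / 2.5) = 1 / (0.667 + 0.133) = 1 / 0.8 = 1.25

so the whole program runs 1.25 times faster even though the enhanced portion runs 2.5 times faster.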


13. a. Explain trends in power, energy, cost and technology in integrated circuits with an example.

Energy and Power within a Microprocessor

For CMOS chips, the traditional primary energy consumption has been in switching transistors, also called dynamic energy. The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:

Energy_dynamic ∝ Capacitive load × Voltage^2

This equation is the energy of a pulse of the logic transition of 0→1→0 or 1→0→1. The energy of a single transition (0→1 or 1→0) is then:

Energy_dynamic ∝ 1/2 × Capacitive load × Voltage^2

The power required per transistor is just the product of the energy of a transition multiplied by the frequency of transitions:

Power_dynamic ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched

For a fixed task, slowing the clock rate reduces power, but not energy. Clearly, dynamic power and energy are greatly reduced by lowering the voltage, so voltages have dropped from 5 V to just under 1 V in 20 years. The capacitive load is a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.

Example: Some microprocessors today are designed to have adjustable voltage, so a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic energy and on dynamic power?

Answer: Since the capacitance is unchanged, the answer for energy is the ratio of the voltages squared:

Energy_new / Energy_old = (Voltage × 0.85)^2 / Voltage^2 = 0.85^2 = 0.72

thereby reducing energy to about 72% of the original. For power, we also multiply by the ratio of the frequencies:

Power_new / Power_old = 0.72 × (Frequency switched × 0.85) / Frequency switched = 0.61

thereby reducing power to about 61% of the original.


Integrated circuit costs are becoming a greater portion of the cost that varies between computers, especially in the high-volume, cost-sensitive portion of the market. Indeed, with personal mobile devices' increasing reliance on whole systems on a chip (SOC), the cost of the integrated circuits is much of the cost of the PMD. Thus, computer designers must understand the costs of chips to understand the costs of current computers. Although the costs of integrated circuits have dropped exponentially, the basic process of silicon manufacture is unchanged: a wafer is still tested and chopped into dies that are packaged. Thus, the cost of a packaged integrated circuit is

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield

In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end.


S2: Move R1, R3

I/O dependence:

Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.

Unknown dependence:

The dependence relation between two statements cannot be determined in the following situations:

o The subscript of a variable is itself subscripted.
o The subscript does not contain the loop index variable.
o A variable appears more than once, with subscripts having different coefficients of the loop variable.
o The subscript is nonlinear in the loop index variable.
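For instance, the first condition, a subscript that is itself subscripted (indirect addressing), can look like this hypothetical fragment:

    /* The index vector K is known only at run time, so the compiler
       cannot tell whether two iterations touch the same element of A. */
    for (i = 0; i < n; i = i + 1)
        A[K[i]] = A[K[i]] + C[i];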

When one or more of these conditions exists, a conservative assumption is to claim unknown dependence among the statements involved.

ii. Find all the true dependences, output dependences and antidependences, and eliminate the output dependences and antidependences by renaming.

    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }

Answer

The following dependences exist among the four statements:

1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are not loop carried, so they do not prevent the loop from being considered parallel. These dependences will force S3 and S4 to wait for S1 to complete.

2. There is an antidependence from S1 to S2, based on X[i].

3. There is an antidependence from S3 to S4 for Y[i].

4. There is an output dependence from S1 to S4, based on Y[i].

The following version of the loop eliminates these false (or pseudo) dependences:
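    /* Reconstructed from the textbook example this answer follows:
       Y is renamed T and X is renamed X1 to remove the false dependences. */
    for (i = 0; i < 100; i = i + 1) {
        T[i]  = X[i] / c;   /* Y renamed to T to remove output dependence */
        X1[i] = X[i] + c;   /* X renamed to X1 to remove antidependence   */
        Z[i]  = T[i] + c;   /* Y renamed to T to remove antidependence    */
        Y[i]  = c - T[i];
    }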


After the loop, the variable X has been renamed X1. In code that follows the loop, the compiler can simply replace the name X by X1. In this case, renaming does not require an actual copy operation but can be done by substituting names or by register allocation. In other cases, however, renaming will require copying.

14. a. Explain vector architecture with a neat diagram and give a suitable example.

We begin with a vector processor consisting of the primary components that the figure shows. This processor, which is loosely based on the Cray-1, is the foundation for discussion throughout this section. We will call this instruction set architecture VMIPS; its scalar portion is MIPS, and its vector portion is the logical vector extension of MIPS. The rest of this subsection examines how the basic architecture of VMIPS relates to other processors.


The primary components of the instruction set architecture of VMIPS are the following:

Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The vector register file needs to provide enough ports to feed all the vector functional units. These ports will allow a high degree of overlap among vector operations to different vector registers. The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbar switches.

    Vt/r !-#t"/#a' -#"t)KHach unit is full! pipelined, and it can start a new operation on

    ever! cloc& c!cle" A control unit is needed to detect ha
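A suitable example is the DAXPY loop (Y = a * X + Y), shown here in C; a vector processor such as VMIPS can execute it as a handful of vector instructions rather than 64 iterations of scalar code (the length 64 matches the VMIPS vector registers):

    /* Every element operation is independent, so the whole loop maps
       onto vector load, multiply, add, and store instructions. */
    for (i = 0; i < 64; i = i + 1)
        Y[i] = a * X[i] + Y[i];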


... their parallel algorithms. With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower and, through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:

Third Generation Streaming Multiprocessor (SM)

o 32 CUDA cores per SM, 4x over GT200
o 8x the peak double precision floating point performance over GT200
o Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache

Second Generation Parallel Thread Execution ISA

o Unified Address Space with Full C++ Support
o Optimized for OpenCL and DirectCompute
o Full IEEE 754-2008 32-bit and 64-bit precision
o Improved Performance through Predication

Improved Memory Subsystem

o NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
o First GPU with ECC memory support
o Greatly improved atomic memory operation performance

NVIDIA GigaThread Engine

o 10x faster application context switching
o Concurrent kernel execution
o Out of Order thread block execution
o Dual overlapped memory transfer engines

ii. How are multiple lanes used to process more than one element per clock cycle? Explain how loops not equal to 64 in length are handled.

Beyond One Element per Clock Cycle

A critical advantage of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. A single vector instruction can include scores of independent operations yet be encoded in the same number of bits as a conventional scalar instruction. The parallel semantics of a vector instruction allow an implementation to execute these elemental operations using a deeply pipelined functional unit, as in the VMIPS implementation we have studied so far; an array of parallel functional units; or a combination of parallel and pipelined functional units. Figure 4.4 illustrates how to improve vector performance by using parallel pipelines to execute a vector add instruction.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-4/figure-4-4

    Ha#%'"#8 L//) N/t E-a' t/ 64

A vector register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:

    for (i = 0; i < n; i = i + 1)
        Y[i] = a * X[i] + Y[i];

The size of all the vector operations depends on n, which may not even be known until run time; the solution is a vector-length register, combined with the strip mining described earlier for loops longer than the maximum vector length.


15. a. i. Explain in detail about Graphics Processing Unit.

A GPU contains hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program. Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs.

Programming the GPU

CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.

The GPU hardware handles parallel execution and thread management; it is not done by applications or by the operating system. To simplify scheduling by the hardware, CUDA requires that thread blocks be able to execute independently and in any order. Different thread blocks cannot communicate directly, although they can coordinate using atomic memory operations in Global Memory.

As we shall soon see, many GPU hardware concepts are not obvious in CUDA. That is a good thing from a programmer productivity perspective, but most programmers are using GPUs instead of CPUs to get performance. Performance programmers must keep the GPU hardware in mind when writing in CUDA. For reasons explained shortly, they know that they need to keep groups of 32 threads together in control flow to get the best performance from multithreaded SIMD Processors, and create many more threads per multithreaded SIMD Processor to hide latency to DRAM. They also need to keep the data addresses localized in one or a few blocks of memory to get the expected memory performance.
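A short CUDA sketch in the spirit of this section, following the textbook's DAXPY kernel (the block size of 256, a multiple of the 32-thread warp discussed above, is illustrative):

    // Each of the n threads computes one element of Y = a*X + Y.
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Host code: launch enough 256-thread blocks to cover n elements.
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);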


15. b. i. How will you detect and enhance loop level parallelism?

Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Consider a loop like the following:

    for (i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

What are the dependences between S1 and S2 in the loop?

Answer

There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-3/section-3-2

These two dependences are different and have different effects. To see how they differ, let's assume that only one of these dependences exists at a time. Because the dependence of statement S1 is on an earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations of this loop to execute in series.

The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. We saw this type of dependence in an example in Section 3.2, where unrolling was able to expose the parallelism.

It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.
