CS7103 Multi-core Architecture Cycle Test



    DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE

    PERAMBALUR 621 212

    CYCLE TEST I NOVEMBER 2014

    Part A (10 X 2 = 20)

1. Define Amdahl's Law.

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used.

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

2. What is Instruction Level Parallelism?

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.

3. Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
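A worked answer, using the standard dies-per-wafer approximation (the transcript omits the answer; the formula subtracts the dies lost around the wafer's edge):

Dies per wafer = (pi x (Wafer diameter / 2)^2) / Die area - (pi x Wafer diameter) / sqrt(2 x Die area)

For the 1.5 cm die, Die area = 2.25 cm^2:
Dies per wafer = (pi x 15^2) / 2.25 - (pi x 30) / sqrt(4.5) = 706.9 / 2.25 - 94.2 / 2.12 = 314.2 - 44.4 = 270 (rounding down)

For the 1.0 cm die, Die area = 1.0 cm^2:
Dies per wafer = 706.9 / 1.0 - 94.2 / 1.41 = 706.9 - 66.8 = 640 (rounding down)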

4. Define response time and throughput.

Response time

Also called execution time. The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.

Throughput

Also called bandwidth. Another measure of performance, it is the number of tasks completed per unit time.


5. List the various classes of computers.

o Supercomputer
o Mainframe computer
o Minicomputer
o Microcomputer

6. What are the various types of dependencies?

There are five types of data dependencies. They are as follows:

(1) Flow dependence
(2) Antidependence
(3) Output dependence
(4) I/O dependence
(5) Unknown dependence
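As a quick hypothetical illustration of the first three types (not part of the original answer):

    /* S1 */ a = b + c;   /* writes a                                    */
    /* S2 */ d = a * 2;   /* flow dependence: S2 reads a written by S1   */
    /* S3 */ b = d + 1;   /* antidependence: S3 writes b, which S1 reads */
    /* S4 */ a = d - 1;   /* output dependence: S4 rewrites a after S1   */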

7. What are the primary components of vector architecture?

o Vector registers
o Vector functional units
o Vector load-store unit
o Set of scalar registers

8. Define strip mining with an example.

When loops are shorter than the maximum vector length, vector architectures use a vector-length register to reduce the length of vector operations. When loops are larger, we add bookkeeping code to iterate full-length vector operations and to handle the leftovers. This latter process is called strip mining.
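A C sketch of strip mining, along the lines of the textbook's DAXPY example (MVL, the maximum vector length, the arrays X and Y, the scalar a, and the array size n are assumed; MVL would be 64 here):

    int low = 0;
    int VL = n % MVL;                 /* find the odd-size piece first */
    for (int j = 0; j <= n / MVL; j = j + 1) {          /* outer loop */
        for (int i = low; i < low + VL; i = i + 1)
            Y[i] = a * X[i] + Y[i];   /* main vector operation         */
        low = low + VL;               /* start of the next vector      */
        VL = MVL;                     /* reset the length to maximum   */
    }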

9. Define gather.

A gather operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets given in the index vector.


10. Consider the following loop:

    for (i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

What are the dependences between S1 and S2 in the loop?

Answer

There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

These two dependences are different and have different effects. To see how they differ, let's assume that only one of these dependences exists at a time. Because the dependence of statement S1 is on an earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations of this loop to execute in series.

The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. We saw this type of dependence in an example in Section 3.2, where unrolling was able to expose the parallelism.

It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-3/section-3-2

    Part B

Answer All the Questions. (5 X 16 = 80)

11. a. Explain the concepts and challenges of Instruction Level Parallelism (ILP).

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.

There are two largely separable approaches to exploiting ILP: an approach that relies on hardware to help discover and exploit the parallelism dynamically, and an approach that relies on software technology to find parallelism statically at compile time. Processors using the dynamic, hardware-based approach, including the Intel Pentium series, dominate in the market; those using the static approach, including the Intel Itanium, have more limited uses in scientific or application-specific environments.

The simplest and most common way to increase the ILP is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap.
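One standard way a compiler exposes this parallelism to the hardware is loop unrolling; a minimal sketch (the unroll factor of 4 is illustrative):

    /* Four independent adds per iteration can issue in parallel;
       1000 is divisible by 4, so no cleanup loop is needed. */
    for (i = 1; i <= 1000; i = i + 4) {
        x[i]   = x[i]   + y[i];
        x[i+1] = x[i+1] + y[i+1];
        x[i+2] = x[i+2] + y[i+2];
        x[i+3] = x[i+3] + y[i+3];
    }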

Data Dependences and Hazards

Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit instruction-level parallelism we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards exist).


WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence.


12. a. What is a multicore processor? Explain how a multicore processor works.

A multicore processor is a processing system composed of two or more independent cores (or CPUs). The cores are typically integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or they may be integrated onto multiple dies in a single chip package.

A multicore processor implements multiprocessing in a single physical package. Cores in a multicore device may be coupled together tightly or loosely. For example, cores may or may not share caches, and they may implement message passing or shared memory inter-core communication methods. Common network topologies to interconnect cores include bus, ring, two-dimensional mesh, and crossbar.

All cores are identical in symmetric multicore systems, and they are not identical in asymmetric multicore systems. Just as with single-processor systems, cores in multicore systems may implement architectures such as superscalar, vector processing, or multithreading.

In this design, each core has its own execution pipeline, and each core has the resources required to run without blocking resources needed by the other software threads.
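As a minimal sketch of this idea (hypothetical C code, not from the original answer), two software threads can run on separate cores and communicate their results through shared memory:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    static long data[N];
    static long partial[2];   /* results shared between the two threads */

    static void *worker(void *arg) {
        int id = *(int *)arg;   /* 0 or 1 */
        long sum = 0;
        for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            sum += data[i];
        partial[id] = sum;      /* communicated through shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (long i = 0; i < N; i++) data[i] = i % 10;
        for (int k = 0; k < 2; k++)
            pthread_create(&t[k], NULL, worker, &ids[k]);  /* OS may place each thread on its own core */
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);
        printf("total = %ld\n", partial[0] + partial[1]);
        return 0;
    }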

While the example in Figure 2 shows a two-core design, there is no inherent limitation in the number of cores that can be placed on a single chip. Intel has committed to shipping dual-core processors in 2005, but it will add additional cores in the future. Mainframe processors today use more than two cores, so there is precedent for this kind of development.

The multicore design enables two or more cores to run at somewhat slower speeds and at much lower temperatures. The combined throughput of these cores delivers processing power greater than the maximum available today on single-core processors and at a much lower level of power consumption. In this way, Intel increases the capabilities of server platforms as predicted by Moore's Law while the technology no longer pushes the outer limits of physical constraints.


12. b. Discuss Amdahl's Law and how processor speedup is calculated. Explain with an example.

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used. Speedup is the ratio

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

Alternatively,

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible

Speedup tells us how much faster a task will run using the computer with the enhancement as opposed to the original computer.

Amdahl's Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement. For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this value, which is always greater than 1, Speedup_enhanced.

The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
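Putting the two example figures above together as a worked illustration: with Fraction_enhanced = 20/60 = 1/3 and Speedup_enhanced = 5/2 = 2.5,

Speedup_overall = 1 / ((1 - 1/3) + (1/3) / 2.5) = 1 / (0.667 + 0.133) = 1 / 0.8 = 1.25

so the whole program runs 1.25 times faster even though the enhanced portion runs 2.5 times faster.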


13. a. Explain trends in power, energy, cost and technology in integrated circuits with an example.

Energy and Power within a Microprocessor

For CMOS chips, the traditional primary energy consumption has been in switching transistors, also called dynamic energy. The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:

Energy_dynamic ∝ Capacitive load × Voltage^2

This equation is the energy of a pulse of the logic transition of 0→1→0 or 1→0→1. The energy of a single transition (0→1 or 1→0) is then:

Energy_dynamic ∝ 1/2 × Capacitive load × Voltage^2

The power required per transistor is just the product of the energy of a transition multiplied by the frequency of transitions:

Power_dynamic ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched

For a fixed task, slowing the clock rate reduces power, but not energy. Clearly, dynamic power and energy are greatly reduced by lowering the voltage, so voltages have dropped from 5 V to just under 1 V in 20 years. The capacitive load is a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.

Example: Some microprocessors today are designed to have adjustable voltage, so a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic energy and on dynamic power?

Answer: Since the capacitance is unchanged, the answer for energy is the ratio of the voltages squared:

Energy_new / Energy_old = (Voltage × 0.85)^2 / Voltage^2 = 0.85^2 = 0.72

thereby reducing energy to about 72% of the original. For power, we also multiply by the ratio of the frequencies:

Power_new / Power_old = 0.72 × (Frequency switched × 0.85) / Frequency switched = 0.61

thereby reducing power to about 61% of the original.


Integrated circuit costs are becoming a greater portion of the cost that varies between computers, especially in the high-volume, cost-sensitive portion of the market. Indeed, with personal mobile devices' increasing reliance on whole systems on a chip (SOC), the cost of the integrated circuits is much of the cost of the PMD. Thus, computer designers must understand the costs of chips to understand the costs of current computers. Although the costs of integrated circuits have dropped exponentially, the basic process of silicon manufacture is unchanged: a wafer is still tested and chopped into dies that are packaged. Thus, the cost of a packaged integrated circuit is

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield

In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end.


S2: Move R1, R3

I/O dependence:

Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.

Unknown dependence:

The dependence relation between two statements cannot be determined in the following situations:

o The subscript of a variable is itself subscripted.
o The subscript does not contain the loop index variable.
o A variable appears more than once, with subscripts having different coefficients of the loop variable.
o The subscript is nonlinear in the loop index variable.
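For instance, the first condition, a subscript that is itself subscripted (indirect addressing), can look like this hypothetical fragment:

    /* The index vector K is known only at run time, so the compiler
       cannot tell whether two iterations touch the same element of A. */
    for (i = 0; i < n; i = i + 1)
        A[K[i]] = A[K[i]] + C[i];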

When one or more of these conditions exists, a conservative assumption is to claim unknown dependence among the statements involved.

ii. Find all the true dependences, output dependences and antidependences, and eliminate the output dependences and antidependences by renaming.

    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }

Answer

The following dependences exist among the four statements:

1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are not loop carried, so they do not prevent the loop from being considered parallel. These dependences will force S3 and S4 to wait for S1 to complete.

2. There is an antidependence from S1 to S2, based on X[i].

3. There is an antidependence from S3 to S4 for Y[i].

4. There is an output dependence from S1 to S4, based on Y[i].

The following version of the loop eliminates these false (or pseudo) dependences:
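    /* Reconstructed from the textbook example this answer follows:
       Y is renamed T and X is renamed X1 to remove the false dependences. */
    for (i = 0; i < 100; i = i + 1) {
        T[i]  = X[i] / c;   /* Y renamed to T to remove output dependence */
        X1[i] = X[i] + c;   /* X renamed to X1 to remove antidependence   */
        Z[i]  = T[i] + c;   /* Y renamed to T to remove antidependence    */
        Y[i]  = c - T[i];
    }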


After the loop, the variable X has been renamed X1. In code that follows the loop, the compiler can simply replace the name X by X1. In this case, renaming does not require an actual copy operation but can be done by substituting names or by register allocation. In other cases, however, renaming will require copying.

14. a. Explain vector architecture with a neat diagram and give a suitable example.

We begin with a vector processor consisting of the primary components that the figure shows. This processor, which is loosely based on the Cray-1, is the foundation for discussion throughout this section. We will call this instruction set architecture VMIPS; its scalar portion is MIPS, and its vector portion is the logical vector extension of MIPS. The rest of this subsection examines how the basic architecture of VMIPS relates to other processors.


The primary components of the instruction set architecture of VMIPS are the following:

Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The vector register file needs to provide enough ports to feed all the vector functional units. These ports will allow a high degree of overlap among vector operations to different vector registers. The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbar switches.

    Vt/r !-#t"/#a' -#"t)KHach unit is full! pipelined, and it can start a new operation on

    ever! cloc& c!cle" A control unit is needed to detect ha
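A suitable example is the DAXPY loop (Y = a * X + Y), shown here in C; a vector processor such as VMIPS can execute it as a handful of vector instructions rather than 64 iterations of scalar code (the length 64 matches the VMIPS vector registers):

    /* Every element operation is independent, so the whole loop maps
       onto vector load, multiply, add, and store instructions. */
    for (i = 0; i < 64; i = i + 1)
        Y[i] = a * X[i] + Y[i];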


... their parallel algorithms. With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower and, through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:

Third Generation Streaming Multiprocessor (SM)

o 32 CUDA cores per SM, 4x over GT200
o 8x the peak double precision floating point performance over GT200
o Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache

Second Generation Parallel Thread Execution ISA

o Unified Address Space with Full C++ Support
o Optimized for OpenCL and DirectCompute
o Full IEEE 754-2008 32-bit and 64-bit precision
o Improved Performance through Predication

Improved Memory Subsystem

o NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
o First GPU with ECC memory support
o Greatly improved atomic memory operation performance

NVIDIA GigaThread Engine

o 10x faster application context switching
o Concurrent kernel execution
o Out of Order thread block execution
o Dual overlapped memory transfer engines

ii. How are multiple lanes used to process more than one element per clock cycle? Explain how loops not equal to 64 in length are handled.

Beyond One Element per Clock Cycle

A critical advantage of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. A single vector instruction can include scores of independent operations yet be encoded in the same number of bits as a conventional scalar instruction. The parallel semantics of a vector instruction allow an implementation to execute these elemental operations using a deeply pipelined functional unit, as in the VMIPS implementation we have studied so far; an array of parallel functional units; or a combination of parallel and pipelined functional units. Figure 4.4 illustrates how to improve vector performance by using parallel pipelines to execute a vector add instruction.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-4/figure-4-4

    Ha#%'"#8 L//) N/t E-a' t/ 64

A vector register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:

    for (i = 0; i < n; i = i + 1)
        Y[i] = a * X[i] + Y[i];

The size of all the vector operations depends on n, which may not even be known until run time; the solution is a vector-length register, combined with the strip mining described earlier for loops longer than the maximum vector length.


15. a. i. Explain in detail about Graphics Processing Unit.

A GPU contains hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program. Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs.

Programming the GPU

CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter. It seems to resonate in some way with the way we think and code, allowing an easier, more natural expression of parallelism beyond the task level.

The GPU hardware handles parallel execution and thread management; it is not done by applications or by the operating system. To simplify scheduling by the hardware, CUDA requires that thread blocks be able to execute independently and in any order. Different thread blocks cannot communicate directly, although they can coordinate using atomic memory operations in Global Memory.

As we shall soon see, many GPU hardware concepts are not obvious in CUDA. That is a good thing from a programmer productivity perspective, but most programmers are using GPUs instead of CPUs to get performance. Performance programmers must keep the GPU hardware in mind when writing in CUDA. For reasons explained shortly, they know that they need to keep groups of 32 threads together in control flow to get the best performance from multithreaded SIMD Processors, and create many more threads per multithreaded SIMD Processor to hide latency to DRAM. They also need to keep the data addresses localized in one or a few blocks of memory to get the expected memory performance.
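A short CUDA sketch in the spirit of this section, following the textbook's DAXPY kernel (the block size of 256, a multiple of the 32-thread warp discussed above, is illustrative):

    // Each of the n threads computes one element of Y = a*X + Y.
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Host code: launch enough 256-thread blocks to cover n elements.
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);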


15. b. i. How will you detect and enhance loop level parallelism?

Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Consider a loop like the following:

    for (i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

What are the dependences between S1 and S2 in the loop?

Answer

There are two different dependences:

1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

2. S2 uses the value, A[i+1], computed by S1 in the same iteration.

https://www.inkling.com/read/computer-architecture-hennessy-5th/chapter-3/section-3-2

These two dependences are different and have different effects. To see how they differ, let's assume that only one of these dependences exists at a time. Because the dependence of statement S1 is on an earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations of this loop to execute in series.

The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements in an iteration were kept in order. We saw this type of dependence in an example in Section 3.2, where unrolling was able to expose the parallelism.

It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next example shows.
