SIMD Computer Organizations



    Con"i$uration is structured 'ith N synchronied Ps,all o" 'hich are under the controlo" one CU.ach Pi is essentially an !/U 'ith attached 'or(in$ re$isters and localmemory PMi"or the stora$e o" distri&uted data.The CU also has its o'n main memory

    "or stora$e o" pro$rams.The "unction o" CU is to decode all instruction and determine'here the decoded instructions should &e executed.Scalar or control type instructions aredirectly executed inside the CU.Vector instructions are &roadcasted to the Ps "ordistri&uted execution.

    Con"i$uration II di""ers "rom con"i$uration I in t'o aspects.0irst the local memriesattached to the Ps are replaced &y parallel memory modules shared &y all the Ps

  • 8/12/2019 SIMD Computer organizations

    3/20

    throu$h an alio$nment net'or(.Second the inter P permutation net'or( is replaced &yinter P memory ali$nment net'or(.! $ood example o" con"i$uration II SIMD machineis 1urrou$hs scienti"ic processor.

Formally, an SIMD computer is characterized by the following set of parameters:

C = &lt;N, F, I, M&gt;

    N2Num&er o" Ps in the system.

    02! set o" data routin$ "unctions.

    I2Set o" machine instructions

    M2Set o" mas(in$ schemes.

    Masking and data routing mechanisms

In an array processor, vector operands can be specified by the registers to be used or by the memory addresses to be referenced. For memory-reference instructions, each PEi accesses its local PEMi, offset by its own index register Ii. The Ii register modifies the global memory address broadcast from the CU. Thus, different locations in different PEMs can be accessed simultaneously with the same global address specified by the CU. The following example shows how indexing can be used to address the local memories in parallel at different local addresses.

Example 5.1 Consider an array of n x n data elements: A = {A(i, j), 0 ≤ i, j ≤ n − 1}. Elements in the jth column of A are stored in n consecutive locations of PEMj, say from location α to location α + n − 1 (assume n ≤ N). If the programmer wishes to access the principal diagonal elements A(j, j) for j = 0, 1, ..., n − 1 of the array A, the local index registers must be set to Ij = j for j = 0, 1, ..., n − 1 in order to ensure the parallel access of the entire diagonal in one memory cycle. To access an entire row of A in parallel instead, all the local index registers would be set to the same value.
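To make the indexing mechanism concrete, here is a small Python sketch of Example 5.1; the PEM dictionaries, the base address alpha, and the index-register list I are illustrative stand-ins for the hardware, not features of any particular machine.

# Sketch of Example 5.1: each PEM_j holds column j of an n x n array A
# starting at local address alpha; setting the local index register I_j = j
# lets a single broadcast address fetch the whole principal diagonal.
n = 4
alpha = 100                                   # assumed base address of each column

# PEM[j][addr] models the local memory of PE_j (column j stored contiguously).
PEM = [{alpha + i: f"A({i},{j})" for i in range(n)} for j in range(n)]

I = [j for j in range(n)]                     # local index registers, I_j = j
broadcast_addr = alpha                        # single global address from the CU

# One "memory cycle": every PE_j reads PEM_j[broadcast_addr + I_j] in parallel.
diagonal = [PEM[j][broadcast_addr + I[j]] for j in range(n)]
print(diagonal)                               # ['A(0,0)', 'A(1,1)', 'A(2,2)', 'A(3,3)']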

    xample 6.> To illustrate the necessity o" data routin$ in an array processor, 'e sho'the execution details o" the "ollo'in$ ector instruction in an array o"N Ps. The sumS(k) o" the "irst k components in a ectorA is desired "or each ( "rom : to n - I. /etA 2

    (Ao,A1.. ,An-I)=e need to compute the "ollo'in$ n summations7S(k) 2 iAi "or k 2 :, , . . ., n -

    These n ector summations can &e computed recursiely &y $oin$ throu$h the ? "ollo'in$n - iterations de"ined &y7 .

    S8@) 2Ao

  • 8/12/2019 SIMD Computer organizations

    4/20

    S(k) 2 S(k - ) ;Ak "or k 2 ,>, . . ., n - 86.A)

    The a&oe recursie summations "or the case o" n 2 B are implemented in an arrayprocessor 'ithN 2 B Ps in lo$>n 2 steps.. 1oth data routin$ and P mas(in$ are usedin the implementation. Initially, eachAi, residin$ in PMi, is moed to theRi re$ister in

    Pi "or i 2 :, . . . n - (n 2N 2 B is assumed here). In the "irst step,Ai is routed "romRjtoRj+ and added toAi+ 'ith the resultin$ sumAi ;Ai+ in Ri+ l "or i 2 :, , . . . ,. The arro's in 0i$ure 6. sho' the routin$ operations and the shorthand notation i - 5 isused to re"er to the intermediate sum !i;Ai+ ; .. . ;Aj. In step >, the intermediate sumsinRi are routed to Ri+2 "or i 2 : to 6. In the "inal step, the intermediate sums inRi arerouted to Ri+4 "or i 2 : to . Conse#uently, P. has the "inal alue o" S(k) "or k 2:,,>, ...,E.

    !s "ar as the data-routin$ operations are concerned, PE is not inoled 8receiin$ &utnot transmittin$) in step . PE and P are not inoled in step >. !lso PE, P, P6,and PA are not inoled in step . These un- 'anted Ps are mas(ed o"" durin$ thecorrespondin$ steps. Durin$ the addition operations, P: is disa&led in step F P: andPl are made inactie in step >F and P:, Pl, P>, and P are mas(ed o"" in step .

    The Ps that are mas(ed o"" in each step depend on the operation 8data-routin$ or arith-metic-addition) to &e per"ormed. There"ore, the mas(in$ patterns (eep chan$in$ in thedi""erent operation cycles, as demonstrated &y the example. Note that the mas(in$ androutin$ operations 'ill &e much more complicated 'hen the ector len$th n 4N.
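A short Python sketch of Example 5.2 may help; it mimics the three route-and-add steps for n = N = 8, masking off the PEs whose additions are disabled in each step. The list R stands in for the per-PE registers Ri, and the sample vector A is arbitrary.

# Sketch of Example 5.2: computing S(k) = A0 + ... + Ak on N = 8 PEs
# in log2(n) = 3 route-and-add steps, with masking applied to the additions.
n = 8
A = [1, 2, 3, 4, 5, 6, 7, 8]          # arbitrary sample data
R = A[:]                              # R[i] models register Ri of PEi

for dist in (1, 2, 4):                # routing distances of steps 1, 2, 3
    routed = [None] * n
    for i in range(n - dist):         # PEi routes Ri to PE(i+dist)
        routed[i + dist] = R[i]
    for i in range(n):
        if routed[i] is not None:     # PEs 0..dist-1 are masked off for the add
            R[i] += routed[i]

print(R)                              # [1, 3, 6, 10, 15, 21, 28, 36]
assert R == [sum(A[:k + 1]) for k in range(n)]   # PEk holds S(k)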

Array processors are special-purpose computers for limited scientific applications. The array of PEs consists of passive arithmetic units waiting to be called for parallel-computation duties. The permutation network among the PEs is under program control from the CU. However, the principles of PE masking, global versus local indexing, and data permutation do not change much across the different machines.


    Inter PE communications

    There are "undamental decisions in desi$nin$ appropriate architecture o" aninterconnection net'or( "or an SIMD machine.The decisions are made &et'eenoperation modes,control strate$ies,s'itchin$ methodolo$ies,and net'or( topolo$ies.

    Operation Mode:

    The types o" communication can &e identi"ied 7Synchronous and asunchronous.

Control strategy:

The control-setting functions can be managed by a centralized controller or by the individual switching elements. The latter strategy is called distributed control, and the former corresponds to centralized control.

    Switching Methodology:

The two major switching methodologies are circuit switching and packet switching.

Network topology:

The topologies can be grouped into two categories: static and dynamic. In a static topology, the dedicated buses cannot be reconfigured, but the links in the dynamic category can be reconfigured.

    SIMD Interconnection Networks

Various interconnection networks have been suggested for SIMD computers. The classification includes static versus dynamic networks, the mesh-connected Illiac network, cube interconnection networks, the barrel shifter and data manipulator, and shuffle-exchange and omega networks. Here we will discuss the first three.

Static versus dynamic networks

The topological structure of an array processor is mainly characterized by the data-routing network used to interconnect the processing elements. Such a network can be specified by a set of data-routing functions.


    Static networks

Topologies in a static network can be classified according to the dimensions required for layout. Examples of one-dimensional topologies include the linear array. Two-dimensional topologies include the ring, star, tree, mesh, and systolic array. Three-dimensional topologies include the completely connected, chordal ring, cube, and cube-connected-cycles networks.

    Dynamic networks

    T'o classes o" dynamic net'or(s are there. Sin$le sta$e ersus multista$e.


    Single stage networks

A single stage network is a switching network with N input selectors (IS) and N output selectors (OS). Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1 ≤ D ≤ N and 1 ≤ M ≤ N. A single stage network with D = M = N is a crossbar switching network. To establish a desired connecting path, different path control signals are applied to all IS and OS selectors.

A single stage network is also called a recirculating network. Data items may have to recirculate through the single stage several times before reaching their final destination. The number of recirculations needed depends on the connectivity in the single stage network. In general, the higher the hardware connectivity, the smaller the number of recirculations.

    Multi stage networks

    Many sta$es o" an interconnected s'itch "orm the multista$e net'or(.They are descri&ed&y three characteriin$ "eatures 7s'itch &ox,net'or( topolo$y and control structure.Manys'itch &oxes are used in multista$e net'or(sach &ox is essentially an interchan$e

    deice 'ith t'o inputs and outputs.The "our states o" a s'itch &ox are7strai$ht,exchan$e,upper &roadcast and lo'er &roadcast.
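A minimal sketch of the four switch-box settings (the state names follow the text; the function itself is only an illustration):

# Sketch of a 2x2 interchange box with its four control states.
def switch_box(a, b, state):
    if state == "straight":
        return a, b          # upper -> upper, lower -> lower
    if state == "exchange":
        return b, a          # inputs crossed
    if state == "upper_broadcast":
        return a, a          # upper input sent to both outputs
    if state == "lower_broadcast":
        return b, b          # lower input sent to both outputs
    raise ValueError(state)

for s in ("straight", "exchange", "upper_broadcast", "lower_broadcast"):
    print(s, switch_box("x", "y", s))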

Mesh-Connected Illiac Network

A single stage recirculating network has been implemented in the Illiac IV array processor with 64 PEs. Each PEi is allowed to send data to PEi+1, PEi−1, PEi+r, and PEi−r, where r = √N.

Formally, the Illiac network is characterized by the following four routing functions:

R+1(i) = (i + 1) mod N
R−1(i) = (i − 1) mod N
R+r(i) = (i + r) mod N
R−r(i) = (i − r) mod N


A reduced Illiac network is illustrated in the figure with N = 16 and r = 4.

R+1 = (0 1 2 ... N−1)
R−1 = (N−1 ... 2 1 0)
R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
R−4 = (12 8 4 0)(13 9 5 1)(14 10 6 2)(15 11 7 3)

The figure shows that four PEs can be reached from any PE in one step, seven PEs in two steps, and eleven PEs in three steps. In general, it takes I steps to route data from PEi to any other PEj in an Illiac network of size N, where I is upper bounded by I ≤ √N − 1.


PARALLEL ALGORITHMS FOR ARRAY PROCESSORS

The original motivation for developing SIMD array processors was to perform parallel computations on vector or matrix types of data. Parallel processing algorithms have been developed by many computer scientists for SIMD computers. Important SIMD algorithms can be used to perform matrix multiplication, fast Fourier transform (FFT), matrix transposition, summation of vector elements, matrix inversion, parallel sorting, linear recurrence, boolean matrix operations, and to solve partial differential equations. We study below several representative SIMD algorithms for matrix multiplication, parallel sorting, and parallel FFT. We shall analyze the speedups of these parallel algorithms over the sequential algorithms on SISD computers. The implementation of these parallel algorithms on SIMD machines is described in concurrent ALGOL. The physical memory allocations and program implementation depend on the specific architecture of a given SIMD machine.

SIMD Matrix Multiplication

    Many numerical pro&lems suita&le "or parallprocessin$ can &e "ormulated as matrixcomputations. Matrix manipulation is "re#uently needed in solin$ linear systems of

    e#uations. Important matrix operations include matrix multiplication, /-Udecomposition, and matrix inersion. =e present &elo' t'o parallel al$o- rithms "ormatrix multiplication. The di""erences &et'een SISD and SIMD matrix al$orithms arepointed out in their pro$ram structures and speed per"ormances. In $eneral, the inner loopof a multileel SISD pro$ram can &e replaced &y one or more SIMD ector instructions./etA 2 ai( and! 2 &(O&e n x n matrices. The multiplication of A and! $enerates aproduct matrix C 2A x! 2 Ci5 of dimension n x n. The elements of the product matrix Cis related to the elements of A and!&y7

    Ci52ai(x &(5 86.>>)

    There are n cumulatie multiplications to &e per"ormed in #. 6.>>. ! cumulatiemultiplication re"ers to the lin(ed multiply-add operation c2 c ; a x ". The addition ismer$ed into the multiplication &ecause the multiply is e#uialent to multioperand

    addition. There"ore, 'e can consider the unit time as the time re#uired to per"orm onecumulatie multiplication, since add and multiply are per"ormed simultaneously.In a conentional SISD uniprocessor system, the n cumulatie multiplications arecarried out &y a serially coded pro$ram 'ith three leels of D@ loops correspondin$ tothree indices to &e used. The time complexity of this se#uential pro$ram is proportional ton#, as speci"ied in the "ollo'in$ SISD al$orithm "or matrix multiplication.

An O(n³) algorithm for SISD matrix multiplication

For i = 1 to n Do
  For j = 1 to n Do
    cij = 0                            (initialization)
    For k = 1 to n Do
      cij = cij + aik · bkj            (scalar additive multiply)
    End of k loop
  End of j loop
End of i loop

Now we want to implement the matrix multiplication on an SIMD computer with n PEs. The algorithm construct depends heavily on the memory allocations of the A, B, and C matrices in the PEMs. Column vectors are stored within the same PEM. This memory allocation scheme allows parallel access of all the elements in each row vector of the matrices. Based on this data distribution, we obtain the following parallel algorithm. The two parallel do operations correspond to vector load for initialization and vector multiply for the inner loop of additive multiplications. The time complexity has been reduced to O(n²). Therefore, the SIMD algorithm is n times faster than the SISD algorithm for matrix multiplication.

An O(n²) algorithm for SIMD matrix multiplication

For i = 1 to n Do
  Par for k = 1 to n Do
    cik = 0                            (vector load)
  For j = 1 to n Do
    Par for k = 1 to n Do
      cik = cik + aij · bjk            (vector multiply)
  End of j loop
End of i loop

    It should &e noted that the %ctor &oad operation is per"ormed to initialie the ro' ectors

    o" matrix C one ro' at a time. In the %ctor 'u&ti& operation, thesame multiplier aij is &roadcast "rom the CU to all Ps to multiply all n elements {"ikfork 2 ,>, ..., n o" the ith ro' ector o"!. In total, n2 ector multiply operations are needein the dou&le loops.

    I" 'e increase the num&er o" Ps used in an array processor to n>an @8n lo$>n) can &edeised to multiply t'o n xn matrices a and &./et n2>m.Consider an array processor'hose n>2>>mpes are located at the >>mertices o" a >m cu&e net'or(.! >m cu&e net'or(can &e considered as t'o 8>m-) cu&e net'or(s lin(ed to$ether &y 2' extra ed$es. In0i$ure a A-cu&e net'or( is constructed "rom t'o -cu&e net'or(s &y usin$ B extra ed$es&et'een correspondin$ ertices at the corner positions. 0or clarity, 'e simpli"y the A-cu&e,dra'in$ &y sho'in$ only one o" the ei$ht "ourth dimension connections. The

    remainin$ connections are implied.


    /et 8P>m-l P>m->... Pm Pm-l. .,PI P@)>)&e the P address in the 2' cu&e. =e can achiee the

    @8n lo$> n) compute time only i" initially the matrix elements are "aora&ly distri&uted in

    the P ertices. The n ro's o" matrix A are distri&uted oer n distinct Ps 'hose

    addresses satis"y the condition

    P2m-lP2m-l...Pm =Pm-lPm-2.

    as demonstrated in 0i$ure .20a "or the initial distri&ution o" "our ro's o" the matrixA in

    a A x A matrix multiplication (n 2 A, ' 2 >). The "our ro's o"A are then &roadcast oer

    the "ourth dimension and "ront to &ac( ed$es, as mar(ed &y

    ro' num&ers in 0i$ure a.


    The n columns o" matrix! 8or the n ro's o" matrix !) are eenly distri&uted oer the

    Ps o" the 2' cu&es, as illustrated in 0i$ure .2*c. The "our ro's o" ! are then

    &roadcast oer the "ront and &ac( "aces, as sho'n in 0i$ure .20d. 0i$ure 6.> sho's the

    com&ined results o"A and! &roadcasts 'ith the inner product ready to &e computed.

    The n-'ay &roadcast depicted in 0i$ure .20" and .20d

  • 8/12/2019 SIMD Computer organizations

    16/20

    ta(es lo$ n steps, as illustrated in 0i$ure 6.> in ' 2 lo$> n 2 lo$>A 2 > steps.

The matrix multiplication on a 2m-cube network is formally specified below:

1. Transpose B to form Bᵀ over the m-cubes.
2. N-way broadcast each row of Bᵀ to all PEs in the m-cube.
3. N-way broadcast each row of A.
4. Each PE now contains a row of A and a column of B.
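The address condition above says that the high-order m bits of a holder's address equal its low-order m bits. A tiny, purely illustrative check for n = 4 (m = 2) lists the PE addresses that would initially hold the rows of A:

# Sketch: enumerate the PE addresses in the 2m-cube whose high m bits equal
# their low m bits -- the PEs that initially hold the n rows of A (n = 4, m = 2).
m = 2
n = 1 << m
holders = [p for p in range(1 << (2 * m))
           if (p >> m) == (p & (n - 1))]      # p_{2m-1}..p_m == p_{m-1}..p_0
print(holders)                                # [0, 5, 10, 15]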

Parallel Sorting on Array Processors

An SIMD algorithm is presented below for sorting n² elements on a mesh-connected (Illiac-IV-like) processor array in O(n) routing and comparison steps. This shows a speedup of O(n log2 n) over the best sorting algorithm, which takes O(n² log2 n) steps on a uniprocessor system. We assume an array processor with N = n² identical PEs interconnected by a mesh network similar to that of the Illiac IV, except that the PEs at the perimeter have two or three rather than four neighbors. In other words, there are no wraparound connections in this simplified mesh network.

Eliminating the wraparound connections simplifies the array-sorting algorithm. The time complexity of the array-sorting algorithm would be affected by, at most, a factor of two if the wraparound connections were included.

Two time measures are needed to estimate the time complexity of the parallel sorting algorithm. Let tR be the routing time required to move one item from a PE to one of its neighbors, and tC be the comparison time required for one comparison step. Concurrent data routing is allowed, and up to N comparisons may be performed simultaneously. This means that a comparison-interchange step between two items in adjacent PEs can be done in 2tR + tC time units (route left, compare, and route right). A mixture of horizontal and vertical comparison interchanges requires at least 4tR + tC time units.

    The sortin$ pro&lem depends on the indexin$ schemes on the Ps. The Ps may &eindexed &y a &i5ection "rom {1,2,...,n x {1,2,...,nto{0,1,...,N-1, 'here N 2 n2. Thesortin$ pro&lem can &e "ormulated as the moin$ o" the 5th smallest element in the Parray "or all 5 2 :, , >,...,N - . Illustrated in 0i$ure are three indexin$ patterns "ormeda"ter sortin$ the $ien array in part a 'ith respect to three

  • 8/12/2019 SIMD Computer organizations

    17/20

    Ps. The pattern in part " corresponds to a ro-'ajord indin/, part c corresponds to auffid ro-'ajor indexin$, and is &ased on a nak-&ik ro-'ajor indin/. Thechoice o" a particular indexin$ scheme depends upon ho' the sorted elements 'ill &eused. =e are interested in desi$nin$ sortin$ al$orithms 'hich minimie the total routin$and comparison steps.

The longest routing path on the mesh in a sorting process is the transposition of two elements initially loaded at opposite corner PEs, as illustrated in Figure 5.24. This transposition needs at least 4(n − 1) routing steps. This means that no algorithm can sort n² elements in a time of less than O(n). In other words, an O(n) sorting algorithm is considered optimal on a mesh of n² PEs. Before we show one such optimal sorting algorithm on the mesh-connected PEs, let us review the Batcher merge algorithm M(j, k), which merges two sorted subarrays to form a sorted j-by-k array, where j and k are powers of 2 and k > 1.

The snakelike row-major ordering is assumed in all the arrays. In the degenerate case of M(1, 2), a single comparison-interchange step is sufficient to sort two unit subarrays. Given two sorted columns of length j ≥ 2, the M(j, 2) algorithm consists of the following steps:

    xample 6.7 The M85, >) sortin$ al$orithm

O1: Move all odds to the left column and all evens to the right in 2tR time.


    O>7 Use the odd-%n tranoition ort to sort each column in 2jtk ;jtc time.

O3: Interchange on even rows in 2tR time.

    OA7 Per"orm one comparison-interchan$e in 2tk ; tc time

The M(j, k) algorithm

    . .Per"orm sin$le interchan$e step on een ro's>. .Unshu""le each ro'.. .Mer$e &y callin$ al$orithm m85,(%>)A. .Shu""le each ro'6. .Interchan$e on een ro'. Comparison interchan$e


Associative array processing

    In this section, 'e descri&e the "unctional or$aniation o" an associatie array processorand arious parallel processin$ "unctions that can &e per"ormed on an associatieprocessor. =e classi"y associatie processors &ased on associatie- memoryor$aniations. 0inally, 'e identi"y the ma5or searchin$ applications o" associatiememories and associatie processors. !ssociatie processors hae &een &uilt only asspecial-purpose computers "or dedicated applications in the past.

Associative Memory Organizations

Data stored in an associative memory are addressed by their contents. In this sense, an associative memory has been known as a content-addressable memory, a parallel search memory, and a multiaccess memory. The major advantage of associative memory over RAM is its capability of performing parallel search and parallel comparison operations. These are frequently needed in many important applications, such as the storage and retrieval of rapidly changing databases, radar-signal tracking, image processing, computer vision, and artificial intelligence. The major shortcoming of associative memory is its much increased hardware cost; at present, the cost of associative memory is much higher than that of RAMs.

    The structure o" !M is modeled in "i$.The associatuie memory array consists o" n 'ords'ith m&its per 'ord.ach cell in the array consists o" a "lip "lop associated 'ith some

    comparison lo$ic $ates "or pattern match and read 'rite control.! &it slice is a erticalcolumn o" &it cells o" all the 'ords at the same position.

    ach &it cell 1i5 can &e 'ritten in,read out,or compared 'ith an external interi$atin$si$nal.The comparand re$ister C28C,C>,..Cm) is used to hold the (ey operand&ein$ searched "or .The mas(in$ re$isterM28M,M>,..Mm) is used to ena&le the &itslices to &e inoled in the parallel comparison operations across all the 'ord in theassociatie memory.


In practice, most associative memories have the capability of word-parallel operations; that is, all words in the associative memory array are involved in the parallel search operations. This differs drastically from the word-serial operations encountered in RAMs. Based on how the bit slices are involved in the operation, we consider below two different associative memory organizations:

    The &it parallel or$aniation In a &it parallel or$aniation, the comparison process isper"ormed in a parallel-&y-'ord and parallel-&y-&it "ashion. !ll &it slices 'hich are notmas(ed o"" &y the mas(in$ pattern are inoled in the comparison process. In this

    or$aniation, 'ord-match ta$s "or all 'ords are used 80i$ure .#4a). ach cross point inthe array is a &it cell. ssentially, the entire array o" cells is inoled in a searchoperation.

    1it serial or$aniation The memory or$aniation in 0i$ure .#4" operates 'ith one &itslice at a time across all the 'ords. The particular &it slice is selected &y an extra lo$icand control unit. The &it-cell readouts 'ill &e used in su&se#uent &it-slice operations. Theassociatie processor ST!H!N has. the &it serial memory or$aniation and the PP has&een installed 'ith the &it parallel or$aniation.The associatie memories are used mainly "or search and retrieal o" non- numericin"ormation. The &it serial or$aniation re#uires less hard'are &ut is slo'er in speed. The&it parallel or$aniation re#uires additional 'ord-match detection lo$ic &ut is "asterin