alexandre david 1.2.05...
TRANSCRIPT
![Page 2: Alexandre David 1.2.05 adavid@cs.aaupeople.cs.aau.dk/~adavid/teaching/MVP-10/02-chap02-lect02.pdf · 08-02-2010 MVP'10 - Aalborg University 21 PRAM model Several execution units accessing](https://reader035.vdocuments.net/reader035/viewer/2022071115/5ffbc45876f26a2bfc4a4bf4/html5/thumbnails/2.jpg)
08-02-2010 MVP'10 - Aalborg University 2
How much do we need to know?
It is important to know the architecture of parallel hardware.
Not all details are important to programmers:
  keep portability
  keep up with technological changes
The point: Get a meaningful model.
Intel Core Duo
Cache coherence protocol: Modified, Exclusive, Shared, Invalid (MESI).
More shared L2.
Low latency.
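The four states named on this slide can be illustrated with a small simulation. The following is a minimal sketch of a simplified MESI protocol for a single memory line, not a description of the actual Core Duo implementation; the class and method names (`MesiLine`, `read`, `write`, `snapshot`) are made up for illustration, and write-back traffic is only mentioned in comments.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class MesiLine:
    """Tracks the MESI state of ONE memory line in each core's cache."""

    def __init__(self, num_caches):
        self.states = [State.INVALID] * num_caches

    def read(self, cache):
        """Local read by `cache`; the other caches snoop the request."""
        if self.states[cache] is not State.INVALID:
            return  # hit: M, E and S all allow reading without bus traffic
        others_hold = False
        for other, state in enumerate(self.states):
            if other != cache and state is not State.INVALID:
                # A Modified copy would be written back first; holders end Shared.
                self.states[other] = State.SHARED
                others_hold = True
        self.states[cache] = State.SHARED if others_hold else State.EXCLUSIVE

    def write(self, cache):
        """Local write by `cache`; all other copies are invalidated."""
        for other in range(len(self.states)):
            if other != cache:
                self.states[other] = State.INVALID
        self.states[cache] = State.MODIFIED

    def snapshot(self):
        """One letter per cache, e.g. 'MI' for two caches."""
        return "".join(state.value for state in self.states)


line = MesiLine(num_caches=2)
trace = []
for op, cache in [("read", 0), ("read", 1), ("write", 0), ("read", 1)]:
    getattr(line, op)(cache)
    trace.append(line.snapshot())
print(trace)
```

Each entry of `trace` shows the state of the line in cache 0 and cache 1 after one access, which makes the invalidation caused by a write visible.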
Example
The point: Relatively expensive.
AMD Dual Core Opteron
Cache coherence protocol: Modified, Owned, Exclusive, Shared, Invalid (MOESI).
easier for SMP
Core i7
SMP
Caches “snoop” on the bus.
The bus is the bottleneck.
Larger SMP: Sun Fire
18 boards connected by a crossbar switch.
Snooping buses.
Directory-based cache coherence protocol.
Scalable/higher latency.
Note: Expensive hardware.
Crossbar
N x N connections.
Expensive, limited.
Heterogeneous chips
GPUs
  800 ALUs on ATI’s latest 4800 series.
  Less logic, more computational units.
FPGAs
  PCI boards available
  reconfigurable
Cell
  Dual-threaded PPC (PPU), 64 bits
  8x SPU
Cell architecture
18.2 GB/s
128 bits
No cache coherence protocol.
Different philosophy: the PPU is a coordinator, the SPUs do the job.
Difficult to program.
Clusters
“Cheap” PCs connected together.
  Gigabit Ethernet
  InfiniBand
  …
Memory private to each machine, use message-based communication.
Scalable but high latency.
Sold by racks.
BlueGene
65536 x @ 700 MHz
interesting part
Interconnect
3-D torus for standard data transfers.
Collective network for fast reductions.
Very powerful.
Broadcast/Reduction
Broadcast Reduce
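The two collective operations named on this slide can be sketched as tree-shaped communication patterns. This is a minimal sequential simulation, assuming a binary combining tree; it is not a description of the BlueGene collective network, and the function names `tree_reduce` and `tree_broadcast` are invented for the example. Each round stands for one step in which all listed transfers could happen in parallel.

```python
import operator


def tree_reduce(values, combine=operator.add):
    """Combine one value per node pairwise; returns (result, rounds)."""
    vals = list(values)
    stride, rounds = 1, 0
    while stride < len(vals):
        # All pairs in a round are independent of each other.
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = combine(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds


def tree_broadcast(value, num_nodes):
    """Send `value` from node 0 to all nodes; returns (copies, rounds)."""
    copies = [None] * num_nodes
    copies[0] = value
    have, rounds = 1, 0
    while have < num_nodes:
        # Every node that already holds the value forwards it to one new node.
        for i in range(min(have, num_nodes - have)):
            copies[have + i] = copies[i]
        have = min(2 * have, num_nodes)
        rounds += 1
    return copies, rounds


total, reduce_rounds = tree_reduce(range(1, 9))
copies, broadcast_rounds = tree_broadcast("x", 8)
print(total, reduce_rounds, broadcast_rounds)
```

With 8 nodes both operations finish in 3 rounds, i.e. a number of rounds logarithmic in the node count rather than linear.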
Cut-through routing
Simplified packet routing:
  Packets take the same path (1x routing information).
  In-sequence packet delivery (no sequencing).
  Error detection at message level, cheap detection (for good networks).
  Fixed-size unit for packets = flow control digits (flits).
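To see why cut-through routing pays off, it helps to compare it with store-and-forward routing under the usual textbook cost model. The sketch below assumes that model; the slide itself gives no formula, and the parameter names (`t_s` for startup time, `t_h` for per-hop time, `t_w` for per-word transfer time) and the numbers in the example are illustrative assumptions in arbitrary time units.

```python
def store_and_forward_time(m, hops, t_s, t_h, t_w):
    """Each intermediate node receives the whole m-word message before forwarding."""
    return t_s + hops * (t_h + m * t_w)


def cut_through_time(m, hops, t_s, t_h, t_w):
    """Flits are pipelined along the path, so the message length is paid only once."""
    return t_s + hops * t_h + m * t_w


# Hypothetical values: 1000-word message crossing 10 links.
sf = store_and_forward_time(m=1000, hops=10, t_s=50, t_h=2, t_w=1)
ct = cut_through_time(m=1000, hops=10, t_s=50, t_h=2, t_w=1)
print(sf, ct)
```

In this model the cut-through cost grows with the sum of path length and message length, whereas store-and-forward grows with their product.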
Lessons
Very different architectures.
  SMP
  Distributed
But we want one meaningful model.
Hints:
  local accesses: cheap
  non-local accesses: expensive
RAM model
Sequential execution unit with unbounded memory.
  every operation takes 1 unit of time
Limited
  ok for algorithms: reason on complexity
  unrealistic
Application of the RAM model
Expected: O(n), O(log n) (array must be sorted)
update of location missing
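The code shown on this slide did not survive extraction. The sketch below assumes the two complexities refer to linear search and binary search over an array, and counts comparisons as RAM-model time units. The function names and the step counters are additions for illustration, not the slide's code.

```python
def linear_search(items, target):
    """Return (index or -1, number of comparisons); O(n) in the RAM model."""
    steps = 0
    for index, item in enumerate(items):
        steps += 1
        if item == target:
            return index, steps
    return -1, steps


def binary_search(sorted_items, target):
    """Return (index or -1, number of probes); O(log n), input must be sorted."""
    steps = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1  # both bounds must be updated, or the loop never ends
        else:
            hi = mid - 1
    return -1, steps


data = list(range(1024))
print(linear_search(data, 1023))
print(binary_search(data, 1023))
```

For 1024 sorted elements, the worst-case linear search needs 1024 comparisons while binary search needs about log2(1024) + 1 probes.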
PRAM model
Several execution units accessing one shared unbounded memory
  global access
  synchronous access: one global clock
  contention resolved by pre-defined rules
EREW, CREW, CRCW, ERCW
  least powerful, least convenient: EREW
  most powerful, most convenient: CRCW
  lesson: reason on CRCW but apply on EREW because it is possible to simulate one with the other (in polynomial time)
like RAM: good for algorithms, complexity…
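A classic illustration of what concurrent writes buy is finding a maximum in a constant number of PRAM steps with n² processors. The sketch below simulates this sequentially and assumes the "common" CRCW rule, where concurrent writers to one cell must all write the same value. The slide does not say which contention rule is used, so this is an assumption, and `crcw_max_index` is a made-up name.

```python
def crcw_max_index(values):
    """Index of the maximum, computed as two synchronous CRCW PRAM steps."""
    n = len(values)
    loser = [False] * n  # shared memory: one flag per element

    # Step 1: virtual processor (i, j) compares one pair. The nested loops
    # only simulate what n*n processors would do in a single time unit.
    for i in range(n):
        for j in range(n):
            beaten = values[i] < values[j] or (values[i] == values[j] and i > j)
            if beaten:
                loser[i] = True  # concurrent writers all store the same value

    # Step 2: the single element that was never beaten is the maximum.
    winners = [i for i in range(n) if not loser[i]]
    return winners[0]


print(crcw_max_index([3, 9, 2, 9, 5]))
```

Ties are broken in favour of the smallest index, so exactly one element stays unmarked. On an EREW machine the same flags could not be written concurrently, which is where the simulation overhead mentioned above comes from.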
CTA (Candidate Type Architecture)
Account for communication costs.
Applies to clusters & SMPs.
Local/non-local accesses.
Goal: Achieve in practice the predicted running time. PRAM is misleading in that respect.
The catch: Not easy to estimate communication costs.
Model:
  interconnected processors with RAM
  topology not specified but this impacts communication costs.
CTA
SMP
Cluster
Cell
…
Memory latency specified as a function of the real architecture.
Non-local: λ.
Typical λ
Lesson
Use locality
  temporal & spatial
  sometimes redundant computation is better than sending data around
Exact number of processors supplied at runtime.
  scale/not tied to one setup
Note: λ increases with P.
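The trade-off between recomputing and communicating can be made concrete with a toy cost function. This is a rough sketch, assuming a local operation costs 1 time unit and a non-local reference costs λ. The values of λ used here are hypothetical and not taken from the course's table, and the function names are invented.

```python
def cta_time(local_ops, nonlocal_refs, lam):
    """Estimated time: each local operation costs 1, each non-local reference lam."""
    return local_ops + nonlocal_refs * lam


def cheaper_to_recompute(recompute_ops, values_to_fetch, lam):
    """True if redoing the work locally beats fetching the results remotely."""
    return cta_time(recompute_ops, 0, lam) < cta_time(0, values_to_fetch, lam)


# Hypothetical latencies: a high-latency and a low-latency machine.
print(cheaper_to_recompute(recompute_ops=200, values_to_fetch=1, lam=1000))
print(cheaper_to_recompute(recompute_ops=200, values_to_fetch=1, lam=50))
```

Because λ increases with P, a decision that favours communication on a small machine may flip towards redundant computation on a larger one.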
Memory reference mechanisms
Shared memory
  avoid race conditions, needs synchronization
One-sided
  not common
  private (local) & shared non-coherent memory
Message passing: 2-sided
  MPI
  Complex communication protocols.
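The difference between the shared-memory and the two-sided message-passing styles can be sketched with threads. This is an illustration only: a `queue.Queue` stands in for a send/receive channel and is not MPI, and the constants and function names are made up for the example.

```python
import queue
import threading

NUM_WORKERS = 4
INCREMENTS = 1000


def shared_memory_sum():
    """Workers update one shared counter; a lock prevents the race."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(INCREMENTS):
            with lock:  # without it, read-modify-write steps can interleave
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum():
    """Workers keep private counters and send them; the receiver adds them."""
    mailbox = queue.Queue()  # stand-in for a communication channel

    def worker():
        private = 0
        for _ in range(INCREMENTS):
            private += 1  # purely local, no synchronization needed
        mailbox.put(private)  # the "send" side

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    total = sum(mailbox.get() for _ in range(NUM_WORKERS))  # matching receives
    for t in threads:
        t.join()
    return total


print(shared_memory_sum(), message_passing_sum())
```

In the first version every update touches shared state and must be synchronized; in the second, all work is local and only one message per worker crosses the boundary, with each send matched by a receive.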
Memory consistency models
Sequential consistency: expensive.
  serialize the operations of all processors
  operations obey specified order
Relaxed consistency: weaker.
  variations
Keep in mind: There are hardware tricks to get sequential consistency (CAS/TAS).
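The semantics of compare-and-swap (CAS) can be modelled in a few lines. Python has no CAS instruction, so in this sketch a lock stands in for the atomicity that the hardware would provide. `AtomicInt`, `compare_and_swap` and `increment` are hypothetical names, and the code shows only the retry pattern, not real lock-free performance.

```python
import threading


class AtomicInt:
    """Models a memory word that supports compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Atomically: if the value equals `expected`, store `new`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter):
    """Retry until no other thread changed the value between load and swap."""
    while True:
        seen = counter.load()
        if counter.compare_and_swap(seen, seen + 1):
            return


def worker(counter, times):
    for _ in range(times):
        increment(counter)


counter = AtomicInt()
threads = [threading.Thread(target=worker, args=(counter, 500)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())
```

A failed swap means another thread updated the word first, so the caller simply reloads and tries again; no update is ever lost.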
Interconnects
Bus Based Networks
No local cache
Local cache
Serialize accesses – cheap.
Crossbar Networks
Parallel access – expensive.
Omega networks
Multi-stage network: compromise cost/performance.
N nodes: log N stages.
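Routing through an omega network can be sketched with the standard destination-tag scheme: each stage applies a perfect shuffle and then a 2x2 switch selects its upper or lower output from one bit of the destination address. The code assumes N is a power of two; the function name `omega_route` is invented, and the figure from the slide is not available to check the wiring against.

```python
def omega_route(src, dst, num_nodes):
    """Positions visited from input `src` to output `dst`, one per stage."""
    stages = num_nodes.bit_length() - 1
    if num_nodes < 2 or 1 << stages != num_nodes:
        raise ValueError("num_nodes must be a power of two (at least 2)")
    mask = num_nodes - 1
    position, path = src, [src]
    for stage in range(stages):
        # Perfect shuffle wiring: rotate the address left by one bit.
        position = ((position << 1) | (position >> (stages - 1))) & mask
        # 2x2 switch: upper (0) or lower (1) output, most significant bit first.
        bit = (dst >> (stages - 1 - stage)) & 1
        position = (position & ~1) | bit
        path.append(position)
    return path


print(omega_route(src=5, dst=2, num_nodes=8))
```

After log2(N) stages every bit of the source address has been replaced by a destination bit, so the message always arrives at `dst`. Two messages may still need the same switch output at some stage, which is why the omega network is a blocking network and cheaper than a crossbar.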
Linear Arrays and Meshes
Hypercubes
2^d nodes, d = dimension
good routing
relatively expensive
low congestion
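The "good routing" property follows from the node labelling: two nodes are neighbours exactly when their labels differ in one bit. The sketch below shows dimension-ordered (E-cube) routing, one common scheme and not necessarily the one intended on the slide; `hypercube_route` is a made-up name.

```python
def hypercube_route(src, dst, dimension):
    """Nodes visited when the differing bits are corrected from low to high."""
    size = 1 << dimension
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node labels must be in range(2 ** dimension)")
    path = [src]
    current = src
    for bit in range(dimension):
        if ((current ^ dst) >> bit) & 1:
            current ^= 1 << bit  # cross the link along this dimension
            path.append(current)
    return path


print(hypercube_route(src=0b000, dst=0b101, dimension=3))
```

The number of hops equals the Hamming distance between the two labels, so no route is longer than d hops.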
Fat trees
More bandwidth where it is needed.
Evaluating The Networks
All the previous topologies have advantages and disadvantages.
Important factors: cost and performance.
Define criteria to characterize cost and performance.
Criteria
Diameter: maximum shortest-path distance between any two nodes pa ↔ pb.
Connectivity: measure of multiplicity of paths.
Bisection width: minimum number of links to cut in order to partition the network into 2 equal halves.
Bisection bandwidth: minimum volume of communication allowed between 2 halves.
Cost: number of communication links, i.e., wires.
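Two of these criteria are easy to compute directly for small networks. The following sketch builds a ring and a hypercube as adjacency sets and measures diameter (by breadth-first search) and cost (by counting links). The helper names are invented, and the two topologies are chosen only as examples.

```python
from collections import deque


def ring(p):
    """Ring of p nodes: each node is linked to its two neighbours."""
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def hypercube(d):
    """Hypercube of 2**d nodes: neighbours differ in exactly one bit."""
    return {i: {i ^ (1 << b) for b in range(d)} for i in range(1 << d)}


def diameter(graph):
    """Largest shortest-path distance, via one breadth-first search per node."""
    longest = 0
    for start in graph:
        dist = {start: 0}
        frontier = deque([start])
        while frontier:
            node = frontier.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    frontier.append(neighbor)
        longest = max(longest, max(dist.values()))
    return longest


def link_count(graph):
    """Cost: every undirected link appears in two adjacency sets."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


for name, net in [("ring(8)", ring(8)), ("hypercube(3)", hypercube(3))]:
    print(name, "diameter:", diameter(net), "links:", link_count(net))
```

With 8 nodes each, the hypercube has the smaller diameter but needs more links. Bisection width is not computed here because finding a minimum bisection is hard in general; for these two networks it is 2 for the ring and 2^(d-1) for the hypercube.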