ece 1747: parallel programming basics of parallel architectures: shared-memory machines
![Page 1: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/1.jpg)
ECE 1747: Parallel Programming
Basics of Parallel Architectures:
Shared-Memory Machines
![Page 2: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/2.jpg)
Two Parallel Architectures
• Shared memory machines.
• Distributed memory machines.
![Page 3: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/3.jpg)
Shared Memory: Logical View
proc1 proc2 proc3 procN
Shared memory space
![Page 4: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/4.jpg)
Shared Memory Machines
• Small number of processors: shared memory with coherent caches (SMP).
• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
![Page 5: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/5.jpg)
SMPs
• 2- or 4-processor PCs are now commodity hardware.
• Good price/performance ratio.
• Memory is sometimes the bottleneck (see later).
• Typical price (8-node): ~ $20-40k.
![Page 6: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/6.jpg)
Physical Implementation
proc1 proc2 proc3 procN
Shared memory
cache1 cache2 cache3 cacheN
bus
![Page 7: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/7.jpg)
Shared Memory Machines
• Small number of processors: shared memory with coherent caches (SMP).
• Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).
![Page 8: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/8.jpg)
CC-NUMA: Physical Implementation
proc1 proc2 proc3 procN
mem1 mem2 mem3 memN
cache1 cache2 cache3 cacheN
inter-connect
![Page 9: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/9.jpg)
Caches in Multiprocessors
• Suffer from the coherence problem:
– same line appears in two or more caches
– one processor writes a word in the line
– other processors can now read stale data
• Leads to the need for a coherence protocol
– avoids coherence problems
• Many exist; we will look at just a simple one.
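The three steps of the coherence problem can be replayed in a toy model. The sketch below is only an illustration under stated assumptions: the `Machine` and `CacheEntry` types and the `cpu_read`, `cpu_write`, and `stale_read_demo` functions are hypothetical names, the "line" is a single word `x`, and the caches are write-back with no protocol connecting them.

```c
#include <stdbool.h>

/* Hypothetical toy model: one memory word and one private cache entry per
   processor, with NO coherence protocol between the caches. */
enum { NPROC = 2 };

typedef struct {
    bool valid;   /* does this cache hold a copy of x? */
    int  value;   /* the cached copy of x */
} CacheEntry;

typedef struct {
    int        memory_x;       /* the single shared word x */
    CacheEntry cache[NPROC];   /* one private copy per processor */
} Machine;

/* Read x on processor p: hit in the private cache if possible, else fetch. */
int cpu_read(Machine *m, int p) {
    if (!m->cache[p].valid) {
        m->cache[p].value = m->memory_x;
        m->cache[p].valid = true;
    }
    return m->cache[p].value;
}

/* Write x on processor p: update only the local copy (write-back cache).
   Nothing tells the other caches that their copies are now out of date. */
void cpu_write(Machine *m, int p, int v) {
    m->cache[p].valid = true;
    m->cache[p].value = v;
}

/* Replay the three steps from the slide and return what P1 finally reads. */
int stale_read_demo(void) {
    Machine m = { .memory_x = 0 };   /* caches start empty (zero-initialized) */
    cpu_read(&m, 0);                 /* same line appears in two caches */
    cpu_read(&m, 1);
    cpu_write(&m, 0, 1);             /* one processor writes the word */
    return cpu_read(&m, 1);          /* the other processor reads its old copy */
}
```

In this model `stale_read_demo()` returns the old value 0 even though P0 has already written 1, which is the stale read a coherence protocol is meant to rule out.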
![Page 10: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/10.jpg)
What is coherence?
• What does it mean to be shared?
• Intuitively, read last value written.
• Notion is not well-defined in a system without a global clock.
![Page 11: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/11.jpg)
The Notion of “last written” in a Multi-processor System
[Figure: timelines of four processors P0 to P3, with two writes w(x) and two reads r(x) placed on them]
![Page 12: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/12.jpg)
The Notion of “last written” in a Single-machine System
w(x) w(x) r(x) r(x)
![Page 13: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/13.jpg)
Coherence: a Clean Definition
• Is achieved by referring back to the single machine case.
• Called sequential consistency.
![Page 14: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/14.jpg)
Sequential Consistency (SC)
• Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.
![Page 15: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/15.jpg)
Returning to our Example
[Figure: timelines of four processors P0 to P3, with two writes w(x) and two reads r(x) placed on them]
![Page 16: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/16.jpg)
Another Way of Defining SC
• All memory references of a single process execute in program order.
• All writes are globally ordered.
![Page 17: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/17.jpg)
SC: Example 1
w(x,1) w(y,1)
r(x) r(y)
Initial values of x,y are 0.
What are possible final values?
![Page 18: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/18.jpg)
SC: Example 2
w(x,1) w(y,1)
r(y) r(x)
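One way to answer "what are the possible values?" under SC is to enumerate every interleaving that keeps each process in program order. The sketch below does this for Example 2 under two assumptions that the transcript does not confirm: the top row is one process (P0 runs w(x,1) then w(y,1)), the bottom row is another (P1 runs r(y) then r(x)), and x and y start at 0 as in Example 1. The names `Outcomes`, `explore`, and `sc_example2` are hypothetical.

```c
#include <stdbool.h>

/* seen[ry][rx] is set when some sequentially consistent execution lets
   P1 observe r(y) == ry followed by r(x) == rx. */
typedef struct { bool seen[2][2]; } Outcomes;

/* Assumed reading of the slide: P0 runs w(x,1); w(y,1) and
   P1 runs r(y); r(x). i and j count how many operations of P0 and P1
   have already executed in the interleaving built so far. */
static void explore(int i, int j, int x, int y, int ry, int rx, Outcomes *out) {
    if (i == 2 && j == 2) {          /* both programs finished */
        out->seen[ry][rx] = true;
        return;
    }
    if (i < 2) {                     /* next step taken by P0, in program order */
        if (i == 0) explore(1, j, 1, y, ry, rx, out);   /* w(x,1) */
        else        explore(2, j, x, 1, ry, rx, out);   /* w(y,1) */
    }
    if (j < 2) {                     /* next step taken by P1, in program order */
        if (j == 0) explore(i, 1, x, y, y, rx, out);    /* r(y) */
        else        explore(i, 2, x, y, ry, x, out);    /* r(x) */
    }
}

/* Collect every (r(y), r(x)) pair reachable from x = y = 0. */
Outcomes sc_example2(void) {
    Outcomes out = { { { false, false }, { false, false } } };
    explore(0, 0, 0, 0, 0, 0, &out);
    return out;
}
```

Under these assumptions the enumeration finds three outcomes and never r(y) = 1 with r(x) = 0: if P1 sees the new y, the write to x is already ordered before it, so the later read of x must see 1.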
![Page 19: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/19.jpg)
SC: Example 3
w(x,1)
w(y,1)
r(y) r(x)
![Page 20: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/20.jpg)
SC: Example 4
w(x,1)
w(x,2)
r(x)
r(x)
![Page 21: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/21.jpg)
Implementation
• Many ways of implementing SC.
• In fact, implementations sometimes enforce stronger conditions than SC requires.
• Will look at a simple one: MSI protocol.
![Page 22: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/22.jpg)
Physical Implementation
proc1 proc2 proc3 procN
Shared memory
cache1 cache2 cache3 cacheN
bus
![Page 23: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/23.jpg)
Fundamental Assumption
• The bus is a reliable, ordered broadcast bus.
– Every message sent by a processor is received by all other processors in the same order.
• Also called a snooping bus.
– Processors (or caches) snoop on the bus.
![Page 24: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/24.jpg)
States of a Cache Line
• Invalid
• Shared
– read-only, one of many cached copies
• Modified
– read-write, sole valid copy
![Page 25: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/25.jpg)
Processor Transactions
• processor read(x)
• processor write(x)
![Page 26: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/26.jpg)
Bus Transactions
• bus read(x)
– asks for copy with no intent to modify
• bus read-exclusive(x)
– asks for copy with intent to modify
![Page 27: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/27.jpg)
State Diagram: Step 0
I S M
![Page 28: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/28.jpg)
State Diagram: Step 1
I S M
PrRd/BuRd
![Page 29: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/29.jpg)
State Diagram: Step 2
I S M
PrRd/BuRd
PrRd/-
![Page 30: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/30.jpg)
State Diagram: Step 3
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
![Page 31: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/31.jpg)
State Diagram: Step 4
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
PrWr/BuRdX
![Page 32: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/32.jpg)
State Diagram: Step 5
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
PrWr/BuRdX
PrWr/-
![Page 33: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/33.jpg)
State Diagram: Step 6
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
PrWr/BuRdX
PrWr/-
BuRd/Flush
![Page 34: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/34.jpg)
State Diagram: Step 7
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
PrWr/BuRdX
PrWr/-
BuRd/Flush
BuRd/-
![Page 35: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/35.jpg)
State Diagram: Step 8
I S M
PrRd/BuRd
PrRd/-
PrWr/BuRdX
PrWr/BuRdX
PrWr/-
BuRd/Flush
BuRd/-
BuRdX/-
![Page 36: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/36.jpg)
State Diagram: Step 9
I S M
I → S: PrRd/BuRd
S → S: PrRd/-
I → M: PrWr/BuRdX
S → M: PrWr/BuRdX
M → M: PrWr/-
M → S: BuRd/Flush
S → S: BuRd/-
S → I: BuRdX/-
M → I: BuRdX/Flush
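The finished diagram can be read as two small transition functions, one for requests from the local processor and one for transactions snooped on the bus. The sketch below is a reading of the edge labels against the textbook MSI protocol, not a transcription: the arrow endpoints do not survive in the transcript and are assumed, and the names `LineState`, `ProcEvent`, `BusMsg`, `msi_on_processor`, and `msi_on_bus` are hypothetical.

```c
typedef enum { INVALID, SHARED, MODIFIED } LineState;
typedef enum { PR_RD, PR_WR } ProcEvent;            /* requests from the local processor */
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } BusMsg;  /* transactions seen on the bus */

/* Local processor access: returns the new state and stores in *bus_out
   the transaction the cache must broadcast (BUS_NONE on a hit). */
LineState msi_on_processor(LineState s, ProcEvent e, BusMsg *bus_out) {
    *bus_out = BUS_NONE;
    if (e == PR_RD) {
        if (s == INVALID) {              /* PrRd/BuRd: read miss */
            *bus_out = BUS_RD;
            return SHARED;
        }
        return s;                        /* PrRd/-: hit in Shared; a read in
                                            Modified is also a hit (assumed,
                                            not labeled on the slides) */
    }
    if (s == MODIFIED)                   /* PrWr/-: already the sole copy */
        return MODIFIED;
    *bus_out = BUS_RDX;                  /* PrWr/BuRdX from Invalid or Shared */
    return MODIFIED;
}

/* Transaction snooped from another cache: returns the new state and sets
   *flush to 1 when this cache holds the sole valid copy and must supply it. */
LineState msi_on_bus(LineState s, BusMsg msg, int *flush) {
    *flush = 0;
    if (s == MODIFIED && msg != BUS_NONE) {
        *flush = 1;                      /* BuRd/Flush or BuRdX/Flush */
        return msg == BUS_RD ? SHARED : INVALID;
    }
    if (s == SHARED && msg == BUS_RDX)   /* BuRdX/-: drop the read-only copy */
        return INVALID;
    return s;                            /* BuRd/- in Shared; Invalid ignores the bus */
}
```

Keeping the processor side and the snooping side separate mirrors the two kinds of labels in the diagram: `Pr*` events originate locally and may emit a bus transaction, while `Bu*` events arrive from other caches and may force a flush.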
![Page 37: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/37.jpg)
In Reality
• Most machines use a slightly more complicated protocol (4 states instead of 3).
• See architecture books (MESI protocol).
![Page 38: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/38.jpg)
Problem: False Sharing
• Occurs when two or more processors access different data in the same cache line, and at least one of them writes.
• Leads to a ping-pong effect: the line bounces back and forth between the caches.
![Page 39: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/39.jpg)
False Sharing: Example (1 of 3)
for (i = 0; i < n; i++)
    a[i] = b[i];
• Let’s assume we parallelize the code:
– p = 2
– element of a takes 4 words
– cache line has 32 words
![Page 40: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/40.jpg)
False Sharing: Example (2 of 3)
a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]
cache line
Written by processor 0
Written by processor 1
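With 4-word elements and 32-word lines, 32 / 4 = 8 elements of a share each cache line, so how iterations are assigned to processors decides whether a line has one writer or two. The sketch below counts lines with more than one writer for a cyclic assignment (even and odd elements alternate between processors) and for a block assignment. It assumes a[0] starts on a line boundary; `owner_cyclic`, `owner_block`, and `falsely_shared_lines` are hypothetical helper names, and the block assignment is added here only for comparison.

```c
#include <stddef.h>

enum {
    WORDS_PER_ELEMENT = 4,     /* from the slide: element of a takes 4 words */
    WORDS_PER_LINE    = 32,    /* from the slide: cache line has 32 words */
    ELEMS_PER_LINE    = WORDS_PER_LINE / WORDS_PER_ELEMENT   /* 8 elements */
};

/* Cyclic distribution: iteration i goes to processor i mod p. */
int owner_cyclic(size_t i, size_t n, int p) {
    (void)n;                   /* unused, kept so both owners share a signature */
    return (int)(i % (size_t)p);
}

/* Block distribution: contiguous chunks of ceil(n / p) iterations. */
int owner_block(size_t i, size_t n, int p) {
    size_t chunk = (n + (size_t)p - 1) / (size_t)p;
    return (int)(i / chunk);
}

typedef int (*OwnerFn)(size_t i, size_t n, int p);

/* Count cache lines of a[] that are written by more than one processor,
   assuming a[0] is aligned to the start of a cache line. */
size_t falsely_shared_lines(size_t n, int p, OwnerFn owner) {
    size_t shared = 0;
    for (size_t first = 0; first < n; first += ELEMS_PER_LINE) {
        int first_owner = owner(first, n, p);
        for (size_t i = first + 1; i < first + ELEMS_PER_LINE && i < n; i++) {
            if (owner(i, n, p) != first_owner) {
                shared++;      /* a second writer touches this line */
                break;
            }
        }
    }
    return shared;
}
```

For n = 64 and p = 2, `falsely_shared_lines(64, 2, owner_cyclic)` counts all 8 lines, while `falsely_shared_lines(64, 2, owner_block)` counts none, because each processor's 32 elements cover whole lines.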
![Page 41: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/41.jpg)
False Sharing: Example (3 of 3)
P0 writes: a[0] a[2] a[4] ...
P1 writes: a[1] a[3] a[5] ...
Between successive writes: inv, data
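The ping-pong effect can be made concrete by counting bus read-exclusive transactions for a sequence of writes to a single cache line. The sketch below is a simplified model under stated assumptions: two processors, one line, writes only, and the write path of an MSI-style protocol in which a writer that does not hold the line in Modified state must invalidate the other copy. The function name `count_bus_rdx` is hypothetical.

```c
typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Count bus read-exclusive transactions for writes to one cache line.
   writer[k] is the processor (0 or 1) performing the k-th write. */
int count_bus_rdx(const int *writer, int nwrites) {
    LineState state[2] = { INVALID, INVALID };   /* the line in each cache */
    int rdx = 0;
    for (int k = 0; k < nwrites; k++) {
        int me = writer[k];
        int other = 1 - me;
        if (state[me] != MODIFIED) {
            rdx++;                    /* write miss: broadcast read-exclusive */
            state[other] = INVALID;   /* the other copy is invalidated (inv) */
            state[me] = MODIFIED;     /* the line moves to the writer (data) */
        }
        /* otherwise the write hits in Modified and stays off the bus */
    }
    return rdx;
}
```

With the alternating writers of the example (0, 1, 0, 1, ...) every one of 8 writes to the line costs a bus transaction, whereas 8 writes by a single processor cost only the first one.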
![Page 42: ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines](https://reader035.vdocuments.net/reader035/viewer/2022062301/5697bfc41a28abf838ca604a/html5/thumbnails/42.jpg)
Summary
• Sequential consistency.
• Bus-based coherence protocols.
• False sharing.