Shared Memory Systems
CEG 4131 Computer Architecture III
Miodrag Bolic
Overview
• Next 4 to 5 lectures
  – Software and hardware components of the shared memory system, Altera’s solution
  – Designing parallel programs
  – Cache coherence: snooping protocols, directory-based protocols
  – Case studies and examples
Outline
• Characteristics of shared memory systems
• Programming shared-memory multiprocessors
• Hardware implementation
  – Architectures
  – Memory access
  – Caches
  – Synchronization
Characteristics of shared memory systems [3]
• Any processor can directly reference any memory location.
• Communication occurs implicitly as a result of loads and stores.
• Location of data in memory is transparent to the programmer.
• Inherently provided on a wide range of platforms (standard processors today have specific extra hardware for shared memory systems).
• Memory may be physically distributed among processors.
Requirements [3]
• Support for memory coherency
  – The machine must make sure that all of the processing nodes have an accurate picture of the most up-to-date memory.
• Support for atomic operations on data
  – The machine must allow only one processor to change data at a time.
  – Nonatomic operation: one processor requests data and, before the request is answered, another processor changes that data.
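The nonatomic case can be replayed by hand. The sketch below is an illustration, not part of the cited material: it splits an increment into its load and store halves, interleaves two hypothetical processors P0 and P1 in a fixed order, and compares the result with C11 `atomic_fetch_add`. All function and variable names are assumptions made for the example.

```c
#include <stdatomic.h>

/* Plain shared variable: an increment is a separate load and store. */
static int shared_plain = 0;

static int  nonatomic_load(void)   { return shared_plain; }
static void nonatomic_store(int v) { shared_plain = v; }

/* Replays one bad interleaving: P0 load, P1 load, P0 store, P1 store. */
int lost_update_demo(void)
{
    shared_plain = 0;
    int r0 = nonatomic_load();   /* P0 reads 0 */
    int r1 = nonatomic_load();   /* P1 reads 0 before P0 writes back */
    nonatomic_store(r0 + 1);     /* P0 writes 1 */
    nonatomic_store(r1 + 1);     /* P1 also writes 1: one increment is lost */
    return shared_plain;
}

/* Same two increments, each performed as one indivisible read-modify-write. */
static atomic_int shared_atomic = 0;

int atomic_update_demo(void)
{
    atomic_store(&shared_atomic, 0);
    atomic_fetch_add(&shared_atomic, 1);   /* P0 */
    atomic_fetch_add(&shared_atomic, 1);   /* P1 */
    return atomic_load(&shared_atomic);
}
```

With this interleaving `lost_update_demo()` returns 1 even though two increments ran, while `atomic_update_demo()` returns 2.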
Shared Memory Program [3]
Sum all the elements of an array of dimension size.
INITIALIZE;                                  // assign proc_num and num_procs
if (proc_num == 0) {                         // processor 0 initializes the shared data
    read_array(global_array_to_sum, size);   // read the array and array size from file
    LOCK(global_sum);
    global_sum = 0;                          // initialize the sum
    UNLOCK(global_sum);
}
BARRIER(num_procs);    // wait for num_procs processors, so that size and the array are valid for everyone
local_sum = 0;
size_to_sum = size / num_procs;
lower_ind = size_to_sum * proc_num;
upper_ind = size_to_sum * (proc_num + 1);
// if size = 100 and num_procs = 4, processor 0 sums 0 to 24, processor 1 sums 25 to 49, etc.
for (int i = lower_ind; i < upper_ind; i++)
    local_sum += global_array_to_sum[i];
LOCK(global_sum);      // lock the sum variable so only this processor can change it
global_sum += local_sum;
UNLOCK(global_sum);    // give the sum back so other processors can add to it
BARRIER(num_procs);    // wait until all num_procs partial sums have been added
if (proc_num == 0)
    printf("sum is %d", global_sum);
Multiprocessor Software Functions – Example [3]
• INITIALIZE – assigns a number (proc_num) to each processor in the system; assigns the total number of processors (num_procs).
• LOCK(data) – allows a processor to “check out” a certain piece of shared data.
  – While one processor has the data locked, no other processor can obtain the lock.
  – The lock is blocking, so once a LOCK is encountered, execution of the program cannot proceed until the LOCK is obtained.
• UNLOCK(data) – releases a lock so that other processors can obtain it.
• BARRIER(n_procs) – when a BARRIER is encountered, a processor waits at that BARRIER until n_procs processors reach the BARRIER; then execution can proceed.
Architecture [3]
Both static and dynamic networks can be used to connect processors and shared memory.
[Figure: an example of a shared-bus architecture with 4 processors]
Natural Extensions of Memory System [4]
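[Figure: four organizations of processors P1 to Pn, caches ($) and memory: (a) shared cache (switch, shared first-level $, interleaved main memory); (b) bus-based shared memory (private caches on a bus with memory and I/O devices); (c) dancehall (private caches, interconnection network, memory modules on the far side); (d) distributed memory (a memory module at each processor, interconnection network)]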
Memory Organization [1]
Caches and Cache Coherence [4]
• Caches play a key role in all cases
  – Reduce average data access time
  – Reduce bandwidth demands placed on the shared interconnect
• But private processor caches create a problem
  – Copies of a variable can be present in multiple caches
  – A write by one processor may not become visible to others
    • They’ll keep accessing the stale value in their caches
  – Cache coherence problem
  – Need to take actions to ensure visibility
Inconsistency in Data Sharing
• Suppose two processors each use (read) a data item X from a shared memory. Then each processor’s cache will have a copy of X that is consistent with the shared memory copy.
• Now suppose one processor modifies X (to X’). That processor’s cache is now inconsistent with the other processor’s cache and with the shared memory.
• With a write-through cache, the shared memory copy is made consistent immediately, but the other processor’s cache still holds the old value (X).
• With a write-back cache, the shared memory copy is updated only eventually, when the block containing X’ is replaced or invalidated; until then both the shared memory and the other cache hold the old value (X).
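The write-through case can be sketched as a toy model. This is an illustration under stated assumptions, not a description of real hardware: each of two processors has a one-entry private cache for X, and the only difference between the two write routines is whether the other cache's copy is invalidated. Invalidation is one possible corrective action, chosen here for the example; every name is hypothetical.

```c
#include <stdbool.h>

/* One-entry private cache holding a copy of X. */
typedef struct {
    bool valid;
    int  value;
} cache_line;

static int        memory_x;   /* shared-memory copy of X */
static cache_line cache[2];   /* private caches of processors 0 and 1 */

void reset_system(int x)
{
    memory_x = x;
    cache[0].valid = false;
    cache[1].valid = false;
}

/* A read hits in the private cache if the line is valid, else fetches from memory. */
int read_x(int p)
{
    if (!cache[p].valid) {
        cache[p].value = memory_x;
        cache[p].valid = true;
    }
    return cache[p].value;
}

/* Write-through with no coherence action: memory is updated, the other cache is not. */
void write_x_no_coherence(int p, int v)
{
    cache[p].value = v;
    cache[p].valid = true;
    memory_x = v;
}

/* Write-through plus invalidation of the other processor's copy. */
void write_x_invalidate(int p, int v)
{
    write_x_no_coherence(p, v);
    cache[1 - p].valid = false;   /* the next read by the other processor misses */
}
```

After both processors have read X = 5 and processor 0 writes 7 with `write_x_no_coherence`, `read_x(1)` still returns 5 although memory holds 7. With `write_x_invalidate`, `read_x(1)` misses and returns 7.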
Cache coherence
Mutual Exclusion [4]
• Provided by LOCK-UNLOCK around critical section
  – Set of operations we want to execute atomically
  – Implementation of LOCK/UNLOCK must guarantee mutual exclusion
• Can lead to significant serialization if contended
  – Especially since we expect non-local accesses in the critical section
• Mutex stands for “mutual exclusion”
Simple Software Lock [4]
lock:   ld  register, location   /* copy location to register */
        cmp register, #0         /* compare with 0 */
        bnz lock                 /* if not 0, try again */
        st  location, #1         /* store 1 to mark it locked */
        ret                      /* return control to caller */
and
unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */
• Problem: lock needs atomicity in its own implementation
  – Read (test) and write (set) of the lock variable by a process are not atomic
• Solution: atomic read-modify-write or exchange instructions
  – Atomically test the value of location and set it to another value; return success or failure somehow
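The failure can be shown by replaying one interleaving of the ld/cmp/st sequence for two processors. The sketch below is a hand-scheduled, single-threaded model written for illustration; `location`, `reg0`, `reg1` and `race_demo` are assumed names, and the schedule is chosen to expose the problem.

```c
/* Lock variable: 0 = free, 1 = held. */
static int location = 0;

/* Returns how many processors believe they hold the lock. */
int race_demo(void)
{
    location = 0;

    int reg0 = location;     /* P0: ld register, location  (sees 0) */
    int reg1 = location;     /* P1: ld register, location  (also sees 0) */

    int holders = 0;
    if (reg0 == 0) {         /* P0: cmp / bnz falls through */
        location = 1;        /* P0: st location, #1 */
        holders++;
    }
    if (reg1 == 0) {         /* P1: its test used the old value */
        location = 1;        /* P1: st location, #1 */
        holders++;
    }
    return holders;
}
```

`race_demo()` returns 2: both processors pass the test before either store happens, so mutual exclusion is violated.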
Atomic Exchange Instruction [4]
• Specifies a location and a register. In one atomic operation:
  – Value in location is read into the register
  – Another value (a function of the value read, or not) is stored into location
• Many variants
• Simple example: test&set
  – Value in location is read into a specified register
  – Constant 1 is stored into location
  – Successful if the value loaded into the register is 0
  – Other constants could be used instead of 1 and 0
• Can be used to build locks
Simple Test&Set Lock [4]
lock:   t&s register, location
        bnz lock                 /* if not 0, try again */
        ret                      /* return control to caller */
unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */
• Other read-modify-write primitives can be used too
  – Swap
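The same lock can be sketched in portable C, using C11 `atomic_exchange` as the swap-style read-modify-write primitive. This is an illustrative sketch, not the cited implementation; the type `ts_lock` and the `ts_*` function names are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock variable: 0 = free, 1 = held. */
typedef struct {
    atomic_int location;
} ts_lock;

/* test&set: atomically store 1 and return the previous value. */
static int test_and_set(ts_lock *l)
{
    return atomic_exchange(&l->location, 1);
}

/* One attempt: succeeds only if the value read was 0. */
bool ts_try_lock(ts_lock *l)
{
    return test_and_set(l) == 0;
}

/* Blocking acquire: spin until test&set returns 0 ("bnz lock"). */
void ts_lock_acquire(ts_lock *l)
{
    while (test_and_set(l) != 0) {
        /* busy wait */
    }
}

/* Release: write 0 to location. */
void ts_unlock(ts_lock *l)
{
    atomic_store(&l->location, 0);
}
```

A `ts_lock` initialized to `{0}` is free. The first `ts_try_lock` succeeds and a second one fails until `ts_unlock` is called.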
Mutual Exclusion: Altera [1]
• A mutex allows cooperating processors to agree that one of them should be allowed mutually exclusive access to a hardware resource in the system.
• Component that is added in SOPC Builder
• Shared memory can be accessed without using mutexes
Altera Mutex [2]
Example: Opening and locking a mutex [2]
#include <altera_avalon_mutex.h>
/* get the mutex device handle */
alt_mutex_dev* mutex = altera_avalon_mutex_open( "/dev/mutex" );
/* acquire the mutex, setting the value to one */
altera_avalon_mutex_lock( mutex, 1 );
/*
* Access a shared resource here.
*/
/* release the lock */
altera_avalon_mutex_unlock( mutex );
Performance Criteria (T&S Lock) [4]
• Uncontended latency
  – Very low if repeatedly accessed by the same processor; independent of p (the number of processors)
• Traffic
  – Lots if many processors compete; poor scaling with p
  – Each t&s generates invalidations, and all rush out again to t&s
• Storage
  – Very small (single variable); independent of p
• Fairness
  – Poor, can cause starvation
• Test&set with backoff is similar, but generates less traffic
• Luckily, better hardware primitives as well as algorithms exist
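A test&set lock with backoff might look like the following sketch. The exponential policy, the `BACKOFF_MIN` and `BACKOFF_MAX` bounds, and the delay loop are assumptions chosen for illustration; the source only states that backoff reduces traffic.

```c
#include <stdatomic.h>

#define BACKOFF_MIN 1u      /* assumed initial delay, in loop iterations */
#define BACKOFF_MAX 1024u   /* assumed cap on the delay */

/* Crude delay: spin locally without touching the lock variable. */
static void pause_for(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++) {
        /* no shared-memory access here, so no bus traffic from this loop */
    }
}

/* Acquire the lock; returns the number of failed test&set attempts. */
unsigned backoff_lock(atomic_int *location)
{
    unsigned delay = BACKOFF_MIN;
    unsigned failures = 0;

    while (atomic_exchange(location, 1) != 0) {   /* test&set failed */
        failures++;
        pause_for(delay);                         /* wait before retrying */
        if (delay < BACKOFF_MAX)
            delay *= 2;                           /* back off exponentially */
    }
    return failures;
}

void backoff_unlock(atomic_int *location)
{
    atomic_store(location, 0);
}
```

Each failed attempt costs one test&set and then a growing quiet period, so a contended lock sees fewer invalidating accesses per unit time than with the plain spin loop. The price is that a waiter may be idle in its delay when the lock is released.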
Improved Hardware Primitives: LL-SC [4]
• Goals:
  – Test with reads
  – Failed read-modify-write attempts don’t generate invalidations
  – Nice if a single primitive can implement a range of r-m-w operations
• Load-Locked (or -Linked), Store-Conditional
• LL reads variable into register
• Follow with arbitrary instructions to manipulate its value
• SC tries to store back to location if and only if no one else has written to the variable since this processor’s LL
  – If SC succeeds, all three steps happened atomically
  – If it fails, it doesn’t write or generate invalidations (need to retry LL)
  – Success indicated by condition codes
Simple Lock with LL-SC [4]
lock:   ll   reg1, location   /* LL location to reg1 */
        bnz  reg1, lock       /* if location is locked, try again */
        sc   location, reg2   /* SC reg2 into location */
        beqz lock             /* if failed, start again */
        ret                   /* return control to the caller of lock */
/*-------------------------*/
unlock: st   location, #0     /* write 0 to location */
        ret
• SC can fail (without putting a transaction on the bus) if:
  – It tries to get the bus but another processor’s SC gets the bus first
• LL, SC are not lock, unlock respectively
  – They only guarantee no conflicting write to the lock variable between them
  – But they can be used directly to implement simple operations on shared variables
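Standard C has no LL/SC instructions, so the sketch below uses C11 `atomic_compare_exchange_weak` as a stand-in. This is an approximation, named plainly: compare-and-swap checks the value rather than detecting any intervening write. On LL/SC architectures, though, compilers commonly lower it to an LL/SC pair, and the weak form is allowed to fail spuriously much as SC can. `fetch_and_op`, `add_one` and `twice` are hypothetical names.

```c
#include <stdatomic.h>

/*
 * Generic fetch&op on a shared variable, without a lock:
 * read the value, compute op(value) privately, and store it back only
 * if the location still holds the value that was read. Retry otherwise.
 * Returns the value that the successful update replaced.
 */
int fetch_and_op(atomic_int *location, int (*op)(int))
{
    int old = atomic_load(location);   /* plays the role of LL */

    /* Plays the role of SC; on failure `old` is refreshed with the current value. */
    while (!atomic_compare_exchange_weak(location, &old, op(old))) {
        /* another write intervened (or a spurious failure): try again */
    }
    return old;
}

/* Two example operations built on the same primitive. */
int add_one(int v) { return v + 1; }
int twice(int v)   { return 2 * v; }
```

`fetch_and_op(&x, add_one)` behaves like fetch&increment, and swapping in another function yields a different r-m-w operation from the same loop. That is the "single primitive, range of operations" goal stated above.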
A Simple Centralized Barrier [2]
• Shared counter maintains the number of processes that have arrived
  – Increment on arrival (under lock), then check until it reaches numprocs
struct bar_type {
    int counter;
    struct lock_type lock;
    int flag;                       /* initialized to 0 */
} bar_name;
BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;          /* reset flag if first to reach */
    mycount = ++bar_name.counter;   /* mycount is private; counts this arrival */
    UNLOCK(bar_name.lock);
    if (mycount == p) {             /* last to arrive */
        bar_name.counter = 0;       /* reset for next barrier */
        bar_name.flag = 1;          /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};   /* busy wait for release */
}
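A runnable variant is sketched below. It is not the barrier shown above. It swaps in two things, both assumptions made for illustration: the lock-protected counter becomes a C11 atomic counter, and the flag uses sense reversal, so the release value alternates between barrier episodes. That way a fast process re-entering the barrier cannot reset the flag before slow waiters have seen the release. `central_barrier`, `barrier_wait` and `local_sense` are hypothetical names.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int counter;   /* arrivals in the current episode */
    atomic_int flag;      /* release flag; the releasing value flips each episode */
} central_barrier;

/*
 * p           : number of participating processes
 * local_sense : private per-process variable, initialized to 0
 */
void barrier_wait(central_barrier *b, int p, int *local_sense)
{
    *local_sense = !*local_sense;                      /* value that means "released" this time */

    if (atomic_fetch_add(&b->counter, 1) + 1 == p) {   /* last to arrive */
        atomic_store(&b->counter, 0);                  /* reset for next barrier */
        atomic_store(&b->flag, *local_sense);          /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense) {
            /* busy wait for release */
        }
    }
}
```

Each process keeps its own `local_sense` and passes the same `central_barrier` (initialized to `{0, 0}`) on every call. The counter is reset before the flag is flipped, so early arrivals for the next episode are counted correctly.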
References
1. Altera Corp., Creating Multiprocessor Nios II Systems Tutorial, 2005.
2. Altera Corp., Altera Embedded Peripherals Handbook, 2005.
3. J. Kowalczyk, “Multiprocessor Systems,” Xilinx, 2003.
4. D. Culler, J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.