Includes slides from course CS162 at UC Berkeley, by Prof. Anthony D. Joseph and Ion Stoica, and from course CS194, by Prof. Katherine Yelick

Shared Memory Programming: Synchronization primitives

Ing. Andrea Marongiu ([email protected])
• Program is a collection of threads of control
  - Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
  - Also a set of shared variables, e.g., static variables, shared common blocks, or global heap
• Threads communicate implicitly by writing and reading shared variables
• Threads coordinate by synchronizing on shared variables
Figure: processors P0, P1, …, Pn access one shared memory (holding s, written and read as s = … and y = ..s…), while each keeps its own private memory (private copies of i: 2, 5, 8).
Shared Memory Programming
static int s = 0;

Thread 1:
    for i = 0, n/2-1
        s = s + sqr(A[i])

Thread 2:
    for i = n/2, n-1
        s = s + sqr(A[i])
• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write
  - the accesses are concurrent (not synchronized), so they could happen simultaneously
Shared Memory code for computing a sum
static int s = 0;

Thread 1:
    …
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    …

Thread 2:
    …
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    …
• Assume A = [3,5], f is the square function, and s = 0 initially
• For this program to work, s should be 34 at the end
  - but it may be 34, 9, or 25
• The atomic operations are reads and writes
  - Never see ½ of one number, but the += operation is not atomic
  - All computations happen in (private) registers
Figure: depending on the interleaving, s ends up 9, 25, or 34 (A = [3, 5], f = square).
Shared Memory code for computing a sum
static int s = 0;

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    s = s + local_s1

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    s = s + local_s2
• Since addition is associative, it's OK to rearrange order
  - Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
Shared Memory code for computing a sum
(the two updates s = s + local_s1 and s = s + local_s2 must be ATOMIC)
Atomic Operations

• To understand a concurrent program, we need to know what the underlying indivisible operations are!
• Atomic Operation: an operation that always runs to completion or not at all
  - It is indivisible: it cannot be stopped in the middle, and state cannot be modified by someone else in the middle
  - Fundamental building block: with no atomic operations, threads have no way to work together
• On most machines, memory references and assignments (i.e. loads and stores) of words are atomic
Role of Synchronization

• "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Types of Synchronization
  - Mutual Exclusion
  - Event synchronization
    • point-to-point
    • group
    • global (barriers)
• How much hardware support?

Mutual exclusion and event synchronization are the most used forms of synchronization in shared memory parallel programming.
Motivation: “Too much milk”
• Example: people need to coordinate:

| Time | Person A | Person B |
|------|----------|----------|
| 3:00 | Look in fridge. Out of milk | |
| 3:05 | Leave for store | |
| 3:10 | Arrive at store | Look in fridge. Out of milk |
| 3:15 | Buy milk | Leave for store |
| 3:20 | Arrive home, put milk away | Arrive at store |
| 3:25 | | Buy milk |
| 3:30 | | Arrive home, put milk away |
Definitions
• Synchronization: using atomic operations to ensure cooperation between threads
  - For now, only loads and stores are atomic
  - hard to build anything useful with only reads and writes
• Mutual Exclusion: ensuring that only one thread does a particular thing at a time
  - One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can execute at once
  - Critical section and mutual exclusion are two ways of describing the same thing
  - Critical section defines sharing granularity
More Definitions

• Lock: prevents someone from doing something
  - Lock before entering critical section and before accessing shared data
  - Unlock when leaving, after accessing shared data
  - Wait if locked
• Important idea: all synchronization involves waiting
• Example: fix the milk problem by putting a lock on the refrigerator
  - Lock it and take the key if you are going to go buy milk
  - Fixes too much (coarse granularity): roommate angry if they only want orange juice
• Need to be careful about correctness of concurrent programs, since they are non-deterministic
  - Always write down desired behavior first
  - think first, then code
• What are the correctness properties for the "Too much milk" problem?
  - Never more than one person buys
  - Someone buys if needed
• Restrict ourselves to use only atomic load and store operations as building blocks
Too Much Milk: Correctness properties
Too Much Milk: Solution #1

• Use a note to avoid buying too much milk:
  - Leave a note before buying (kind of "lock")
  - Remove note after buying (kind of "unlock")
  - Don't buy if note (wait)
• Suppose a computer tries this (remember, only memory read/write are atomic):

if (noMilk) {
    if (noNote) {
        leave Note;
        buy milk;
        remove note;
    }
}
• Result?
Too Much Milk: Solution #1

Thread A                          Thread B
if (noMilk) {
  if (noNote) {
                                  if (noMilk) {
                                    if (noNote) {
                                      leave Note;
                                      buy milk;
                                      remove note;
                                    }
                                  }
    leave Note;
    buy milk;
    remove note;
  }
}

Need to atomically update the lock variable!
How to Implement Lock?
• Lock: prevents someone from accessing something
  - Lock before entering critical section (e.g., before accessing shared data)
  - Unlock when leaving, after accessing shared data
  - Wait if locked
• Important idea: all synchronization involves waiting
  - Should sleep if waiting for a long time
• Hardware atomic instructions?
Examples of hardware atomic instructions
• test&set (&address) {              /* most architectures */
      result = M[address];
      M[address] = 1;
      return result;
  }

• swap (&address, register) {        /* x86 */
      temp = M[address];
      M[address] = register;
      register = temp;
  }

• compare&swap (&address, reg1, reg2) {   /* 68000 */
      if (reg1 == M[address]) {
          M[address] = reg2;
          return success;
      } else {
          return failure;
      }
  }
Atomic operations!
Implementing Locks with test&set
• Simple solution:
int value = 0;   // Free

Acquire() {
    while (test&set(value));   // while busy
}

Release() {
    value = 0;
}

• Simple explanation:
  - If lock is free, test&set reads 0 and sets value = 1, so lock is now busy. It returns 0, so the while exits
  - If lock is busy, test&set reads 1 and sets value = 1 (no change). It returns 1, so the while loop continues
  - When we set value = 0, someone else can get the lock

Reminder:
test&set (&address) {
    result = M[address];
    M[address] = 1;
    return result;
}
Too Much Milk: Solution #2

• Lock.Acquire() – wait until lock is free, then grab it
• Lock.Release() – unlock, waking up anyone waiting
• atomic operations – if two threads are waiting for the lock, only one succeeds in grabbing it
• Then, our milk problem is easy:

milklock.Acquire();
if (noMilk)
    buy milk;
milklock.Release();

• Once again, the section of code between Acquire() and Release() is called a "Critical Section"
static int s = 0;
static lock lk;

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    lock(lk);
    s = s + local_s1
    unlock(lk);

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    lock(lk);
    s = s + local_s2
    unlock(lk);

• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - The race condition on the update of shared s is now protected by the lock
Shared Memory code for computing a sum
Performance Criteria for Synch. Ops

• Latency (time per op)
  - How long does it take if you always win?
  - Especially important under light contention
• Bandwidth (ops per sec)
  - Especially important under high contention
  - How long does it take (averaged over threads) when many others are trying for it?
• Traffic
  - How many events on shared resources (bus, crossbar, …)?
• Storage
  - How much memory is required?
• Fairness
  - Can any one thread be "starved" and never get the lock?
Barriers

• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  - Wired-AND line separate from address/data bus
    • Set input high when you arrive, wait for output to be high to leave
  - In practice, multiple wires to allow reuse
  - Useful when barriers are global and very frequent
  - Difficult to support arbitrary subset of processors
    • even harder with multiple processes per processor
  - Difficult to dynamically change number and identity of participants
    • e.g. the latter due to process migration
  - Not common today on bus-based machines
struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;              /* reset flag if first to reach */
    mycount = ++bar_name.counter;       /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                 /* last to arrive */
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = 1;              /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};  /* busy wait for release */
}

A Simple Centralized Barrier

• Shared counter maintains number of processes that have arrived
  - increment when arrive (lock), check until it reaches numprocs
• Problem?
A Working Centralized Barrier

• Consecutively entering the same barrier doesn't work
  - Must prevent a process from entering until all have left the previous instance
  - Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive instances
  - Toggle this value only when all processes have reached the barrier

BARRIER (bar_name, p) {
    local_sense = !(local_sense);       /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;       /* mycount is private */
    if (bar_name.counter == p) {        /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = local_sense;    /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
Centralized Barrier Performance

• Latency
  - Centralized has critical path length at least proportional to p
• Traffic
  - About 3p bus transactions
• Storage Cost
  - Very low: centralized counter and flag
• Fairness
  - Same processor should not always be last to exit barrier
  - No such bias in centralized
• Key problems for centralized barrier are latency and traffic
  - Especially with distributed memory, traffic goes to same node
Improved Barrier Algorithm

Master-Slave barrier
• Master core gathers slaves on the barrier and releases them
• Use separate, per-core polling flags for different wait stages
• Separate gather and release trees
• Advantage: use of ordinary reads/writes instead of locks (array of flags)
• 2x(p-1) messages exchanged over the network
• Valuable in a distributed network: communicate along different paths

Figure: centralized barrier (contention on one location) vs. master-slave barrier.
Improved Barrier Algorithm
• Not all messages have the same latency
• Need for a locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  - e.g., p2012

Figure: master-slave barrier on a NUMA system: four clusters, each with PROC, MEM, and XBAR, connected through their NIs (network interfaces).
Improved Barrier Algorithm
Software combining tree
• Only k processors access the same location, where k is the degree of the tree: little contention
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage: use of ordinary reads/writes instead of locks

Figure: centralized barrier (contention) vs. software combining tree (little contention).
Improved Barrier Algorithm
• Hierarchical synchronization
• locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  - e.g., p2012

Figure: tree barrier on a NUMA system: four clusters, each with PROC, MEM, XBAR, and NI, synchronizing hierarchically.
Barrier performance
• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  - How is parallelism created?
  - How are dependencies (orderings) enforced?
• Data
  - Can data be shared, or is it all private?
  - How is shared data accessed or private data communicated?
• Synchronization
  - What operations can be used to coordinate parallelism?
  - What are the atomic (indivisible) operations?
Parallel programming models
• In this and the upcoming lectures we will see different programming models and the features that each provides with respect to
  - Control
  - Data
  - Synchronization
• Pthreads
• OpenMP
• OpenCL
Parallel programming models