spin locks and contention - cs.brown.edu

CS176Progress
• Models
– But idealized (so we forgot to mention a few things)
• Protocols
– Elegant
– Important
New Focus: Performance
• Protocols
– And realistic (your mileage may vary)
Art of Multiprocessor Programming
Kinds of Architectures
MIMD Architectures
• Memory Contention
• Communication Contention
• Communication Latency
Shared Bus
Today: Revisit Mutual Exclusion
• Performance, not just correctness
What Should you do if you can’t
get a lock?
• Give up the processor
– Always good on uniprocessor
Basic Spin-Lock
Review: Test-and-Set
• Boolean value
• Test-and-set (TAS)
– Return value tells if prior value was true or
false
• TAS aka “getAndSet”
boolean prior = value;
AtomicBoolean lock = new AtomicBoolean(false);
“test-and-set” or TAS
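A minimal sketch of what the return value of TAS tells the caller, using the standard `java.util.concurrent.atomic.AtomicBoolean`. The class and method names `TasDemo` and `twoCalls` are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TasDemo {
    // Performs two back-to-back TAS calls on a lock that starts out free
    // and returns the prior value each call observed.
    static boolean[] twoCalls() {
        AtomicBoolean lock = new AtomicBoolean(false);
        boolean first = lock.getAndSet(true);   // prior value was false: this caller wins
        boolean second = lock.getAndSet(true);  // prior value was true: this caller loses
        return new boolean[] { first, second };
    }
}
```

The first call sees `false` and so acquires the lock; the second sees `true` and would have to retry.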
Test-and-Set Locks
• Acquire lock by calling TAS
– If result is false, you win
– If result is true, you lose
• Release lock by writing false
Test-and-set Lock
class TASlock {
  AtomicBoolean state =
    new AtomicBoolean(false);
  void lock() {
    // keep trying until TAS returns false (lock was free)
    while (state.getAndSet(true)) {}
  }
  void unlock() {
    // release lock by writing false
    state.set(false);
  }
}
Space Complexity
• n-thread spin-lock uses O(1) space
• As opposed to O(n) Peterson/Bakery
• How did we overcome the Ω(n) lower
bound?
Performance
• Experiment
• How long should it take?
• How long does it take?
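One way such an experiment could be set up is sketched below. The workload (a fixed total number of locked counter increments split across the threads) and the names `SpinExperiment` and `run` are assumptions for illustration, not details taken from the slides.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SpinExperiment {
    static final AtomicBoolean state = new AtomicBoolean(false);
    static long counter = 0;                 // protected by the spin lock

    static void lock()   { while (state.getAndSet(true)) {} }
    static void unlock() { state.set(false); }

    // Runs `threads` workers that together perform `total` locked increments
    // and returns the elapsed time in nanoseconds.
    static long run(int threads, int total) {
        counter = 0;
        int perThread = total / threads;
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock();
                    try { counter++; } finally { unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return System.nanoTime() - start;
    }
}
```

With no contention overhead, the elapsed time for a fixed total amount of work would stay flat as threads are added; plotting `run(n, total)` against `n` shows how far a given lock is from that.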
Graph
[Figure: time plot showing the ideal curve]
Mystery #1
[Figure: time plot]
Test-and-Test-and-Set Locks
• Lurking stage
– Spin while read returns true (lock taken)
• Pouncing stage
– Read returns false (lock free)
– Call TAS to acquire lock
– If TAS loses, back to lurking
Test-and-test-and-set Lock
class TTASlock {
  AtomicBoolean state =
    new AtomicBoolean(false);
  void lock() {
    while (true) {
      // lurk: spin while read returns true (lock taken)
      while (state.get()) {}
      // pounce: lock looks free, call TAS to acquire
      if (!state.getAndSet(true))
        return;
    }
  }
}
Mystery #2
[Graph: TAS lock vs. TTAS lock]
Mystery
• Both
• Except that
– Neither approaches ideal
Opinion
• TAS & TTAS methods
• Need a more detailed model …
Bus-Based Architectures
• Processors and memory all
• Address & state information
Granularity
than a word
the address (today 64 or 128 bytes)
Locality
probably use it again soon
– Fetch from cache, not memory
• If you use an address now, you will
probably use a nearby address soon
– In the same cache line
L1 and L2 Caches
Jargon Watch
• Cache hit
– Good Thing™
• Cache miss
– “I had to shlep all the way to memory for
that data”
– Bad Thing™
Cave Canem
– Illustrates basic principles
When a Cache Becomes
• By evicting an existing entry
• Need a replacement policy
heuristic
Fully Associative Cache
– Advantage: can replace any line
– Disadvantage: hard to find lines
Direct Mapped Cache
– Disadvantage: must replace fixed line
K-way Set Associative Cache
– Advantage: pretty easy to find a line
– Advantage: some choice in replacing line
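The three organizations differ only in how many sets the cache is divided into. The sketch below assumes simple modulo placement (real hardware may hash addresses differently); `CacheIndex`, `lineOf`, and `setOf` are hypothetical names.

```java
class CacheIndex {
    // Cache line number that a byte address falls in.
    static long lineOf(long address, int lineSize) {
        return address / lineSize;
    }

    // Set that the address maps to in a cache with numSets sets.
    // numSets == 1                 -> fully associative (any line can go anywhere)
    // numSets == number of lines   -> direct mapped (exactly one possible place)
    // anything in between          -> k-way set associative
    static long setOf(long address, int lineSize, int numSets) {
        return lineOf(address, lineSize) % numSets;
    }
}
```

For example, with 64-byte lines and 128 sets, addresses 4160 and 12352 both map to set 65, so in a direct-mapped cache they would evict each other, while a k-way cache can hold both.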
Multicore Set Associativity
– Why? Because cores share sets
– Threads cut effective size if accessing
different data
Cache Coherence
• A writes to x
literature
MESI
• Modified
– Cached data has been modified, must be written back to memory
• Exclusive
– Not modified, no other cache has a copy
• Shared
– Not modified, may be cached elsewhere
• Invalid – Cache contents not meaningful
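The four states can be read as a small state machine per cache line. The following is a simplified sketch of the textbook MESI transitions, not a model of any particular processor; the names `Mesi`, `Event`, and `next` are hypothetical.

```java
class Mesi {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // Next state of one cache line in one cache.
    // othersHaveCopy only matters when a local read misses (line is INVALID).
    static State next(State s, Event e, boolean othersHaveCopy) {
        switch (e) {
            case LOCAL_READ:
                if (s == State.INVALID) {
                    return othersHaveCopy ? State.SHARED : State.EXCLUSIVE;
                }
                return s;                    // a hit leaves the state unchanged
            case LOCAL_WRITE:
                return State.MODIFIED;       // other copies, if any, get invalidated
            case REMOTE_READ:
                if (s == State.INVALID) return State.INVALID;
                return State.SHARED;         // a MODIFIED line is written back first
            case REMOTE_WRITE:
                return State.INVALID;        // another cache takes ownership
            default:
                throw new IllegalArgumentException("unknown event");
        }
    }
}
```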
[Figure: caches on the bus, with lines in state S]
Write-Through Caches
– For example, loop indexes …
Write-Back Caches
– Need the cache for something else
– Another processor wants it
[Figure sequence: Invalidate. Caches holding data and memory on the bus; a write to x invalidates the other cached copies]
in any cache, so no need to change it
now (expensive)
Mutual Exclusion
– Release/Acquire latency
Simple TASLock
– delayed behind spinners
Test-and-test-and-set
– Spin on local cache
• Problem: when lock is released
– Invalidation storm …
[Figure: caches each holding "busy"; bus; memory]
On Release
Problems
Measuring Quiescence Time
[Figure: processors P1, P2, …, Pn]
If pause < quiescence time,
Quiescence Time
Mystery Explained
[Graph: TAS lock vs. TTAS lock]
Solution: Introduce Delay
[Figure: timeline (spin, lock) with delays d, r1d, r2d]
• But I fail to get it
• There must be contention
Dynamic Example:
Exponential Backoff
[Figure: timeline (spin, lock) with delays d, 2d, 4d]
– Wait random duration before retry
– Each subsequent failure doubles expected wait
Exponential Backoff Lock
public void lock() {
int delay = MIN_DELAY;
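A complete backoff lock along these lines might look like the sketch below. The delay bounds, the use of `LockSupport.parkNanos` for waiting, and the `demo` helper are assumptions made for illustration; only `MIN_DELAY`, the random wait, and the doubling come from the slides.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class BackoffLock {
    static final int MIN_DELAY = 1_000;      // nanoseconds (assumed value)
    static final int MAX_DELAY = 1_000_000;  // nanoseconds (assumed value)
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        int delay = MIN_DELAY;
        while (true) {
            while (state.get()) {}               // lurk until the lock looks free
            if (!state.getAndSet(true)) return;  // pounce: TAS won
            // TAS lost, so there must be contention: wait a random duration
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(delay) + 1);
            if (delay < MAX_DELAY) delay = 2 * delay;  // double the expected wait
        }
    }

    public void unlock() { state.set(false); }

    // Usage: `threads` workers each add `perThread` to a counter under the lock.
    static int demo(int threads, int perThread) {
        BackoffLock lock = new BackoffLock();
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter[0];
    }
}
```

Good values for the two bounds depend on the machine and the number of threads, which is why they need tuning.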
Spin-Waiting Overhead
[Graph: time; TTAS lock vs. Backoff lock]
Backoff: Other Issues
Idea
• Each thread
Anderson Queue Lock
[Figure sequence: states idle, acquiring (getAndIncrement), acquired ("Mine!"), released ("Yow!")]
[Code annotations: take next slot; prepare slot for re-use; spin on my line]
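The steps named in the annotations (take next slot, spin on my own slot, prepare slot for re-use) can be sketched as follows. This version uses `AtomicIntegerArray` for the flags so each slot has volatile semantics, and it omits the padding that would put each slot on its own cache line; both choices, and the names `ALock` and `demo`, are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

class ALock {
    private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);
    private final AtomicInteger next = new AtomicInteger(0);
    private final AtomicIntegerArray flags;   // 1 = may enter, 0 = must wait
    private final int size;

    // capacity is assumed to be at least the number of contending threads
    ALock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity);
        flags.set(0, 1);                      // first slot starts out enabled
    }

    void lock() {
        // take next slot (assumes fewer than 2^31 acquisitions, so no overflow)
        int slot = next.getAndIncrement() % size;
        mySlot.set(slot);
        while (flags.get(slot) == 0) {}       // spin on my own slot
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot, 0);                   // prepare slot for re-use
        flags.set((slot + 1) % size, 1);      // pass the lock to the next slot
    }

    // Usage: `threads` workers each add `perThread` to a counter under the lock.
    static int demo(int threads, int perThread) {
        ALock lock = new ALock(threads);
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter[0];
    }
}
```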
Performance
Anderson Queue Lock
per thread
contenders?
CLH Lock
• FCFS order
Initially
[Figure: tail; node flag false; lock idle]
Purple Wants the Lock
Purple Has the Lock
Red Wants the Lock
Purple Releases
Space Usage
CLH Queue Lock
Code in book shows how it’s done.)
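A minimal sketch of a CLH queue lock, in which each thread swaps its node into the tail and spins on its predecessor's node, then recycles that node on release. The field and class names (`CLHLock`, `QNode`, `myPred`, `demo`) are assumptions for illustration and may differ from the book's code.

```java
import java.util.concurrent.atomic.AtomicReference;

class CLHLock {
    static final class QNode { volatile boolean locked; }

    // tail starts at a dummy node whose flag is false (lock is free)
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    void lock() {
        QNode node = myNode.get();
        node.locked = true;                  // announce: I want or hold the lock
        QNode pred = tail.getAndSet(node);   // swap my node into the tail
        myPred.set(pred);
        while (pred.locked) {}               // spin on predecessor's node
    }

    void unlock() {
        QNode node = myNode.get();
        node.locked = false;                 // notify successor
        myNode.set(myPred.get());            // recycle predecessor's node
    }

    // Usage: `threads` workers each add `perThread` to a counter under the lock.
    static int demo(int threads, int perThread) {
        CLHLock lock = new CLHLock();
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter[0];
    }
}
```

Recycling the predecessor's node is what keeps the space per thread small and constant: the thread's old node may still be watched by its successor, so it cannot be reused directly.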
CLH Lock
– Small, constant-sized space
architectures
NUMA and cc-NUMA
NUMA Machines
CLH Lock
memory
MCS Lock
• FCFS order
• Small, Constant-size overhead
Initially
[Figure: tail; node flags false; lock idle]
Acquiring
[Figure: acquiring thread's node (true) is swapped into tail while another thread has acquired]
Acquired
MCS Queue Lock
Purple Release
looking at the queue, I see
another thread is active
thread so must wait for it
to identify its node
MCS Queue Unlock
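A minimal sketch of an MCS queue lock, in which each thread spins on its own node and the releasing thread may have to wait for a successor to link itself in. The names `MCSLock`, `QNode`, and `demo` are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLock {
    static final class QNode {
        volatile boolean locked;
        volatile QNode next;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    void lock() {
        QNode node = myNode.get();
        QNode pred = tail.getAndSet(node);   // swap my node into the tail
        if (pred != null) {                  // queue was not empty
            node.locked = true;
            pred.next = node;                // identify my node to my predecessor
            while (node.locked) {}           // spin on my own node
        }
    }

    void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            // no known successor: if tail is still my node, the queue is empty
            if (tail.compareAndSet(node, null)) return;
            // another thread is active: wait for it to identify its node
            while (node.next == null) {}
        }
        node.next.locked = false;            // hand the lock to the successor
        node.next = null;                    // make my node reusable
    }

    // Usage: `threads` workers each add `perThread` to a counter under the lock.
    static int demo(int threads, int perThread) {
        MCSLock lock = new MCSLock();
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock();
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter[0];
    }
}
```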
Abortable Locks
• What if you want to give up waiting for a
lock?
Back-off Lock
• Extra benefit:
Queue Locks
• Need a graceful way out
Abortable CLH Lock
– Removing node in a wait-free way is hard
• Idea:
Initially
[Figure: tail]
Acquiring
[Figure: thread A, acquiring, swaps its node into tail]
Acquired
[Figure: A's node locked]
Normal Case
One Thread Aborts
Successor Notices
[Figure: nodes labeled spinning, timed out, locked]
Non-Null means
Recycle Predecessor’s Node
Spin on Earlier Node
Time-out Lock
static QNode AVAILABLE
public boolean lock(long timeout) {
  QNode qnode = new QNode();
  …
  Tell it about myPred
  …
public void unlock() {
  QNode qnode = myNode.get();
  if (!tail.compareAndSet(qnode, null))
  …
null, no clean up since no
successor waiting
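Putting the fragments together, a time-out lock of this kind might be sketched as below. The timeout unit (nanoseconds), the use of `System.nanoTime`, and the `demo` and `runAndJoin` helpers are assumptions; `AVAILABLE`, `QNode`, `myNode`, `myPred`, and the two `compareAndSet` calls follow the fragments in the slides.

```java
import java.util.concurrent.atomic.AtomicReference;

class TOLock {
    static final class QNode { volatile QNode pred; }
    static final QNode AVAILABLE = new QNode();   // sentinel: "lock released"

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = new ThreadLocal<>();

    // timeout is in nanoseconds (an assumption of this sketch)
    public boolean lock(long timeout) {
        long start = System.nanoTime();
        QNode qnode = new QNode();                // fresh node for each attempt
        myNode.set(qnode);
        QNode myPred = tail.getAndSet(qnode);
        if (myPred == null || myPred.pred == AVAILABLE) return true;  // lock was free
        while (System.nanoTime() - start < timeout) {
            QNode predPred = myPred.pred;
            if (predPred == AVAILABLE) return true;   // predecessor released
            if (predPred != null) myPred = predPred;  // non-null: predecessor aborted, spin on earlier node
        }
        // timed out: if my node is last, unlink it; otherwise tell successor about myPred
        if (!tail.compareAndSet(qnode, myPred)) qnode.pred = myPred;
        return false;
    }

    public void unlock() {
        QNode qnode = myNode.get();
        // if the CAS sets tail to null, no successor is waiting and no clean-up is needed
        if (!tail.compareAndSet(qnode, null)) qnode.pred = AVAILABLE;
    }

    private static void runAndJoin(Runnable r) {
        Thread t = new Thread(r);
        t.start();
        try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    // Usage: a second thread times out while the lock is held, then succeeds once it is free.
    static boolean[] demo() {
        TOLock lock = new TOLock();
        boolean[] results = new boolean[2];
        lock.lock(1_000_000L);                                     // main thread holds the lock
        runAndJoin(() -> results[0] = lock.lock(1_000_000L));      // gives up after about 1 ms
        lock.unlock();
        runAndJoin(() -> {
            results[1] = lock.lock(1_000_000_000L);
            if (results[1]) lock.unlock();
        });
        return results;
    }
}
```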
• Lock Fairness is FCFS
• Is this a good fit with NUMA and Cache-Coherent NUMA machines?
Lock Data Access in NUMA Machine
[Figure: MCS lock accessed from Node 1 and Node 2]
• locality crucial to NUMA performance
• Big gains if threads from same
node/cluster obtain lock consecutively
[Figure: timeline with delays d, 2d, 4d]
Unfairness is key to performance
Hierarchical Backoff Lock (HBO)
lock word
Thread at local head splices local queue into global queue
Each thread spins on cached copy of predecessor’s node
Threads access 4 cache lines in CS
HBO
HCLH
– Splicing into both local and global requires CAS
– Hard to get long local sequences
Lock Cohorting
any lock into a NUMA lock
• Allows combining different lock types
• But need these locks to have certain
properties (will discuss shortly)
Lock Cohorting
Global Lock
On release: if non-empty cohort of waiting threads, release only local lock; leave mark
CS
empty cohort
On release: since cohort is empty must release global lock to avoid deadlock
Thread that acquired local lock can now acquire global lock…
Thread Obliviousness
– After being acquired by one thread,
– Can be released by another
Cohort Detection
– It can tell whether any thread is trying to
acquire it
Lock Cohorting
than one releasing it
• Local lock: cohort detection
is waiting to acquire it
Lock Cohorting: C-BO-MCS Lock
Local MCS lock tail
In MCS Lock, cohort detection by checking successor pointer
Bound number of consecutive acquires to control unfairness
Two new states: acquire local and acquire global. Do we own global lock?
Lock Cohorting: C-BO-BO Lock
How to add cohort detection property to BO lock?
[Figure: two backoff timelines, each with delays d, 2d, 4d]
successorExists reset on lock release.
Release might overwrite another successor’s write … but we don’t care…why?
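A rough sketch of a cohort lock built from two backoff locks is given below. It rests on several assumptions: clusters are passed in explicitly as an integer, the delay bounds and `MAX_PASSES` are illustrative, and `successorExists` is cleared by the new holder right after it acquires the local lock (a variation on resetting at release, chosen here so that a stale flag cannot leave the global lock held with nobody waiting). The names `CohortBoBoLock`, `Local`, `ownsGlobal`, `passes`, and `demo` are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

class CohortBoBoLock {
    static final int MAX_PASSES = 64;                         // bound on consecutive local hand-offs
    static final int MIN_DELAY = 1_000, MAX_DELAY = 100_000;  // nanoseconds (assumed)

    static final class Local {
        final AtomicBoolean state = new AtomicBoolean(false);
        volatile boolean successorExists;  // cohort detection: set by waiting threads
        boolean ownsGlobal;                // guarded by state: does this cluster hold the global lock?
        int passes;                        // guarded by state
    }

    private final AtomicBoolean global = new AtomicBoolean(false);
    private final Local[] locals;

    CohortBoBoLock(int clusters) {
        locals = new Local[clusters];
        for (int i = 0; i < clusters; i++) locals[i] = new Local();
    }

    // TTAS with exponential backoff; waiters on a local lock keep the hint set.
    private static void acquire(AtomicBoolean word, Local hint) {
        int delay = MIN_DELAY;
        while (true) {
            while (word.get()) {
                if (hint != null && !hint.successorExists) hint.successorExists = true;
            }
            if (hint != null) hint.successorExists = true;
            if (!word.getAndSet(true)) return;
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(delay) + 1);
            if (delay < MAX_DELAY) delay *= 2;
        }
    }

    void lock(int cluster) {
        Local l = locals[cluster];
        acquire(l.state, l);
        l.successorExists = false;         // clear my own announcement
        if (!l.ownsGlobal) {               // previous local holder released the global lock
            acquire(global, null);
            l.ownsGlobal = true;
            l.passes = 0;
        }
    }

    void unlock(int cluster) {
        Local l = locals[cluster];
        if (l.successorExists && l.passes < MAX_PASSES) {
            l.passes++;                    // non-empty cohort: keep the global lock, leave mark
        } else {
            l.ownsGlobal = false;
            global.set(false);             // may be a different thread than the one that acquired it
        }
        l.state.set(false);                // release local lock
    }

    // Usage: thread i belongs to cluster i % clusters and adds perThread to a counter.
    static int demo(int clusters, int threads, int perThread) {
        CohortBoBoLock lock = new CohortBoBoLock(clusters);
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int cluster = i % clusters;
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    lock.lock(cluster);
                    try { counter[0]++; } finally { lock.unlock(cluster); }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter[0];
    }
}
```

A missed `successorExists` write only costs performance: the releasing thread gives up the global lock when it could have passed it on locally, which is why losing a successor's write is harmless.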
C-BO-BO is a Time-Out
If releasing thread finds successorExists false, it releases global lock
Aborting thread resets successorExists field before leaving local lock. Spinning threads set it to true.
Lock Cohorting
– Practically no tuning
– Has whatever properties you want: • Can be more or less fair, abortable…
just choose the appropriate type of locks…
• Disadvantages:
[Graph: x-axis ticks from 1 to 256; series A-BO-BO]
One Lock To Rule Them All?
• TTAS+Backoff, CLH, MCS, ToLock…
• Each better than others in some way
• There is no one solution
• Lock we pick really depends on:
– the application
– the hardware
This work is licensed under a Creative Commons Attribution-
ShareAlike 2.5 License.
• You are free:
– to Remix — to adapt the work
• Under the following conditions:
– Attribution. You must attribute the work to “The Art of
Multiprocessor Programming” (but not in any way that suggests that
the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you
may distribute the resulting work only under the same, similar or a
compatible license.
• For any reuse or distribution, you must make clear to others the license
terms of this work. The best way to do this is with a link to
– http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from
the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
