spcl.inf.ethz.ch
@spcl_eth
MACIEJ BESTA, HERMANN SCHWEIZER, TORSTEN HOEFLER
Evaluating the Cost of Atomic Operations on Modern Architectures
LARGE-SCALE IRREGULAR GRAPH PROCESSING
[1] A. Lumsdaine et al. Challenges in Parallel Graph Processing. Parallel Processing Letters. 2007.
A BRIEF SUMMARY OF RESEARCH I DO IN HPC
Computing abstractions
Hardware
Topologies
OS, middleware
Algorithms
Programming models
PACT’15, SC14, ICS’15, HPDC’15, HPDC’14, SC13, HPDC’16
Most of these layers require efficient synchronization
SYNCHRONIZATION MECHANISMS
LOCKS
[Figure: Proc p and Proc q; an example graph]
Intuitive semantics
Serialization
Possibly complex protocols
High performance distributed locks?
SYNCHRONIZATION MECHANISMS
ATOMIC OPERATIONS
[Figure: Proc p and Proc q]
High performance
Very common, a true hardware mechanism
Complex access patterns
Complex protocols
Subtle issues (ABA problem, ...)
Do we really understand their performance?
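The ABA problem named on this slide can be illustrated with a small software model. This is only a sketch: `AtomicCell`, `aba_demo`, and `tagged_demo` are hypothetical names, a lock stands in for hardware atomicity, and the interleaving of processes p and q is replayed sequentially so the outcome is deterministic. The version counter in the second function is one common remedy, not something the slides prescribe.

```python
import threading


class AtomicCell:
    """Software model of one atomic word (a lock emulates hardware atomicity)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def aba_demo():
    cell = AtomicCell("A")
    seen = cell.load()                 # process p reads A, then is delayed
    cell.compare_and_swap("A", "B")    # process q changes A -> B
    cell.compare_and_swap("B", "A")    # process q changes B -> A again
    # p's CAS succeeds even though the cell was modified twice in between
    return cell.compare_and_swap(seen, "C")


def tagged_demo():
    cell = AtomicCell(("A", 0))        # (value, version counter)
    seen = cell.load()
    cell.compare_and_swap(("A", 0), ("B", 1))
    cell.compare_and_swap(("B", 1), ("A", 2))
    # the version no longer matches, so p's stale CAS is rejected
    return cell.compare_and_swap(seen, ("C", seen[1] + 1))
```

In the first function the final CAS reports success despite the intermediate updates; in the second the version counter exposes them and the CAS fails.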
ATOMICS: POPULARITY
[PPoPP’14] Fast Concurrent Lock-Free Binary Search Trees
[PPoPP’14] A General Technique for Non-blocking Trees
[PPoPP’14] Practical Concurrent Binary Search Trees via Logical Ordering
[PPoPP’14] A Practical Wait-Free Simulation for Lock-Free Data Structures
[PPoPP’15]
[PPoPP’15]
[SPAA’15]
[SPAA’16]
Used in so many designs...
But do we really know their raw performance?
Raw performance... Of what?
ATOMICS: PERFORMANCE DIMENSIONS
Cache level?
Locality?
ATOMICS: PERFORMANCE DIMENSIONS
Cache coherence state?
Atomic?
Contention?
Operand size?
Performance metrics?
Alignment?
Mechanisms?
Architecture
ATOMICS: PERFORMANCE DIMENSIONS
Atomic?
Compare-and-Swap (CAS)
Fetch-and-Add (FAA)
Swap (SWP)
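The semantics of the three atomics listed above can be sketched in software. This is a model of what each operation does, not of how hardware implements it: the class `AtomicInt` and the helpers `cas_increment` and `run_counter` are hypothetical, and a lock emulates the atomicity that a processor provides in a single instruction.

```python
import threading


class AtomicInt:
    """Software model of an atomic integer (a lock emulates hardware atomicity)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """CAS: write `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

    def fetch_and_add(self, delta):
        """FAA: add `delta` unconditionally and return the previous value."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def swap(self, new):
        """SWP: write `new` unconditionally and return the previous value."""
        with self._lock:
            old, self._value = self._value, new
            return old


def cas_increment(cell):
    """Increment built from CAS: retry until no other thread interfered."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old


def run_counter(num_threads=4, increments=1000):
    """Each thread increments once via FAA and once via CAS per iteration."""
    cell = AtomicInt(0)

    def worker():
        for _ in range(increments):
            cell.fetch_and_add(1)
            cas_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

The difference that matters for the rest of the talk is visible here: FAA and SWP always succeed, while CAS can fail and usually sits inside a retry loop.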
ATOMICS: PERFORMANCE DIMENSIONS
Performance metrics?
Latency
Bandwidth
ATOMICS: PERFORMANCE DIMENSIONS
Cache coherence state?
Modified: in one cache and dirty
Exclusive: in one cache and clean
Shared: in >1 cache and clean
Invalid: garbage data
ATOMICS: PERFORMANCE DIMENSIONS
Architecture
RESEARCH QUESTIONS
How do we model the performance of atomics?
What is the performance difference between various atomics?
What is the influence of various parameters and mechanisms?
LATENCY MODEL
[Diagram: Core; Cache, Cache, Cache, …; Cache line]
Parameters: Atomic, Cache coherence state
Read for ownership = max(read latency, invalidation latency)
Execute = constant
LATENCY MODEL
EXCLUSIVE OR MODIFIED STATE
[Diagram: Core; Cache, Cache, Cache, …; Cache line]
Parameters: Atomic, Cache coherence state
Read for ownership = read latency
Execute = constant
[Plot legend: predictions; observed data; mean of observed data]
LATENCY
HASWELL, EXCLUSIVE
LATENCY
BULLDOZER, EXCLUSIVE
[Plot panels: CAS, FAA]
LATENCY
HASWELL, EXCLUSIVE
Alignment?
LATENCY
BULLDOZER, EXCLUSIVE
Operand size?
[Plot panels: 64 bit, 128 bit]
BANDWIDTH
HASWELL, ATOMICS
CONCLUSIONS
PERFORMANCE INSIGHTS
Different atomics have the same latency in most scenarios
CAS is the fastest in some cases
Small operand sizes give the best performance
Unaligned atomics should be avoided at all costs
No parallel execution (low bandwidth) even if there are no data dependencies
LATENCY
HASWELL, EXCLUSIVE
[Plot series: Atomics, Read]
LATENCY
HASWELL, MODIFIED
[Plot series: Atomics, Read]
LATENCY
IVY BRIDGE, EXCLUSIVE
[Plot series: CAS]
LATENCY
XEON PHI, MODIFIED / EXCLUSIVE
[Plot series: CAS]
LATENCY MODEL
SHARED STATE
[Diagram: Core; Cache, Cache, Cache, …; Cache line]
Parameters: Atomic, Cache coherence state
Read for ownership = invalidation latency
Execute = constant
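The latency model on these slides can be written down as a small function. This is a sketch under stated assumptions: the slides give the two components (read for ownership and execute) but not how they combine, so adding them is an assumption here, and the function names and the example numbers are hypothetical rather than measured values.

```python
def read_for_ownership(state, read_latency, invalidation_latency):
    """Read-for-ownership cost per coherence state, following the slides.

    General form: max(read latency, invalidation latency).
    Exclusive or Modified: reduces to the read latency.
    Shared: reduces to the invalidation latency.
    """
    if state in ("E", "M"):
        return read_latency
    if state == "S":
        return invalidation_latency
    raise ValueError("state must be 'E', 'M', or 'S'")


def atomic_latency(state, read_latency, invalidation_latency, execute_latency):
    """Modeled latency of one atomic.

    ASSUMPTION: the two components are summed; the slides list them
    side by side without stating the combining operator.
    """
    rfo = read_for_ownership(state, read_latency, invalidation_latency)
    return rfo + execute_latency


# Hypothetical inputs in nanoseconds, chosen only to exercise the function.
EXAMPLE = {"read_latency": 10, "invalidation_latency": 25, "execute_latency": 5}
```

With the hypothetical inputs above, `atomic_latency("E", **EXAMPLE)` evaluates to 15 and `atomic_latency("S", **EXAMPLE)` to 30: the execute term is the same in both, and only the read-for-ownership term changes with the coherence state.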
LATENCY
HASWELL, SHARED
[Plot series: Atomics, Read]
How to force cache coherence state
F(M): Write cache line (invalidates all copies)
F(E): F(M), then flush, then read
F(S): F(E), then read by some other core
D. Hackenberg, D. Molka, W. Nagel. Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems. MICRO’09
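The three recipes can be traced on a toy model of a single cache line. This is a sketch only: `ToyMESILine` and the `force_*` helpers are hypothetical names, the model implements plain MESI for one line, and real processors use protocol variants and cache hierarchies that it ignores.

```python
class ToyMESILine:
    """State of one cache line in each core's private cache (plain MESI)."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores

    def write(self, core):
        # A write invalidates all other copies and leaves the writer in M.
        self.state = ["I"] * len(self.state)
        self.state[core] = "M"

    def flush(self, core):
        # Flushing writes the line back (if dirty) and drops the local copy.
        self.state[core] = "I"

    def read(self, core):
        if self.state[core] != "I":
            return  # cache hit, state unchanged
        holders = [c for c, s in enumerate(self.state) if s != "I"]
        if holders:
            # Another cache holds the line: every copy becomes Shared.
            for c in holders:
                self.state[c] = "S"
            self.state[core] = "S"
        else:
            # No other copy exists: the reader gets the line Exclusive.
            self.state[core] = "E"


def force_modified(line, core):
    """F(M): write the cache line."""
    line.write(core)


def force_exclusive(line, core):
    """F(E): F(M), then flush, then read."""
    force_modified(line, core)
    line.flush(core)
    line.read(core)


def force_shared(line, core, other_core):
    """F(S): F(E), then a read by some other core."""
    force_exclusive(line, core)
    line.read(other_core)
```

For a two-core line, `force_modified(line, 0)` leaves the states as `["M", "I"]`, `force_exclusive(line, 0)` as `["E", "I"]`, and `force_shared(line, 0, 1)` as `["S", "S"]`.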