Compiler and Runtime Support for Efficient Software Transactional Memory
DESCRIPTION
Transcript of the slide presentation "Compiler and Runtime Support for Efficient Software Transactional Memory" by Vijay Menon, with Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. Motivation: multi-core architectures are mainstream, and software concurrency is needed for scalability.
Compiler and Runtime Support for Efficient
Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
• GC : memory allocation ≈ TM : mutual exclusion
Composability
class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name); // Get the current balance
      balance = balance + amount;       // Increment it
      accounts.put(name, balance);      // Set the new balance
    }
  }
  …
}
Thread-safe, but no scaling
• ConcurrentHashMap (Java 5/JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name); // Get the current balance
      balance = balance + amount;       // Increment it
      accounts.put(name, balance);      // Set the new balance
    }
  }
  …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
Transactions are Composable
[Chart: Scalability, 10,000,000 operations — scalability (y-axis, 0–4) vs. # of processors (0–16); series: Synchronized, Transactional. Measured on a 16-way 2.2 GHz Xeon system.]
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT

Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
System Overview
Components: Polyglot (source compiler), StarJIT (dynamic compiler), ORP VM, McRT STM.
Compilation pipeline: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code.
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  atomic {S}
• Retry: block until alternate path possible
  atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
  when (condition) {S}

Builds on prior research:
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10
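As an illustration of how these constructs compose, here is a one-slot buffer in the deck's Transactional Java syntax. This is a sketch only: atomic, retry, and orelse are the language extensions above, not compilable standard Java, and OneSlotBuffer is a hypothetical class.

```
class OneSlotBuffer {
    Object slot;  // null = empty

    void put(Object x) {
        atomic {
            if (slot != null) retry;   // block until the slot is free
            slot = x;
        }
    }

    Object takeOrNull() {
        atomic {
            if (slot == null) retry;   // would block waiting for an item...
            Object x = slot;
            slot = null;
            return x;
        } orelse {
            return null;               // ...but orelse supplies an alternative
        }
    }
}
```

Because retry in the first orelse branch transfers control to the next branch, takeOrNull returns immediately instead of blocking when the buffer is empty.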
Transactional Java → Java
Transactional Java
atomic {
S;
}
STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
Standard Java + STM API
while (true) {
  TxnHandle th = txnStart();
  try {
    S’;
    break;
  } finally {
    if (!txnCommit(th))
      continue;
  }
}
JVM STM support
On-demand cloning of methods called inside transactions

Garbage collection support
• Enumeration of refs in read set, write set & undo log

Extra transaction record field in each object
• Supports both word & object granularity

Native method invocation throws exception inside transaction
• Some intrinsic functions allowed

Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)

Writes:
– strict two-phase locking
– update in place
– undo on abort

Reads:
– versioning
– validation before commit

Granularity per type
– Object-level: small objects
– Word-level: large arrays

Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
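A minimal, single-threaded sketch of this protocol: versioned, optimistic reads validated at commit, and pessimistic in-place writes rolled back from an undo log on abort. All names here (Cell, Txn, the stm* methods) are illustrative stand-ins, not the actual McRT-STM API, and locking of the exclusive state is omitted since the sketch is single-threaded.

```java
import java.util.ArrayList;
import java.util.List;

public class StmSketch {
    // A transactional memory word: a value plus its transaction record.
    // Odd record value = version number (shared state); the exclusive
    // (locked) state is not modeled in this single-threaded sketch.
    static class Cell {
        int value;
        int version = 1; // odd = shared / read-only
    }

    static class Txn {
        // Read set: (cell, version observed at read time).
        final List<Cell> readCells = new ArrayList<>();
        final List<Integer> readVersions = new ArrayList<>();
        // Undo log: (cell, original value) for in-place writes.
        final List<Cell> undoCells = new ArrayList<>();
        final List<Integer> undoValues = new ArrayList<>();

        int stmRd(Cell c) {            // log the version, return current value
            readCells.add(c);
            readVersions.add(c.version);
            return c.value;
        }

        void stmWr(Cell c, int v) {    // log the old value, update in place
            undoCells.add(c);
            undoValues.add(c.value);
            c.value = v;
        }

        boolean stmCommit() {
            // Validate: every version in the read set must be unchanged.
            for (int i = 0; i < readCells.size(); i++) {
                if (readCells.get(i).version != readVersions.get(i)) {
                    abort();
                    return false;
                }
            }
            // Bump versions of written cells to the next odd number.
            for (Cell c : undoCells) c.version += 2;
            return true;
        }

        void abort() {                 // roll back in-place writes, newest first
            for (int i = undoCells.size() - 1; i >= 0; i--) {
                undoCells.get(i).value = undoValues.get(i);
            }
        }
    }

    public static void main(String[] args) {
        Cell a = new Cell(); a.value = 4;
        Cell b = new Cell();
        Txn t = new Txn();             // atomic { B = A + 5; }
        t.stmWr(b, t.stmRd(a) + 5);
        System.out.println(t.stmCommit() + " " + b.value); // true 9
    }
}
```

Note how the design matches the slide's claims: reads and writes touch memory directly (no buffering), and only aborts pay the cost of copying old values back.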
Ensuring Atomicity: Novel Combination
Memory ops
Reads: optimistic concurrency
Writes: pessimistic concurrency

Optimistic reads: + caching effects, + avoids lock operations
Pessimistic writes: + in-place updates, + fast commits, + fast reads

Quantitative results in PPoPP’06
McRT-STM: Example
atomic {
  B = A + 5;
}

stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word

Two states:
• Shared (low bit is 1)
– Read-only / multiple readers
– Value is version number (odd)
• Exclusive (low bit is 0)
– Write-only / single owner
– Value is thread transaction descriptor (4-byte aligned)

Mapping
• Object: slot in object
• Field: hashed index into global record table
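A sketch of the word-granularity mapping and the low-bit state encoding. The table size and hash-mixing function here are assumptions for illustration; only the hash-of-(obj.hash, offset) scheme and the low-bit convention follow the slides.

```java
public class TxnRecordSketch {
    // Global table of transaction records (size is an arbitrary choice).
    static final int TABLE_SIZE = 1 << 16;
    static final long[] records = new long[TABLE_SIZE];

    // The hash is f(obj.hash, offset); this particular mix is illustrative.
    static int recordIndex(Object obj, int fieldOffset) {
        int h = System.identityHashCode(obj) * 31 + fieldOffset;
        return (h & 0x7fffffff) % TABLE_SIZE;
    }

    // Look up the transaction record guarding a given field.
    static long recordFor(Object obj, int fieldOffset) {
        return records[recordIndex(obj, fieldOffset)];
    }

    // Low bit 1: shared; the value is an odd version number.
    static boolean isShared(long rec) { return (rec & 1) == 1; }

    // Low bit 0: exclusive; the value is a 4-byte-aligned pointer
    // to the owning thread's transaction descriptor.
    static boolean isExclusive(long rec) { return (rec & 1) == 0; }
}
```

The 4-byte alignment is what makes the encoding unambiguous: a descriptor pointer always has its low bit clear, so any odd value must be a version number.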
Transaction Record: Example
Every data item has an associated transaction record.

Object granularity: an extra transaction record field (TxR) in each object, alongside the vtbl and the fields x, y.

Word granularity: object words hash into a global table of transaction records TxR1 … TxRn; the hash is f(obj.hash, offset).

class Foo {
  int x;
  int y;
}
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …

Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number

Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)

In an atomic region
– A read operation appends to the read set
– A write operation appends to the write set and undo log
– GC enumerates the read/write/undo logs
McRT-STM: Example
class Foo {
  int x;
  int y;
}
Foo bar, foo;

T1:
atomic {
  t = foo.x;
  bar.x = t;
  t = foo.y;
  bar.y = t;
}

T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

[Trace: initially foo = {x = 9, y = 7} with transaction record version 3, and bar = {x = 0, y = 0} with version 5. T1 logs the read <foo, 3>, acquires bar for writing with undo entries <bar.x, 0> and <bar.y, 0>, and updates bar in place. T2 logs the read <bar, 5> but then must wait on bar, which T1 holds exclusively. T1 commits and bumps bar's version to 7, so T2's validation of <bar, 5> fails and T2 aborts and retries.]

• T2 should read [0, 0] or [9, 7], never an intermediate state
Early Results: Overhead breakdown
[Chart: STM time breakdown (%) on a single processor for Binary tree, Hashtable, Linked list, and Btree; components: TLS access, STM write, STM commit, STM validate, STM read.]
STM read & validation overheads dominate
Good optimization targets
System Overview
Components: Polyglot (source compiler), StarJIT (dynamic compiler), ORP VM, McRT STM.
Compilation pipeline: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code.
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation → greater optimization opportunities
Representing Read/Write Barriers
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

stmWr(&a.x, t1);
stmWr(&a.y, t2);
if (stmRd(&a.z) == 0) {
  stmWr(&a.x, 0);
  stmWr(&a.z, t3);
}
Traditional barriers hide redundant locking/logging
An STM IR for Optimization
Redundancies exposed:
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(&a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z == 0) {
  txnOpenForWrite(a);
  txnLogObjectInt(&a.x, a);
  a.x = 0;
  txnOpenForWrite(a);
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}
Optimized Code
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnLogObjectInt(&a.y, a);
a.y = t2;
if (a.z == 0) {
  a.x = 0;
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}
Fewer & cheaper STM operations
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in the presence of nesting

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
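To illustrate the transaction-local idea: an object allocated inside the current transaction needs no undo logging, because an abort discards the object entirely. The sketch below checks locality dynamically with a hypothetical txnNew/needsUndoLog pair; the paper's optimization instead proves transaction-locality at compile time and removes the barriers statically.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class TxnLocalSketch {
    // Objects allocated since txnStart (identity-based, since locality
    // is a property of the object instance, not its contents).
    static final Set<Object> txnLocal =
        Collections.newSetFromMap(new IdentityHashMap<>());

    // Hypothetical hook at allocation sites inside a transaction.
    static <T> T txnNew(T freshObject) {
        txnLocal.add(freshObject);
        return freshObject;
    }

    // Write barrier decision: skip undo logging for transaction-local
    // objects; abort simply drops them.
    static boolean needsUndoLog(Object target) {
        return !txnLocal.contains(target);
    }

    // On commit or abort, the set would be cleared for the next transaction.
    static void txnEnd() { txnLocal.clear(); }
}
```

The immutable-field optimization from the same bullet list works analogously, but removes read barriers: a field that can never change (e.g. the vtable, String contents) cannot cause a conflict, so it needs no open-for-read.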
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four processors)

Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x overhead on 1P. With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization

[Chart: % overhead on 1P (0–90%) for HashMap and TreeMap; bars: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining.]
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe, no concurrency control

Synchronized
• Coarse-grain synchronization via the SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides the bucket array into 16 segments (size / locking)

Atomic
• Transactional version via an “AtomicMap” wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment, à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
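The "Atomic Prime" hand optimization described above can be sketched in plain Java: instead of one shared size field (a conflict hot spot under STM), keep one counter per segment and sum on demand. SegmentedSize and its method names are illustrative, not from the paper's code.

```java
public class SegmentedSize {
    // One counter per segment; threads updating different segments
    // touch different words, so their transactions do not conflict.
    private final int[] counts;

    SegmentedSize(int segments) {
        counts = new int[segments];
    }

    private int segmentFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % counts.length;
    }

    // Called (inside the atomic block) when an entry is added/removed.
    void onInsert(Object key) { counts[segmentFor(key)]++; }
    void onRemove(Object key) { counts[segmentFor(key)]--; }

    // size() still reads every counter, so it conflicts with all updates;
    // the win is that updates no longer conflict with each other.
    int size() {
        int s = 0;
        for (int c : counts) s += c;
        return s;
    }
}
```

This mirrors the per-segment counting that ConcurrentHashMap's JSR 166 implementation uses, which is why the slides call the change "minor hand optimization" rather than a redesign.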
Scalability: 100% Gets
The Atomic wrapper is competitive with ConcurrentHashMap, and the effect of the compiler optimizations scales.

[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); series: Unsafe, Synchronized, Concurrent, Atomic (No Opt), Atomic.]
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on its 16 segments; Atomic still scales.

[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); series: Synchronized, Concurrent, Atomic (No Opt), Atomic.]
20% Inserts and Removes
Atomic conflicts on the entire bucket array, because the array is a single object.

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Atomic.]
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic.]
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck: no degradation and a modest performance gain.

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime.]
20% Inserts and Removes: Mixed-Level
Mixed-level preserves the wins and reduces overheads:
– word-level for arrays
– object-level for non-arrays

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime.]
Scalability: java.util.TreeMap
Results similar to HashMap.

[Chart 1: 100% Gets — scalability (0–16) vs. # of processors (0–16); series: Unsafe, Synchronized, Atomic.]
[Chart 2: 80% Gets — scalability (0–1.2) vs. # of processors (0–16); series: Synchronized, Atomic, Atomic Prime.]
Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization.

Operations & traversal over a synthetic database.

[Chart: scalability (0–6) vs. # of processors (0–16); series: Atomic, Synch (Coarse), Synch (Med.), Synch (Fine).]
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot

Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give competitive performance to complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention

Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support