Compiler and Runtime Support for Efficient Software Transactional Memory
DESCRIPTION
Transcript of the slide presentation "Compiler and Runtime Support for Efficient Software Transactional Memory" by Vijay Menon, with Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. Motivation: multi-core architectures are mainstream, and software concurrency is needed for scalability.
Compiler and Runtime Support for Efficient
Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
• GC : memory allocation ≈ TM : mutual exclusion
Composability
class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name); // Get the current balance
      balance = balance + amount;       // Increment it
      accounts.put(name, balance);      // Set the new balance
    }
  }
  …
}
Thread-safe, but no scaling
• ConcurrentHashMap (Java 5/JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name); // Get the current balance
      balance = balance + amount;       // Increment it
      accounts.put(name, balance);      // Set the new balance
    }
  }
  …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
Transactions are Composable
[Chart: Scalability, 10,000,000 operations — scalability (y-axis, 0–4) vs. # of processors (0–16); series: Synchronized, Transactional. Measured on a 16-way 2.2 GHz Xeon system.]
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT

Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
System Overview
Components: Polyglot (source compiler), StarJIT (dynamic compiler), ORP VM, McRT STM.
Compilation pipeline: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code.
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  atomic {S}
• Retry: block until alternate path possible
  atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
  when (condition) {S}

Builds on prior research:
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10
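As an illustration of how these constructs compose, here is a one-slot buffer in the deck's Transactional Java syntax. This is a sketch only: atomic, retry, and orelse are the language extensions above, not compilable standard Java, and OneSlotBuffer is a hypothetical class.

```
class OneSlotBuffer {
    Object slot;  // null = empty

    void put(Object x) {
        atomic {
            if (slot != null) retry;   // block until the slot is free
            slot = x;
        }
    }

    Object takeOrNull() {
        atomic {
            if (slot == null) retry;   // would block waiting for an item...
            Object x = slot;
            slot = null;
            return x;
        } orelse {
            return null;               // ...but orelse supplies an alternative
        }
    }
}
```

Because retry in the first orelse branch transfers control to the next branch, takeOrNull returns immediately instead of blocking when the buffer is empty.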
Transactional Java → Java
Transactional Java
atomic {
S;
}
STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
Standard Java + STM API
while (true) {
  TxnHandle th = txnStart();
  try {
    S’;
    break;
  } finally {
    if (!txnCommit(th))
      continue;
  }
}
JVM STM support
On-demand cloning of methods called inside transactions

Garbage collection support
• Enumeration of refs in read set, write set & undo log

Extra transaction record field in each object
• Supports both word & object granularity

Native method invocation throws exception inside transaction
• Some intrinsic functions allowed

Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)

Writes:
– strict two-phase locking
– update in place
– undo on abort

Reads:
– versioning
– validation before commit

Granularity per type
– Object-level: small objects
– Word-level: large arrays

Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
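A minimal, single-threaded sketch of this protocol: versioned, optimistic reads validated at commit, and pessimistic in-place writes rolled back from an undo log on abort. All names here (Cell, Txn, the stm* methods) are illustrative stand-ins, not the actual McRT-STM API, and locking of the exclusive state is omitted since the sketch is single-threaded.

```java
import java.util.ArrayList;
import java.util.List;

public class StmSketch {
    // A transactional memory word: a value plus its transaction record.
    // Odd record value = version number (shared state); the exclusive
    // (locked) state is not modeled in this single-threaded sketch.
    static class Cell {
        int value;
        int version = 1; // odd = shared / read-only
    }

    static class Txn {
        // Read set: (cell, version observed at read time).
        final List<Cell> readCells = new ArrayList<>();
        final List<Integer> readVersions = new ArrayList<>();
        // Undo log: (cell, original value) for in-place writes.
        final List<Cell> undoCells = new ArrayList<>();
        final List<Integer> undoValues = new ArrayList<>();

        int stmRd(Cell c) {            // log the version, return current value
            readCells.add(c);
            readVersions.add(c.version);
            return c.value;
        }

        void stmWr(Cell c, int v) {    // log the old value, update in place
            undoCells.add(c);
            undoValues.add(c.value);
            c.value = v;
        }

        boolean stmCommit() {
            // Validate: every version in the read set must be unchanged.
            for (int i = 0; i < readCells.size(); i++) {
                if (readCells.get(i).version != readVersions.get(i)) {
                    abort();
                    return false;
                }
            }
            // Bump versions of written cells to the next odd number.
            for (Cell c : undoCells) c.version += 2;
            return true;
        }

        void abort() {                 // roll back in-place writes, newest first
            for (int i = undoCells.size() - 1; i >= 0; i--) {
                undoCells.get(i).value = undoValues.get(i);
            }
        }
    }

    public static void main(String[] args) {
        Cell a = new Cell(); a.value = 4;
        Cell b = new Cell();
        Txn t = new Txn();             // atomic { B = A + 5; }
        t.stmWr(b, t.stmRd(a) + 5);
        System.out.println(t.stmCommit() + " " + b.value); // true 9
    }
}
```

Note how the design matches the slide's claims: reads and writes touch memory directly (no buffering), and only aborts pay the cost of copying old values back.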
Ensuring Atomicity: Novel Combination
Memory ops
Reads: optimistic concurrency
Writes: pessimistic concurrency

Optimistic reads: + caching effects, + avoids lock operations
Pessimistic writes: + in-place updates, + fast commits, + fast reads

Quantitative results in PPoPP’06
McRT-STM: Example
atomic {
  B = A + 5;
}

stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word

Two states:
• Shared (low bit is 1)
– Read-only / multiple readers
– Value is version number (odd)
• Exclusive (low bit is 0)
– Write-only / single owner
– Value is thread transaction descriptor (4-byte aligned)

Mapping
• Object: slot in object
• Field: hashed index into global record table
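A sketch of the word-granularity mapping and the low-bit state encoding. The table size and hash-mixing function here are assumptions for illustration; only the hash-of-(obj.hash, offset) scheme and the low-bit convention follow the slides.

```java
public class TxnRecordSketch {
    // Global table of transaction records (size is an arbitrary choice).
    static final int TABLE_SIZE = 1 << 16;
    static final long[] records = new long[TABLE_SIZE];

    // The hash is f(obj.hash, offset); this particular mix is illustrative.
    static int recordIndex(Object obj, int fieldOffset) {
        int h = System.identityHashCode(obj) * 31 + fieldOffset;
        return (h & 0x7fffffff) % TABLE_SIZE;
    }

    // Look up the transaction record guarding a given field.
    static long recordFor(Object obj, int fieldOffset) {
        return records[recordIndex(obj, fieldOffset)];
    }

    // Low bit 1: shared; the value is an odd version number.
    static boolean isShared(long rec) { return (rec & 1) == 1; }

    // Low bit 0: exclusive; the value is a 4-byte-aligned pointer
    // to the owning thread's transaction descriptor.
    static boolean isExclusive(long rec) { return (rec & 1) == 0; }
}
```

The 4-byte alignment is what makes the encoding unambiguous: a descriptor pointer always has its low bit clear, so any odd value must be a version number.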
Transaction Record: Example
Every data item has an associated transaction record.

Object granularity: an extra transaction record field (TxR) in each object, alongside the vtbl and the fields x, y.

Word granularity: object words hash into a global table of transaction records TxR1 … TxRn; the hash is f(obj.hash, offset).

class Foo {
  int x;
  int y;
}
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …

Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number

Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)

In an atomic region
– A read operation appends to the read set
– A write operation appends to the write set and undo log
– GC enumerates the read/write/undo logs
McRT-STM: Example
class Foo {
  int x;
  int y;
}
Foo bar, foo;

T1:
atomic {
  t = foo.x;
  bar.x = t;
  t = foo.y;
  bar.y = t;
}

T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

[Trace: initially foo = {x = 9, y = 7} with transaction record version 3, and bar = {x = 0, y = 0} with version 5. T1 logs the read <foo, 3>, acquires bar for writing with undo entries <bar.x, 0> and <bar.y, 0>, and updates bar in place. T2 logs the read <bar, 5> but then must wait on bar, which T1 holds exclusively. T1 commits and bumps bar's version to 7, so T2's validation of <bar, 5> fails and T2 aborts and retries.]

• T2 should read [0, 0] or [9, 7], never an intermediate state
Early Results: Overhead breakdown
[Chart: STM time breakdown (%) on a single processor for Binary tree, Hashtable, Linked list, and Btree; components: TLS access, STM write, STM commit, STM validate, STM read.]
STM read & validation overheads dominate
Good optimization targets
System Overview
Components: Polyglot (source compiler), StarJIT (dynamic compiler), ORP VM, McRT STM.
Compilation pipeline: Transactional Java → Java + STM API → Transactional STIR → Optimized T-STIR → Native Code.
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation → greater optimization opportunities
Representing Read/Write Barriers
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

stmWr(&a.x, t1);
stmWr(&a.y, t2);
if (stmRd(&a.z) == 0) {
  stmWr(&a.x, 0);
  stmWr(&a.z, t3);
}
Traditional barriers hide redundant locking/logging
An STM IR for Optimization
Redundancies exposed:
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(&a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z == 0) {
  txnOpenForWrite(a);
  txnLogObjectInt(&a.x, a);
  a.x = 0;
  txnOpenForWrite(a);
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}
Optimized Code
atomic {
  a.x = t1;
  a.y = t2;
  if (a.z == 0) {
    a.x = 0;
    a.z = t3;
  }
}

txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnLogObjectInt(&a.y, a);
a.y = t2;
if (a.z == 0) {
  a.x = 0;
  txnLogObjectInt(&a.z, a);
  a.z = t3;
}
Fewer & cheaper STM operations
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in the presence of nesting

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
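To illustrate the transaction-local idea: an object allocated inside the current transaction needs no undo logging, because an abort discards the object entirely. The sketch below checks locality dynamically with a hypothetical txnNew/needsUndoLog pair; the paper's optimization instead proves transaction-locality at compile time and removes the barriers statically.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class TxnLocalSketch {
    // Objects allocated since txnStart (identity-based, since locality
    // is a property of the object instance, not its contents).
    static final Set<Object> txnLocal =
        Collections.newSetFromMap(new IdentityHashMap<>());

    // Hypothetical hook at allocation sites inside a transaction.
    static <T> T txnNew(T freshObject) {
        txnLocal.add(freshObject);
        return freshObject;
    }

    // Write barrier decision: skip undo logging for transaction-local
    // objects; abort simply drops them.
    static boolean needsUndoLog(Object target) {
        return !txnLocal.contains(target);
    }

    // On commit or abort, the set would be cleared for the next transaction.
    static void txnEnd() { txnLocal.clear(); }
}
```

The immutable-field optimization from the same bullet list works analogously, but removes read barriers: a field that can never change (e.g. the vtable, String contents) cannot cause a conflict, so it needs no open-for-read.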
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four processors)

Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x overhead on 1P. With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization

[Chart: % overhead on 1P (0–90%) for HashMap and TreeMap; bars: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining.]
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe, no concurrency control

Synchronized
• Coarse-grain synchronization via the SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides the bucket array into 16 segments (size / locking)

Atomic
• Transactional version via an “AtomicMap” wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment, à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
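The "Atomic Prime" hand optimization described above can be sketched in plain Java: instead of one shared size field (a conflict hot spot under STM), keep one counter per segment and sum on demand. SegmentedSize and its method names are illustrative, not from the paper's code.

```java
public class SegmentedSize {
    // One counter per segment; threads updating different segments
    // touch different words, so their transactions do not conflict.
    private final int[] counts;

    SegmentedSize(int segments) {
        counts = new int[segments];
    }

    private int segmentFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % counts.length;
    }

    // Called (inside the atomic block) when an entry is added/removed.
    void onInsert(Object key) { counts[segmentFor(key)]++; }
    void onRemove(Object key) { counts[segmentFor(key)]--; }

    // size() still reads every counter, so it conflicts with all updates;
    // the win is that updates no longer conflict with each other.
    int size() {
        int s = 0;
        for (int c : counts) s += c;
        return s;
    }
}
```

This mirrors the per-segment counting that ConcurrentHashMap's JSR 166 implementation uses, which is why the slides call the change "minor hand optimization" rather than a redesign.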
Scalability: 100% Gets
The Atomic wrapper is competitive with ConcurrentHashMap, and the effect of the compiler optimizations scales.

[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); series: Unsafe, Synchronized, Concurrent, Atomic (No Opt), Atomic.]
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on its 16 segments; Atomic still scales.

[Chart: speedup over 1P Unsafe (0–16) vs. # of processors (0–16); series: Synchronized, Concurrent, Atomic (No Opt), Atomic.]
20% Inserts and Removes
Atomic conflicts on the entire bucket array, because the array is a single object.

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Atomic.]
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic.]
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck: no degradation and a modest performance gain.

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime.]
20% Inserts and Removes: Mixed-Level
Mixed-level preserves the wins and reduces overheads:
– word-level for arrays
– object-level for non-arrays

[Chart: speedup over 1P Unsafe (0–3) vs. # of processors (0–16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime.]
Scalability: java.util.TreeMap
Results similar to HashMap.

[Chart 1: 100% Gets — scalability (0–16) vs. # of processors (0–16); series: Unsafe, Synchronized, Atomic.]
[Chart 2: 80% Gets — scalability (0–1.2) vs. # of processors (0–16); series: Synchronized, Atomic, Atomic Prime.]
Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization.

Operations & traversal over a synthetic database.

[Chart: scalability (0–6) vs. # of processors (0–16); series: Atomic, Synch (Coarse), Synch (Med.), Synch (Fine).]
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot

Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give competitive performance to complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention

Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support