Consistency Oblivious Programming

Hillel Avni, Tel Aviv University

Agenda

Transactional Memory and Locking

Consistency Oblivious Programming (COP)

COP with STM

COP With HTM

Future Work

Global Lock

Easy to use

Composable - Concatenate critical sections

Not scalable

Fine Grain Locking

Hard to use

Not Composable

Scalable

Lazy linked list is a good example…

Lazy Traversal

[Figure: sorted list a, b, d, e]

add(c): "Aha!"

Lock and Validate

[Figure: sorted list a, b, d, e]

add(c): "Yes, b still points to d"

Perform Updates and Release Locks

[Figure: sorted list a, b, d, e]

add(c)
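The three steps above (lazy traversal, lock and validate, update and release) can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the talk: the names `LazyNode`, `LazyList`, and `validate` are hypothetical, only `add` and `contains` are shown, and nodes are never freed.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <mutex>

// Hypothetical node of a sorted list with head and tail sentinels.
struct LazyNode {
    int key;
    std::atomic<LazyNode*> next;
    std::atomic<bool> marked{false};  // a remove() (not shown) would set this
    std::mutex lock;
    LazyNode(int k, LazyNode* n) : key(k), next(n) {}
};

struct LazyList {
    LazyNode* tail = new LazyNode(INT_MAX, nullptr);
    LazyNode* head = new LazyNode(INT_MIN, tail);

    // Validation: both nodes are still in the list and still adjacent.
    static bool validate(LazyNode* pred, LazyNode* curr) {
        return !pred->marked && !curr->marked && pred->next == curr;
    }

    bool add(int key) {
        while (true) {
            // 1. Lazy traversal: no locks are taken.
            LazyNode* pred = head;
            LazyNode* curr = head->next;
            while (curr->key < key) { pred = curr; curr = curr->next; }
            // 2. Lock both nodes (always in list order) and validate.
            std::lock_guard<std::mutex> lp(pred->lock);
            std::lock_guard<std::mutex> lc(curr->lock);
            if (!validate(pred, curr)) continue;  // retry from the head
            // 3. Perform the update; the locks are released on scope exit.
            if (curr->key == key) return false;   // already present
            pred->next = new LazyNode(key, curr);
            return true;
        }
    }

    bool contains(int key) const {
        LazyNode* curr = head;
        while (curr->key < key) curr = curr->next;
        return curr->key == key && !curr->marked;
    }
};
```

With the keys 1, 2, 4, 5 standing in for a, b, d, e, `add(3)` plays the role of add(c): the traversal stops with pred = 2 and curr = 4, and validation confirms that 2 still points to 4 before the new node is linked in.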

Transactional Memory

Easy to use

Composable

Scalable

How is it done?

Java (Deuce)

    bool CAS(int location, int expected, int new_val) {
        atomic {
            if (location != expected) return false;
            location = new_val;
        }
        return true;
    }

C/C++ (GCC-4.7)

    bool CAS(int *location, int expected, int new_val) {
        __transaction_atomic {
            if (*location != expected) return false;
            *location = new_val;
        }
        return true;
    }

Software Transactional Memory

Different algorithms are used for:

• consistency checking

• rollback

Compiler recognizes shared accesses.

STM Problem - Overhead

    template <typename V> static V load(const V* addr, ls_modifier mod)
    {
      // Read-for-write: acquire the location for writing first.
      if (unlikely(mod == RfW))
        {
          pre_write(addr, sizeof(V));
          return *addr;
        }
      // Read-after-write: the location is already owned by this transaction.
      if (unlikely(mod == RaW))
        return *addr;

      gtm_thread *tx = gtm_thr();
      gtm_rwlog_entry* log = pre_load(tx, addr, sizeof(V));

      // Load the data, then fence before re-validating.
      V v = *addr;
      atomic_thread_fence(memory_order_acquire);

      post_load(tx, log);
      return v;
    }

load function from GCC 4.8.1

STM Problem - Overhead

    static gtm_rwlog_entry* pre_load(gtm_thread *tx, const void* addr, size_t len)
    {
      size_t log_start = tx->readlog.size();
      gtm_word snapshot = tx->shared_state.load(memory_order_relaxed);
      gtm_word locked_by_tx = ml_mg::set_locked(tx);

      size_t orec = ml_mg::get_orec(addr);
      size_t orec_end = ml_mg::get_orec_end(addr, len);
      do
        {
          gtm_word o = o_ml_mg.orecs[orec].load(memory_order_acquire);

          if (likely (!ml_mg::is_more_recent_or_locked(o, snapshot)))
            {
              success:
              gtm_rwlog_entry *e = tx->readlog.push();
              e->orec = o_ml_mg.orecs + orec;
              e->value = o;
            }
          else if (!ml_mg::is_locked(o))
            {
              snapshot = extend(tx);
              goto success;
            }
          else
            {
              if (o != locked_by_tx)
                tx->restart(RESTART_LOCKED_READ);
            }
          orec = o_ml_mg.get_next_orec(orec);
        }
      while (orec != orec_end);
      return &tx->readlog[log_start];
    }

load always calls pre_load

STM Problem - Overhead

    static void post_load(gtm_thread *tx, gtm_rwlog_entry* log)
    {
      for (gtm_rwlog_entry *end = tx->readlog.end(); log != end; log++)
        {
          gtm_word o = log->orec->load(memory_order_relaxed);
          if (log->value != o)
            tx->restart(RESTART_VALIDATE_READ);
        }
    }

and post_load

Compare to mov eax, [ebx] on x86

Hardware Transactional Memory

Exploit native cache coherence for:

• consistency checking

• rollback

HTM Problem – Resources

A transaction cannot commit if it is:

• too big: cache size limits data footprint

• too slow: quantum size limits duration

All TM Problem – False Conflicts

Any address that was encountered during the transaction is monitored until the end of that transaction.

An address may abort a transaction long after it is no longer relevant…

COP Operation

• In non-transactional mode:
  – Execute the read-only prefix of the operation and record its output.

• In transactional mode:
  – Verify output is correct.
  – Perform updates.

COP Example – RB Tree

Add 26 – Tree Unbalanced

TM Search 26

Tree Balanced

TM Search continues from 27

Conflict and Abort

Add 26 – Tree Unbalanced

COP Search 26

Tree Balanced

Search continues from 27

Found

COP RB-Tree Verify

To facilitate verification:

• all nodes in the RB-Tree are connected in a successor-predecessor doubly linked list, and each node has a live mark.

• Search returns a node n with k or a leaf with k’s successor or predecessor.

COP RB-Tree Suffix

• Resume a transaction

• Verify:
  – k found and n is live – done.
  – k not found, check:

• (n.k > k > n.pred.k && !n.left) or (n.k < k < n.succ.k && !n.right)

• If verification failed – abort the transaction.

• Complete updates, add / remove / rebalance, using n.
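The verification step can be written as a small predicate. The sketch below is an assumption-laden illustration, not the authors' code: `RbNode` and `cop_verify` are hypothetical names, a missing predecessor or successor is treated as an open bound, and the live mark is also checked when k is not found. The child that must be absent is the one on the side where k would be attached (the left child when k < n.k, the right child when k > n.k), since that is where an ordinary binary-tree search for k would have stopped.

```cpp
#include <cassert>

// Hypothetical node layout; only the fields needed for verification.
struct RbNode {
    int k = 0;
    RbNode* left = nullptr;
    RbNode* right = nullptr;
    RbNode* pred = nullptr;  // predecessor in key order
    RbNode* succ = nullptr;  // successor in key order
    bool live = true;        // cleared when the node is removed
};

// True if n, as returned by the unsynchronized search, is still a valid
// result for key k. Intended to run inside the transaction.
bool cop_verify(const RbNode* n, int k) {
    if (!n->live) return false;
    if (n->k == k) return true;  // k found and n is live
    if (k < n->k)                // n holds k's successor
        return n->left == nullptr && (n->pred == nullptr || n->pred->k < k);
    // n holds k's predecessor
    return n->right == nullptr && (n->succ == nullptr || k < n->succ->k);
}
```

The predicate reads only n and one neighbor, so the transactional read set stays constant in size no matter how long the preceding search path was.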

COP Template for op

start-transaction
    any-code
    suspend-transaction
        output = op-rop();
    resume-transaction
    if (not(op-verify(output)))
        abort-transaction
    op-complete(output)
    any-code
end-transaction
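To make the template concrete, here is a minimal C++ sketch of a COP insert and remove on a sorted linked list. It rests on loud assumptions: a global `std::mutex` stands in for the transaction (standard C++ has no transactional construct), running `rop` before taking the mutex models the suspended section, and a failed verification retries instead of aborting an enclosing transaction. The names `CopNode`, `CopSet`, `rop`, `verify`, and `complete` are hypothetical and simply mirror op-rop, op-verify, and op-complete.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <mutex>

struct CopNode {
    int key;
    std::atomic<CopNode*> next;
    std::atomic<bool> live{true};
    CopNode(int k, CopNode* n) : key(k), next(n) {}
};

// Keys are assumed to lie strictly between INT_MIN and INT_MAX.
struct CopSet {
    CopNode* tail = new CopNode(INT_MAX, nullptr);
    CopNode* head = new CopNode(INT_MIN, tail);
    std::mutex tx;  // stand-in for start-transaction / end-transaction

    // op-rop: read-only prefix, runs with no synchronization.
    CopNode* rop(int key) const {
        CopNode* pred = head;
        for (CopNode* curr = pred->next; curr->key < key; curr = curr->next)
            pred = curr;
        return pred;
    }
    // op-verify: pred is still in the list and still directly precedes key.
    static bool verify(CopNode* pred, int key) {
        return pred->live && pred->key < key && pred->next.load()->key >= key;
    }
    // op-complete for add.
    static bool complete_add(CopNode* pred, int key) {
        CopNode* curr = pred->next;
        if (curr->key == key) return false;
        pred->next = new CopNode(key, curr);
        return true;
    }
    bool add(int key) {
        while (true) {
            CopNode* output = rop(key);          // outside the "transaction"
            std::lock_guard<std::mutex> g(tx);   // inside the "transaction"
            if (!verify(output, key)) continue;  // "abort" and retry
            return complete_add(output, key);
        }
    }
    bool remove(int key) {
        while (true) {
            CopNode* pred = rop(key);
            std::lock_guard<std::mutex> g(tx);
            if (!verify(pred, key)) continue;
            CopNode* curr = pred->next;
            if (curr->key != key) return false;
            curr->live = false;              // later verifications will reject it
            pred->next = curr->next.load();  // node is leaked: readers may still hold it
            return true;
        }
    }
};
```

Only `verify` and the update touch shared state inside the protected region; the traversal in `rop` may see an inconsistent list, which is acceptable because its output is checked before use.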

COP Correctness

The underlying TM:
• Transactional Regular Registers

The COP algorithm:
• Obliviousness
• Verifiability
• Separation

We prove that if the TM yields transactional regular registers, and the COP algorithm demonstrates obliviousness, verifiability, and separation, then the COP operation is linearizable.

STM Algorithm

• The GCC default STM algorithm is the one that proved to be the most efficient and scalable in most scenarios:
  – Write Through (WT)
  – Encounter Time Locking (ETL)
  – Multi Lock (ML)

STM: WT – ETL - ML

1. RV = Shared Version Clock
2. On read: check unlocked and v# <= RV, then add to read-set
3. On write: check v# <= RV, lock, and add to undo-set
4. WV = F&I(VClock)
5. Validate that in the read-set each v# <= RV
6. Release locks with v# = WV

[Figure: Mem and Locks columns (versions 87, 99, 50; lock bit 0), Shared Version Clock (100, 120, 121), RV = 100, Commit]
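The six steps can be traced in a small C++ sketch. This is an illustration under stated assumptions, not GCC's libitm code: the names `Tx`, `vclock`, `orec`, `mem`, and `Abort` are hypothetical, there is one versioned lock per word with the low bit as the lock flag, an abort is signalled by a C++ exception, and a rollback releases its locks with a fresh version so that a concurrent reader that saw the old version cannot accept a value written through before the rollback.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

constexpr std::size_t kWords = 4;
std::atomic<std::uint64_t> vclock{0};           // shared version clock
std::atomic<std::uint64_t> orec[kWords] = {};   // (version << 1) | locked
std::atomic<int> mem[kWords] = {};

struct Abort : std::runtime_error { Abort() : std::runtime_error("abort") {} };

struct Tx {
    std::uint64_t rv = vclock.load();                // step 1
    std::vector<std::size_t> reads;                  // read-set
    std::vector<std::pair<std::size_t, int>> undo;   // undo-set; doubles as lock set

    bool owns(std::size_t i) const {
        for (const auto& u : undo) if (u.first == i) return true;
        return false;
    }
    void rollback() {
        std::uint64_t nv = vclock.fetch_add(1) + 1;
        for (auto it = undo.rbegin(); it != undo.rend(); ++it) {
            mem[it->first] = it->second;     // restore the old value
            orec[it->first].store(nv << 1);  // unlock with a fresh version
        }
        undo.clear();
        reads.clear();
    }
    [[noreturn]] void fail() { rollback(); throw Abort(); }

    int read(std::size_t i) {                        // step 2
        if (owns(i)) return mem[i];
        std::uint64_t o = orec[i].load();
        if ((o & 1) || (o >> 1) > rv) fail();
        int v = mem[i];
        if (orec[i].load() != o) fail();             // changed while reading
        reads.push_back(i);
        return v;
    }
    void write(std::size_t i, int v) {               // step 3
        if (!owns(i)) {
            std::uint64_t o = orec[i].load();
            if ((o & 1) || (o >> 1) > rv ||
                !orec[i].compare_exchange_strong(o, o | 1))
                fail();
            undo.emplace_back(i, mem[i].load());
        }
        mem[i] = v;                                  // write through
    }
    void commit() {
        std::uint64_t wv = vclock.fetch_add(1) + 1;  // step 4
        for (std::size_t i : reads) {                // step 5
            std::uint64_t o = orec[i].load();
            if (!owns(i) && ((o & 1) || (o >> 1) > rv)) fail();
        }
        for (const auto& u : undo) orec[u.first].store(wv << 1);  // step 6
        undo.clear();
        reads.clear();
    }
};
```

Every `read` pays for two loads of the versioned lock plus a read-set append, which is the per-access cost the overhead slides point at.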

GCC Constructs

__transaction_atomic{}: Mark the transaction.

__transaction_cancel: Explicit abort.

__attribute__((transaction_safe)): Instrument the code.

__attribute__((transaction_pure)): Do not instrument the code.

We will show this attribute can be used efficiently as __transaction_suspend with the WT – ETL – ML default STM algorithm in GCC.

pure = suspend

• Transactional Regular Registers – All values up to one architecture-word size are written and read atomically. The rollback may use memcpy, but the memcpy is optimized to write at maximal alignment.

• Now we will compare the future Power architecture HTM suspended mode to transaction_pure with the WT-ETL-ML STM algorithm.

Power tsuspend - tresume

1. Until failure occurs, load instructions that access memory locations that were transactionally written by the same thread will return the transactionally written data.

2. In the event of transaction failure, failure recording is performed, but failure handling is deferred until transactional execution is resumed.

3. The initiation of a new transaction is prevented.

4. Store instructions that access memory locations that have been accessed transactionally (due to load or store) by the same thread will cause the transaction to fail.

[Chart: RB-Tree, 1M size, 20% updates, 10 operations per transaction]

[Chart: RB-Tree, 1K size, 8 threads, 20% updates]

Haswell HTM with COP

There is no suspend mode, so to compose COP operations, we execute all ROP before the transaction. This limits the composition to at most one writing COP operation per transaction.

Capacity and Cache Associativity

Packed Memory Array (PMA) search is done by divide and conquer. Assume the PMA size is 0x800000 and it starts at address 0. A search for an item located in the address range 0x0…0x7FFF must go through the addresses:

0x400000 0x200000 0x100000 0x80000

0x40000 0x20000 0x10000 0x8000

As the cache size in Haswell is 0x8000, all these addresses have the same cache index (0), and the transaction will always abort.
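The index arithmetic can be checked with a few lines of C++. The geometry used here is an assumption about the L1 data cache (0x8000 bytes, 8 ways, 64-byte lines, hence 64 sets), and the helper names `set_index`, `probes`, and `max_lines_per_set` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Assumed L1 data cache geometry: 0x8000 bytes, 8-way, 64-byte lines.
constexpr std::uint64_t kLine = 64, kWays = 8, kCache = 0x8000;
constexpr std::uint64_t kSets = kCache / (kLine * kWays);  // 64 sets

constexpr std::uint64_t set_index(std::uint64_t addr) {
    return (addr / kLine) % kSets;
}

// Midpoints probed by a divide-and-conquer search over [0, size) that
// always descends into the lower half, down to a half-width of `stop`.
std::vector<std::uint64_t> probes(std::uint64_t size, std::uint64_t stop) {
    std::vector<std::uint64_t> out;
    for (std::uint64_t mid = size / 2; mid >= stop; mid /= 2)
        out.push_back(mid);
    return out;
}

// Largest number of probed addresses that map to a single cache set.
std::size_t max_lines_per_set(const std::vector<std::uint64_t>& addrs) {
    std::map<std::uint64_t, std::size_t> count;
    std::size_t worst = 0;
    for (std::uint64_t a : addrs)
        worst = std::max(worst, ++count[set_index(a)]);
    return worst;
}
```

Under this geometry every address that is a multiple of 0x1000 maps to set 0. The eight probes from 0x400000 down to 0x8000 therefore already occupy all eight ways of that set, and the next probes (0x4000, 0x2000, 0x1000) map there as well, so a transaction that must keep its whole footprint in this cache cannot hold them all.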

RB-Tree Capacity Aborts

RB-Tree Conflict Aborts

Data Structures

We already have COP versions of:
• RB-Tree
• Linked list
• PMA
• Cache Oblivious B-Tree
• Leaplist (k-ary skip list, tailored for range queries)

Can we design more COP data structures?

Applications

Use COP in applications.

Many applications use shared data structures, so it is interesting to see the impact of COP on their performance.

Infrastructure

Add statistics (transactional accesses, conflicts) to GCC.

Add real suspend-mode to GCC, hardware.

Theory

How to make the transformation to COP automatic?

Is COP applicable outside the data-structures area?

Bounds on the amount of transactional accesses?

Bounds on the amount of false conflicts?

Thank You
