programming multi-core systems 周枫 网易公司 2007-12-28 清华大学

75
Programming Multi-Core Systems 周周 周周周周 2007-12-28 周周周周 http://zhoufeng.net

Upload: patience-rebecca-fox

Post on 04-Jan-2016

277 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Programming Multi-Core Systems

周枫网易公司

2007-12-28 清华大学http://zhoufeng.net

Page 2: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Last Week

• A trend in software: “The Movement to Safe Languages”

• Improving dependability of system software without using safe languages

2

Page 3: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

A Trend in Hardware

“The Movement to Multi-core Processors”

• Originates from inability to increase processor clock rate

• Profound impact on architecture, OS and applications

3

Topic today: How to program multi-core systems effectively?

Page 4: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Why Multi-Core?

• Areas of improving CPU performance in last 30 years

1. Clock speed

2. Execution optimization (Cycles-Per-Instruction)

3. Cache

• All 3 are concurrency-agnostic

• Now: 1 disappears, 2 slows down, 3 still good

– 1: No 4Ghz Intel CPUs

– 2: Improves with new micro-architecture (e.g. Core vs. NetBurst)

4

Page 5: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

CPU Speed History

5From “Computer Architecture: A Quantitative Approach”, 4th edition, 2007

Page 6: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Reality Today

• Near-term performance drivers

1. Multicore

2. Hardware threads (a.k.a. Simultaneous Multi-Threading)

3. Cache

• 1 and 2 useful only when software is concurrent

6

“Concurrency is the next revolution in how we write software” --- “The Free Lunch is Over”

Page 7: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Costs/Problems of Concurrency

• Overhead of locks, message passing…

• Not all programs are parallelizable

• Programming concurrently is HARD

– Complex concepts: mutex, read-write lock, queue…

– Correct synchronization: race, deadlocks…

– Getting speed-up

7

Our focus today

Page 8: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Potential Multi-Core Apps

8

Application Category Examples

Server apps w/o shared state Apache web server

Server apps with shared state MMORPG game server

Stream-Sort data processing MapReduce, Yahoo Pig

Scientific computing (many different models)

BLAS, Monte Carlo, N-Body …

Machine learning HMM, EM algorithm

Graphics and games NVIDIA Cg, GPU computing

…… ……

Page 9: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Roadmap

• Overview of multi-core computing

• Overview of transactional programming

• Case: Transactions for safe/managed languages

• Case: Transactions for languages like C

9

Page 10: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Status Quo in Synchronization

• Current mechanism: manual locking

• Organization: lock for each shared structure

• Usage: (block) acquire access release

• Correctness issues

– Under-locking data races

– Acquires in different orders deadlock

• Performance issues

– Difficult to find right granularity

– Overhead of acquiring vs. allowed concurrency10

Page 11: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Transactions / Atomic Sections

• Databases has provided automatic concurrency control for 30 years: ACID transactions

• Vision:

– Atomicity

– Isolation

– Serialization only on conflicts

– (optional) Rollback/abort support

11

Question: Is it possible to provide database transactionsemantics to general programming?

Page 12: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Transactions vs. Manual Locks

• Manual locking issues:

– Under-locking

– Acquires in different orders

– Blocking

– Conservative serialization

12

• How transactions help:

– No explicit locks

– No ordering

– Can cancel transactions

– Serialization only on conflicts

Transactions: Potentially simpler and more efficient

Page 13: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Design Space

• Hardware Transactional Memory vs. software TM

• Granularity: object, word, block

• Update method

– Deferred: discard private copy on aborts

– Direct: control access to data, erase update on aborts

• Concurrency control

– Pessimistic: prevent conflicts by locking

– Optimistic: assumes no conflict and retry if there is

• …13

Page 14: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Hard Issues

• Communications, or side-effects

– File I/O

– Database accesses

• Interaction with other abstractions

– Garbage collection

– Virtual memory

• Work with existing languages, synchronization primitives, …

14

Page 15: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Why software TM?

• More flexible

• Easier to modify and evolve

• Integrate better with existing systems and languages

• Not limited to fixed-size hardware structures, e.g. caches

15

Page 16: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Proposed Language Support

Guard condition

16

// Insert into a doubly-linked list atomicallyatomic { newNode->prev = node; newNode->next = node->next; node->next->prev = newNode; node->next = newNode; }

atomic (queueSize > 0) { // remove item from queue and use it }

Page 17: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Transactions forManaged/Safe Languages

17

Page 18: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Intro

• Optimizing Memory Transactions,Harris et al. PLDI’06

• STM system from MSR Cambridge, for MSIL (.Net)

• Design features

– Object-level granularity

– Direct update

– Version numbers to track reads

– 2-phase locking to track write

– Compiler optimizations

18

Page 19: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

First-gen STM

• Very expensive. E.g. Harris+Fraser, OOPSLA ’03):

– Every load and store instruction logs information into a thread-local log

– A store instruction writes the log only

– A load instruction consults the log first

– At the end of the block: validate the log; and atomically commit it to shared memory

19

Page 20: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Direct Update STM

• Augment objects with (i) a lock, (ii) a version number

• Transactional write:

– Lock objects before they are written to (abort if another thread has that lock)

– Log the overwritten data – we need it to restore the heap case of retry, transaction abort, or a conflict with a concurrent thread

– Make the update in place to the object

20

Page 21: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

• Transactional read:

– Log the object’s version number

– Read from the object itself

• Commit:

– Check the version numbers of objects we’ve read

– Increment the version numbers of object we’ve written, unlocking them

21

Page 22: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

22

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

T2’s log:

Page 23: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

23

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

T2’s log:

T1 reads from c1: logs that it saw

version 100

Page 24: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

24

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

c1.ver=100

T2’s log:

T2 also reads from c1: logs that it saw

version 100

Page 25: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

25

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100c2.ver=200

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

c1.ver=100

T2’s log:

Suppose T1 now reads from c2, sees it

at version 200

Page 26: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

26

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 10

c1

c1.ver=100lock: c1, 100

T2’s log:

Before updating c1, thread T2 must lock it: record old

version number

locked:T2

Page 27: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

27

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1

c1.ver=100lock: c1, 100c1.val=10

T2’s log:(2) After logging the old

value, T2 makes its update in place to c1

locked:T2

(1) Before updating c1.val, thread T2 must log the data

it’s going to overwrite

Page 28: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

28

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1

c1.ver=100lock: c1, 100c1.val=10

T2’s log:

ver=101

(2) T2’s transaction commits successfully. Unlock the object,

installing the new version number

(1) Check the version we locked matches the version

we previously read

Page 29: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

29

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1

c1.ver=100lock: c1, 100c1.val=10

T2’s log:

ver=101

(1) T1 attempts to commit. Check the versions it read are still up-to-date.

(2) Object c1 was updated from version 100 to 101, so T1’s transaction is

aborted and re-run.

Page 30: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

30

Compiler integration

• We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL)

– OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples)

– OpenForUpdate – before the first time we update an object

– LogOldValue – before the first time we write to a given field

atomic { … t += n.value; n = n.next; …}

OpenForRead(n);t = n.value;OpenForRead(n);n = n.next;

OpenForRead(n);t = n.value;n = n.next;

Source code Basic intermediate code Optimized intermediate code

Page 31: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

31

Runtime integration – garbage collection

atomic { …

}

1. GC runs while some threads are in atomic blocks

obj1.field = oldobj2.ver = 100obj3.locked @ 100

Page 32: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

32

Runtime integration – garbage collection

atomic { …

}

1. GC runs while some threads are in atomic blocks

obj1.field = oldobj2.ver = 100obj3.locked @ 100

2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed

Page 33: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

33

Runtime integration – garbage collection

atomic { …

}

1. GC runs while some threads are in atomic blocks

obj1.field = oldobj2.ver = 100obj3.locked @ 100

2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed

3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

Page 34: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

34

Runtime integration – garbage collection

atomic { …

}

1. GC runs while some threads are in atomic blocks

obj1.field = oldobj2.ver = 100obj3.locked @ 100

2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed

3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds

Page 35: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

35

Results: Against Previous WorkN

orm

alis

ed e

xecu

tion

time

Sequential baseline (1.00x)

Coarse-grained locking (1.13x)

Fine-grained locking (2.57x)

Direct-update STM (2.04x)

Direct-update STM + compiler integration

(1.46x)

Harris+Fraser WSTM (5.69x)

Workload: operations on a red-black tree, 1

thread, 6:1:1 lookup:insert:delete mix

with keys 0..65535

Scalable to multicore

Page 36: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Scalability (µ-benchmark)

36#threads

Fine-grained locking

Direct-update STM + compiler integration

WSTM (atomic blocks)

Coarse-grained locking

Mic

rose

cond

s pe

r op

erat

ion

DSTM (API)

OSTM (API)

Page 37: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

37

Results: long running tests2

.04

2.1

1

6.0

5

5.8

4

3.9

1

2.0

8

2.0

5

4.3

2

3.1

6

2.6

1

1.7

2

1.6

8

1.0

0

1.0

0

1.0

0

1.0

0

1.0

0

0

1

2

3

4

5

6

7

1 2 3 4 5tree goskip merge-sort xlisp

Norm

alis

ed e

xecu

tion t

ime

10.8 73 16

2

Direct-update STM

Run-time filteringCompile-time optimizations

Original application (no tx)

Page 38: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Summary

• A pure software implementation can perform well and scale to vast transactions

– Direct update

– Pessimistic & optimistic CC

– Compiler support & optimizations

• Still need a better understanding of realistic workload distributions

38

Page 39: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Atomic Sections for C-like Languages

39

Page 40: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

40

Intro

• Autolocker: Synchronization Inference for Atomic SectionsBill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006

• Question: Can we have language support for atomic sections for languages like C?

– No object meta-data

– No type safety

– No garbage collection

• Answer: Let programmer help a bit (w/ annotations)

– And do a simpler one: NO aborts!

Page 41: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

41

Autolocker: C + Atomic Sections

• Shared data is protected by annotated locks

• Threads access shared data in atomic sections:

• Threads never deadlock (due to Autolocker)

• Threads never race for protected data

• How can we implement this semantics?

mutex m;int shared_var protected_by(m);atomic { ... x = shared_var; ... }

Code runs as if a single lock protects all atomic sections

Page 42: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

42

Autolocker Transformation

• Autolocker is a source-to-source transformation

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Autolocker code C code

Page 43: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

43

Autolocker Transformation

• Autolocker is a source-to-source transformation

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Atomic sections can be nested arbitrarilyThe nesting level is tracked at runtime

Autolocker code C code

Page 44: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

44

Autolocker Transformation

• Autolocker is a source-to-source transformation

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Locks are acquired as neededLock acquisitions are reentrant

Autolocker code C code

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

Page 45: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

45

Autolocker Transformation

• Autolocker is a source-to-source transformation

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Locks are released when outermost section endsStrict two-phase locking: guarantees atomicity

Autolocker code C code

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

Page 46: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

46

Autolocker Transformation

• Autolocker is a source-to-source transformation

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Locks are acquired in a global orderAcquiring locks in order will never lead to deadlock

Autolocker code C code

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

Page 47: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

47

Outline

• Introduction

– semantics

– usage model and benefits

• Autolocker algorithm

– match locks to data

– order lock acquisitions

– insert lock acquisitions

• Related work

• Experimental evaluation

• Conclusion

Page 48: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

48

finish heremany fine-grained locks

low contentionhigh parallelism

start hereone coarse-grained lock

lots of contentionlittle parallelism

• Typical process for writing threaded software:

• Linux kernel evolved to SMP this way

• Autolocker helps you program in this style

Autolocker Usage Model

shared data

threads

lock

Page 49: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

49

Granularity and Autolocker

• In Autolocker, annotations control performance:

• Simpler than fixing all uses of shared_var

• Changing annotations won’t introduce bugs

– no deadlocks

– no new race conditions

int shared_var protected_by(kernel_lock);

int shared_var protected_by(fs->lock);

Page 50: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

50

Outline

• Introduction

– semantics

– usage model and benefits

• Autolocker algorithm

– match locks to data

– order lock acquisitions

– insert lock acquisitions

• Related work

• Experimental evaluation

• Conclusion

Page 51: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

51

generate globallock order

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

match locks to data

Page 52: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

52

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

match locks to data

generate globallock order

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

Page 53: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

53

Matching Locks to Data

foo bar

quxbaz

functions

C program withatomic sections

void foo() { atomic { use m2; y = 3; use m1; x = 2; }}

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

Symbol table

use L ≡

the lock L must already be held when this statement is reached

Page 54: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

54

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

generate globallock order

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

match locks to data

Page 55: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

55

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:

void foo() { atomic { use m2; y = 3; use m1; x = 2; }}

global lock order

We must acquire m2…

m1 < m2

Page 56: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

56

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:

void foo() { atomic { use m2; y = 3; use m1; x = 2; }}

global lock order

We must acquire m2…

... but since m1 is needed later

… and it’s ordered first

m1 < m2

Page 57: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

57

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:

void foo() { atomic { acquire m1; acquire m2; y = 3; use m1; x = 2; }}

m1 < m2

global lock order

We must acquire m2…

... but since m1 is needed later

… and it’s ordered first

… we acquire m1 first

Page 58: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

58

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:

void foo() { atomic { acquire m1; acquire m2; y = 3; use m1; x = 2; }}

global lock order

Preemptive acquiresPreemptively take locks that:• are used later• but ordered before

m1 < m2

Page 59: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

59

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:

void foo() { atomic { acquire m1; acquire m2; y = 3; acquire m1; x = 2; }}

global lock order

m1 < m2

Eventually we’ll optimizeaway this acquisition

Page 60: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

60

match locks to data

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

generate globallock order

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

Page 61: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

61

• Some lock orders break the insertion algorithm!

• We say that “p->m < m” is not a feasible order

• So how do we find a feasible lock order?

What Is a Good Lock Order?

use m;p = p->next;use p->m;

Assume p->m < macquire p->m;acquire m;p = p->next;acquire p->m;

If “p->m < m” then m cannot be acquired before any lock called p->m

… this lock has never been acquired before.It is acquired after m!

Because of the assignment…via insertion algo

Page 62: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

62

Finding a Feasible Lock Order

• We want only feasible orders

• We search for any code matching this pattern:

• Any feasible order must ensure e1 < e2

• Scan through all code to find constraints

use e1;…maykill e2;…use e2;

e2 cannot be acquired before the update.It’s value is different above there.

Signifies anything that may affect e2’s value(like “p = p->next” when e2 = “p->m”)

e1 cannot be acquired after it is used.

Page 63: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

63

Finding a Feasible Lock Order

foo bar

quxbaz

functions

Constraints

m1 < p->m’

p->m < p->m’

m1 < r->mm3 < p->m

q->m < p->m

m1r->mq->mp->mp->m’

Global Lock Order

search forinfeasiblepatterns

topological sort

Page 64: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

64

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

generate globallock order

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

match locks to data

Page 65: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

65

Outline

• Introduction

– semantics

– usage model and benefits

• Autolocker algorithm

– computing protections

– lock ordering

– acquisition placement

• Related work

• Experimental evaluation

• Conclusion

Page 66: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

66

Comparison

• Transactional memory with optimistic CC

– Threads work locally and commit when done

– If two threads conflict, one rolls back & restarts

– Benefit: no complex static analysis

• Drawbacks:

– software versions: lots of copying, can be slow

– hardware versions: need new hardware

– both: some operations cannot be rolled back (e.g., fire missile)

• How does this compare with Autolocker?

Page 67: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

67

Experimental Questions

• Question 1:

– What is the performance cost of using Autolocker?

– How does it compare to other concurrency models?

Page 68: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

68

Concurrent Hash Table

• Simple microbenchmark

• Goal: Same performance as hand-optimized

– low overhead

– high scalability

• Compared Autolocker to:

– manual locking

– Fraser’s object-based transactional memory mgr.

– Ennal’s revised transactional memory mgr.

Page 69: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

69

Hash Table Benchmark

Machine: four processors, 1.9 GHz, HyperThreading, 1 GB RAMEach datapoint is the average of 4 runs after 1 warmup run

Page 70: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

70

Experimental Questions

• Question 1:

– What is the performance cost of using Autolocker?

– How does it compare to other concurrency models?

• Question 2:

– Autolocker may reject code for “potential deadlocks”

– Does it accept enough programs to be useful?

– Will it only accept easy, coarse-grained policies?

Page 71: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

71

AOLServer

• Threaded web server using manual locking

• Goal: Preserve the original locking policy with AL

Size52,589 lines

82 modules

Changes143 atomic sections added

126 types annotated with protections

Problems

175 “trusted” casts (23 worrisome)

78/82 modules used original locking policies

Performance negligible impact (~3%)

Page 72: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

72

Conclusion

• Contributions:

– a new model for programming parallel systems

– a transformation tool to implement it

• Benefits:

– performance close to well-written manual locking

– freedom from deadlocks

– freedom from races on protected data

– very good performance when “ops/sync” is high

– increase performance without increasing bugs!

Page 73: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

References

• Optimizing memory transactions, Tim Harris, Mark Plesko, Avraham Shinnar, David Tarditi, PLDI’06

• Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL’06

• The free lunch is over: A fundamental turn toward concurrency in software, Herb Sutter, Dr. Dobb’s Journal, 2005

• Unlocking Concurrency, Ali-Reza Adl-Tabatabai et al, ACM Queue, 2006

• Transactional Memory, James Larus, Ravi Rajwar, Morgan Kaufmann, 2006 (book)

Contact me: [email protected]

http://zhoufeng.net73

Page 74: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

Backup slides

74

Page 75: Programming Multi-Core Systems 周枫 网易公司 2007-12-28 清华大学

A Lot of Questions to Answer

75From “The Landscape of Parallel Computing Research: A View from Berkeley”