programming multi-core systems 周枫网易公司 2007-12-28 清华大学

Programming Multi-Core Systems

周枫网易公司

2007-12-28 清华大学http://zhoufeng.net

Last Week

• A trend in software: “The Movement to Safe Languages”

• Improving dependability of system software without using safe languages

2

A Trend in Hardware

“The Movement to Multi-core Processors”

• Originates from inability to increase processor clock rate

• Profound impact on architecture, OS and applications

3

Topic today: How to program multi-core systems effectively?

Why Multi-Core?

• Areas of improving CPU performance in last 30 years

1. Clock speed

2. Execution optimization (Cycles-Per-Instruction)

3. Cache

• All 3 are concurrency-agnostic

• Now: 1 disappears, 2 slows down, 3 still good

– 1: No 4Ghz Intel CPUs

– 2: Improves with new micro-architecture (e.g. Core vs. NetBurst)

4

CPU Speed History

5From “Computer Architecture: A Quantitative Approach”, 4th edition, 2007

Reality Today

• Near-term performance drivers

1. Multicore

2. Hardware threads (a.k.a. Simultaneous Multi-Threading)

3. Cache

• 1 and 2 useful only when software is concurrent

6

“Concurrency is the next revolution in how we write software” --- “The Free Lunch is Over”

Costs/Problems of Concurrency

• Overhead of locks, message passing…

• Not all programs are parallelizable

• Programming concurrently is HARD

– Complex concepts: mutex, read-write lock, queue…

– Correct synchronization: race, deadlocks…

– Getting speed-up

7

Our focus today

Potential Multi-Core Apps

8

Application Category Examples

Server apps w/o shared state Apache web server

Server apps with shared state MMORPG game server

Stream-Sort data processing MapReduce, Yahoo Pig

Scientific computing (many different models)

BLAS, Monte Carlo, N-Body …

Machine learning HMM, EM algorithm

Graphics and games NVIDIA Cg, GPU computing

…… ……

Roadmap

• Overview of multi-core computing

• Overview of transactional programming

• Case: Transactions for safe/managed languages

• Case: Transactions for languages like C

9

Status Quo in Synchronization

• Current mechanism: manual locking

• Organization: lock for each shared structure

• Usage: (block) acquire access release

• Correctness issues

– Under-locking data races

– Acquires in different orders deadlock

• Performance issues

– Difficult to find right granularity

– Overhead of acquiring vs. allowed concurrency10

Transactions / Atomic Sections

• Databases has provided automatic concurrency control for 30 years: ACID transactions

• Vision:

– Atomicity

– Isolation

– Serialization only on conflicts

– (optional) Rollback/abort support

11

Question: Is it possible to provide database transactionsemantics to general programming?

Transactions vs. Manual Locks

• Manual locking issues:

– Under-locking

– Acquires in different orders

– Blocking

– Conservative serialization

12

• How transactions help:

– No explicit locks

– No ordering

– Can cancel transactions

– Serialization only on conflicts

Transactions: Potentially simpler and more efficient

Design Space

• Hardware Transactional Memory vs. software TM

• Granularity: object, word, block

• Update method

– Deferred: discard private copy on aborts

– Direct: control access to data, erase update on aborts

• Concurrency control

– Pessimistic: prevent conflicts by locking

– Optimistic: assumes no conflict and retry if there is

• …13

Hard Issues

• Communications, or side-effects

– File I/O

– Database accesses

• Interaction with other abstractions

– Garbage collection

– Virtual memory

• Work with existing languages, synchronization primitives, …

14

Why software TM?

• More flexible

• Easier to modify and evolve

• Integrate better with existing systems and languages

• Not limited to fixed-size hardware structures, e.g. caches

15

Proposed Language Support

Guard condition

16

// Insert into a doubly-linked list atomicallyatomic { newNode->prev = node; newNode->next = node->next; node->next->prev = newNode; node->next = newNode; }

atomic (queueSize > 0) { // remove item from queue and use it }

Transactions forManaged/Safe Languages

17

Intro

• Optimizing Memory Transactions,Harris et al. PLDI’06

• STM system from MSR Cambridge, for MSIL (.Net)

• Design features

– Object-level granularity

– Direct update

– Version numbers to track reads

– 2-phase locking to track write

– Compiler optimizations

18

First-gen STM

• Very expensive. E.g. Harris+Fraser, OOPSLA ’03):

– Every load and store instruction logs information into a thread-local log

– A store instruction writes the log only

– A load instruction consults the log first

– At the end of the block: validate the log; and atomically commit it to shared memory

19

Direct Update STM

• Augment objects with (i) a lock, (ii) a version number

• Transactional write:

– Lock objects before they are written to (abort if another thread has that lock)

– Log the overwritten data – we need it to restore the heap case of retry, transaction abort, or a conflict with a concurrent thread

– Make the update in place to the object

20

• Transactional read:

– Log the object’s version number

– Read from the object itself

• Commit:

– Check the version numbers of objects we’ve read

– Increment the version numbers of object we’ve written, unlocking them

21

22

Example: contention between transactions

atomic { t = c1.val; t ++; c1.val = t;}

Thread T2

int t = 0;atomic { t += c1.val; t += c2.val;}

Thread T1

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

T2’s log:

23



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

T2’s log:

T1 reads from c1: logs that it saw

version 100

24



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

c1.ver=100

T2’s log:

T2 also reads from c1: logs that it saw

version 100

25



Thread T2


Thread T1

c1.ver=100c2.ver=200

T1’s log:

ver = 200

val = 40

c2

ver = 100

val = 10

c1

c1.ver=100

T2’s log:

Suppose T1 now reads from c2, sees it

at version 200

26



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 10

c1

c1.ver=100lock: c1, 100

T2’s log:

Before updating c1, thread T2 must lock it: record old

version number

locked:T2

27



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1

c1.ver=100lock: c1, 100c1.val=10

T2’s log:(2) After logging the old

value, T2 makes its update in place to c1

locked:T2

(1) Before updating c1.val, thread T2 must log the data

it’s going to overwrite

28



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1


T2’s log:

ver=101

(2) T2’s transaction commits successfully. Unlock the object,

installing the new version number

(1) Check the version we locked matches the version

we previously read

29



Thread T2


Thread T1

c1.ver=100

T1’s log:

ver = 200

val = 40

c2

val = 11

c1


T2’s log:

ver=101

(1) T1 attempts to commit. Check the versions it read are still up-to-date.

(2) Object c1 was updated from version 100 to 101, so T1’s transaction is

aborted and re-run.

30

Compiler integration

• We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL)

– OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples)

– OpenForUpdate – before the first time we update an object

– LogOldValue – before the first time we write to a given field

atomic { … t += n.value; n = n.next; …}

OpenForRead(n);t = n.value;OpenForRead(n);n = n.next;

OpenForRead(n);t = n.value;n = n.next;

Source code Basic intermediate code Optimized intermediate code

31

Runtime integration – garbage collection

atomic { …

}

1. GC runs while some threads are in atomic blocks

obj1.field = oldobj2.ver = 100obj3.locked @ 100

32


atomic { …

}



2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed

33


atomic { …

}




3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

34


atomic { …

}




3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds

35

Results: Against Previous WorkN

orm

alis

ed e

xecu

tion

time

Sequential baseline (1.00x)

Coarse-grained locking (1.13x)

Fine-grained locking (2.57x)

Direct-update STM (2.04x)

Direct-update STM + compiler integration

(1.46x)

Harris+Fraser WSTM (5.69x)

Workload: operations on a red-black tree, 1

thread, 6:1:1 lookup:insert:delete mix

with keys 0..65535

Scalable to multicore

Scalability (µ-benchmark)

36#threads

Fine-grained locking

Direct-update STM + compiler integration

WSTM (atomic blocks)

Coarse-grained locking

Mic

rose

cond

s pe

r op

erat

ion

DSTM (API)

OSTM (API)

37

Results: long running tests2

.04

2.1

1

6.0

5

5.8

4

3.9

1

2.0

8

2.0

5

4.3

2

3.1

6

2.6

1

1.7

2

1.6

8

1.0

0

1.0

0

1.0

0

1.0

0

1.0

0

0

1

2

3

4

5

6

7

1 2 3 4 5tree goskip merge-sort xlisp

Norm

alis

ed e

xecu

tion t

ime

10.8 73 16

2

Direct-update STM

Run-time filteringCompile-time optimizations

Original application (no tx)

Summary

• A pure software implementation can perform well and scale to vast transactions

– Direct update

– Pessimistic & optimistic CC

– Compiler support & optimizations

• Still need a better understanding of realistic workload distributions

38

Atomic Sections for C-like Languages

39

40

Intro

• Autolocker: Synchronization Inference for Atomic SectionsBill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006

• Question: Can we have language support for atomic sections for languages like C?

– No object meta-data

– No type safety

– No garbage collection

• Answer: Let programmer help a bit (w/ annotations)

– And do a simpler one: NO aborts!

41

Autolocker: C + Atomic Sections

• Shared data is protected by annotated locks

• Threads access shared data in atomic sections:

• Threads never deadlock (due to Autolocker)

• Threads never race for protected data

• How can we implement this semantics?

mutex m;int shared_var protected_by(m);atomic { ... x = shared_var; ... }

Code runs as if a single lock protects all atomic sections

42

Autolocker Transformation

• Autolocker is a source-to-source transformation

int m1, m2;int x;int y;

begin_atomic(); acquire(m1); x = 3; acquire(m2); y = 2;end_atomic();

mutex m1, m2;int x protected_by(m1);int y protected_by(m2);

atomic { x = 3; y = 2;}

Autolocker code C code

43






atomic { x = 3; y = 2;}

Atomic sections can be nested arbitrarilyThe nesting level is tracked at runtime


44




atomic { x = 3; y = 2;}

Locks are acquired as neededLock acquisitions are reentrant




45




atomic { x = 3; y = 2;}

Locks are released when outermost section endsStrict two-phase locking: guarantees atomicity




46




atomic { x = 3; y = 2;}

Locks are acquired in a global orderAcquiring locks in order will never lead to deadlock




47

Outline

• Introduction

– semantics

– usage model and benefits

• Autolocker algorithm

– match locks to data

– order lock acquisitions

– insert lock acquisitions

• Related work

• Experimental evaluation

• Conclusion

48

finish heremany fine-grained locks

low contentionhigh parallelism

start hereone coarse-grained lock

lots of contentionlittle parallelism

• Typical process for writing threaded software:

• Linux kernel evolved to SMP this way

• Autolocker helps you program in this style

Autolocker Usage Model

shared data

threads

lock

49

Granularity and Autolocker

• In Autolocker, annotations control performance:

• Simpler than fixing all uses of shared_var

• Changing annotations won’t introduce bugs

– no deadlocks

– no new race conditions

int shared_var protected_by(kernel_lock);

int shared_var protected_by(fs->lock);

50

Outline

• Introduction

– semantics



– match locks to data

– order lock acquisitions

– insert lock acquisitions

• Related work


• Conclusion

51

generate globallock order

Algorithm Summary

atomic { …}

C programwith atomic

sections

begin acq L1, L2;end

lock order

cyclic?

yesno

fail and reportpotential deadlock

C programwith acquirestatements

insert orderedlock acquisitions

lockrequirements

remove redundantacquisitions

begin acq L2;end

match locks to data

52

Algorithm Summary

atomic { …}


sections


lock order

cyclic?

yesno



match locks to data



lockrequirements


begin acq L2;end

53

Matching Locks to Data

foo bar

quxbaz

functions

C program withatomic sections

void foo() { atomic { use m2; y = 3; use m1; x = 2; }}


Symbol table

use L ≡

the lock L must already be held when this statement is reached

54

Algorithm Summary

atomic { …}


sections


lock order

cyclic?

yesno





lockrequirements


begin acq L2;end

match locks to data

55

Acquisition Placement

• Assume there’s an acyclic order “<” on locks

• We need to turn uses into acquires:


global lock order

We must acquire m2…

m1 < m2

56





global lock order


... but since m1 is needed later

… and it’s ordered first

m1 < m2

57




void foo() { atomic { acquire m1; acquire m2; y = 3; use m1; x = 2; }}

m1 < m2

global lock order


... but since m1 is needed later

… and it’s ordered first

… we acquire m1 first

58




void foo() { atomic { acquire m1; acquire m2; y = 3; use m1; x = 2; }}

global lock order

Preemptive acquiresPreemptively take locks that:• are used later• but ordered before

m1 < m2

59




void foo() { atomic { acquire m1; acquire m2; y = 3; acquire m1; x = 2; }}

global lock order

m1 < m2

Eventually we’ll optimizeaway this acquisition

60

match locks to data

Algorithm Summary

atomic { …}


sections


lock order

cyclic?

yesno





lockrequirements


begin acq L2;end

61

• Some lock orders break the insertion algorithm!

• We say that “p->m < m” is not a feasible order

• So how do we find a feasible lock order?

What Is a Good Lock Order?

use m;p = p->next;use p->m;

Assume p->m < macquire p->m;acquire m;p = p->next;acquire p->m;

If “p->m < m” then m cannot be acquired before any lock called p->m

… this lock has never been acquired before.It is acquired after m!

Because of the assignment…via insertion algo

62

Finding a Feasible Lock Order

• We want only feasible orders

• We search for any code matching this pattern:

• Any feasible order must ensure e1 < e2

• Scan through all code to find constraints

use e1;…maykill e2;…use e2;

e2 cannot be acquired before the update.It’s value is different above there.

Signifies anything that may affect e2’s value(like “p = p->next” when e2 = “p->m”)

e1 cannot be acquired after it is used.

63

Finding a Feasible Lock Order

foo bar

quxbaz

functions

Constraints

m1 < p->m’

p->m < p->m’

m1 < r->mm3 < p->m

q->m < p->m

m1r->mq->mp->mp->m’

Global Lock Order

search forinfeasiblepatterns

topological sort

64

Algorithm Summary

atomic { …}


sections


lock order

cyclic?

yesno





lockrequirements


begin acq L2;end

match locks to data

65

Outline

• Introduction

– semantics



– computing protections

– lock ordering

– acquisition placement

• Related work


• Conclusion

66

Comparison

• Transactional memory with optimistic CC

– Threads work locally and commit when done

– If two threads conflict, one rolls back & restarts

– Benefit: no complex static analysis

• Drawbacks:

– software versions: lots of copying, can be slow

– hardware versions: need new hardware

– both: some operations cannot be rolled back (e.g., fire missile)

• How does this compare with Autolocker?

67

Experimental Questions

• Question 1:

– What is the performance cost of using Autolocker?

– How does it compare to other concurrency models?

68

Concurrent Hash Table

• Simple microbenchmark

• Goal: Same performance as hand-optimized

– low overhead

– high scalability

• Compared Autolocker to:

– manual locking

– Fraser’s object-based transactional memory mgr.

– Ennal’s revised transactional memory mgr.

69

Hash Table Benchmark

Machine: four processors, 1.9 GHz, HyperThreading, 1 GB RAMEach datapoint is the average of 4 runs after 1 warmup run

70

Experimental Questions

• Question 1:

– What is the performance cost of using Autolocker?

– How does it compare to other concurrency models?

• Question 2:

– Autolocker may reject code for “potential deadlocks”

– Does it accept enough programs to be useful?

– Will it only accept easy, coarse-grained policies?

71

AOLServer

• Threaded web server using manual locking

• Goal: Preserve the original locking policy with AL

Size52,589 lines

82 modules

Changes143 atomic sections added

126 types annotated with protections

Problems

175 “trusted” casts (23 worrisome)

78/82 modules used original locking policies

Performance negligible impact (~3%)

72

Conclusion

• Contributions:

– a new model for programming parallel systems

– a transformation tool to implement it

• Benefits:

– performance close to well-written manual locking

– freedom from deadlocks

– freedom from races on protected data

– very good performance when “ops/sync” is high

– increase performance without increasing bugs!

References

• Optimizing memory transactions, Tim Harris, Mark Plesko, Avraham Shinnar, David Tarditi, PLDI’06

• Autolocker: Synchronization Inference for Atomic Sections, Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL’06

• The free lunch is over: A fundamental turn toward concurrency in software, Herb Sutter, Dr. Dobb’s Journal, 2005

• Unlocking Concurrency, Ali-Reza Adl-Tabatabai et al, ACM Queue, 2006

• Transactional Memory, James Larus, Ravi Rajwar, Morgan Kaufmann, 2006 (book)

Contact me: [email protected]

http://zhoufeng.net73

Backup slides

74

A Lot of Questions to Answer

75From “The Landscape of Parallel Computing Research: A View from Berkeley”

programming multi-core systems 周枫 网易公司 2007-12-28 清华大学

Documents

programming multi-core systems 周枫网易公司 2007-12-28 清华大学