
Performance Evaluation of Adaptivity in STM

Mathias Payer and Thomas R. Gross
Department of Computer Science, ETH Zürich

ISPASS'11 / 2011-04-12 Mathias Payer / ETH Zürich

Motivation
● STM systems rely on many assumptions
  ● Often contradictory for different programs
  ● Statically tuned to a baseline
● Use self-optimizing systems
  ● Adapt to different workloads
● What parameters can be adapted?
● How to measure effectiveness?


Outline
● Introduction
● STM System
  ● STM Baseline
  ● Adaptive Parameters
● Evaluation
● Related work
● Conclusion


Introduction
● Software Transactional Memory (STM) applies transactions to memory
  ● (Optimistic) concurrency control mechanism
  ● Alternative to lock-based synchronization
● Multiple concurrent threads run transactions
  ● Concurrent memory modifications


Introduction
● Concurrent transactions modify memory without synchronization
  ● Transaction is verified after completion
  ● Conflicts are detected and resolved
  ● Changes committed for conflict-free transactions
  ● Modifications only visible after commit


Introduction

withdraw {
  tmp = balance;
  tmp = tmp - 100;
  balance = tmp;
}

deposit {
  tmp = balance;
  tmp = tmp + 100;
  balance = tmp;
}

● What happens when balance is accessed concurrently?
● Either locking or STM is needed to ensure a correct end balance
● STM system decides which tx is executed first

[Figure: transaction timeline. TX starts; balance in read-set; balance in write-set; conflict detection, data committed.]
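
The timeline above (read-set, write-set, conflict detection, commit) can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not AdaptSTM's implementation: the names VersionedCell, Transaction, and run are hypothetical, writes are buffered (write-back), and the interleaving is staged by hand instead of using real threads, so the commit step is not made atomic here.

```python
class VersionedCell:
    """A shared memory word with a version counter (hypothetical helper)."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    """Buffers writes and validates the read-set at commit time."""

    def __init__(self):
        self.read_set = {}    # cell -> version observed at first read
        self.write_set = {}   # cell -> buffered new value

    def read(self, cell):
        if cell in self.write_set:           # read our own pending write
            return self.write_set[cell]
        self.read_set.setdefault(cell, cell.version)
        return cell.value

    def write(self, cell, value):
        self.write_set[cell] = value

    def commit(self):
        # Conflict detection: every cell we read must still be unchanged.
        for cell, seen in self.read_set.items():
            if cell.version != seen:
                return False                 # conflict, caller must retry
        for cell, value in self.write_set.items():
            cell.value = value               # changes become visible here
            cell.version += 1
        return True


def run(cell, delta):
    """Retry the read-modify-write until it commits."""
    while True:
        tx = Transaction()
        tx.write(cell, tx.read(cell) + delta)
        if tx.commit():
            return


balance = VersionedCell(1000)

# Staged interleaving: both transactions read before either one commits.
withdraw, deposit = Transaction(), Transaction()
withdraw.write(balance, withdraw.read(balance) - 100)
deposit.write(balance, deposit.read(balance) + 100)
first = withdraw.commit()     # succeeds, balance becomes 900
second = deposit.commit()     # fails validation, balance changed since the read
if not second:
    run(balance, +100)        # deposit retries against the new state
```

Without the validation step the deposit would overwrite the withdrawal with a stale value; with it, the conflicting transaction is retried and the end balance is correct.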


STM Baseline
● Many efficient STM implementations agree on important design decisions:
  ● Word-based locking
  ● Global locking / version table
  ● Eager locking
  ● (Almost) no contention management
  ● Simple write-set and read-set implementations


STM Baseline

[Figure: a combined global write lock / version array used by two transactions. Each transaction has its own lock list, a write hash with write list / buffer, and a read hash with read list / buffer.]
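
To make the combined write lock / version array concrete, here is a minimal sketch of how an address might be mapped to a slot. The class GlobalVersionTable, the low-bit lock encoding, and the parameters size_log2 and shift are assumptions chosen for illustration; the slides do not specify them. A real implementation would update slots with atomic compare-and-swap, which this single-threaded sketch omits.

```python
class GlobalVersionTable:
    """Combined write-lock / version array indexed by a hashed address."""

    LOCK_BIT = 1  # assumed encoding: odd slot value = locked, value >> 1 = version

    def __init__(self, size_log2=16, shift=3):
        self.mask = (1 << size_log2) - 1
        self.shift = shift
        self.slots = [0] * (1 << size_log2)

    def index(self, addr):
        # Dropping `shift` low bits maps 2**shift neighbouring bytes to one slot.
        return (addr >> self.shift) & self.mask

    def try_lock(self, addr):
        i = self.index(addr)
        if self.slots[i] & self.LOCK_BIT:
            return False          # slot already held by some writer
        self.slots[i] |= self.LOCK_BIT
        return True

    def unlock_and_bump(self, addr):
        # Release the lock and advance the version by one (stored shifted by 1).
        i = self.index(addr)
        self.slots[i] = (self.slots[i] & ~self.LOCK_BIT) + 2

    def version(self, addr):
        return self.slots[self.index(addr)] >> 1


table = GlobalVersionTable(size_log2=4, shift=3)   # tiny table to force a collision
got_a = table.try_lock(0x1000)        # True
got_b = table.try_lock(0x1080)        # False: different word, same slot (over-locking)
got_c = table.try_lock(0x1008)        # True: neighbouring word, different slot
table.unlock_and_bump(0x1000)
```

The tiny table shows the trade-off: a smaller table or a larger shift puts more addresses behind one lock, so unrelated writes conflict, while a larger table spreads accesses over more memory.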


Adaptive STM Parameters
● Global adaptivity
  ● Synchronization needed
  ● Optimizes to global optimum
  ● Averages over all concurrent transactions
● (Thread-) local adaptivity
  ● No synchronization needed
  ● Limits adaptable parameters
  ● Best parameters for each thread/transaction


Adaptive STM Parameters
● Different adaptive parameters measured:
  ● Size of global locking/version-table *G
  ● Size of local hash-tables *L
  ● Write strategy *L
  ● Locality tuning for hash-functions *L
  ● Contention management *L

*L: local, *G: global


Adaptive Hash-Table
● Global hash-table: trade-off between over-locking and locality
  ● Global strategy: coordinate lock collisions and over-locking between threads
  ● Adapt size based on global information
● Local hash-table: trade-off between reset cost and # hash-collisions
  ● Local strategy: sample moving average of unique write locations
  ● Adapt size based on trend
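
The local strategy can be sketched as follows. The slide only says that a moving average of unique write locations is sampled and the size follows the trend; the exponential moving average, the class name WriteHashSizer, and every threshold below are assumptions chosen for illustration.

```python
class WriteHashSizer:
    """Tracks a moving average of unique writes per transaction (sketch)."""

    def __init__(self, size=64, alpha=0.25, grow_at=0.5, shrink_at=0.125,
                 min_size=16, max_size=1 << 16):
        self.size = size            # current number of hash-table slots
        self.alpha = alpha          # weight of the newest sample (assumed)
        self.grow_at = grow_at      # average load above which the table doubles
        self.shrink_at = shrink_at  # average load below which the table halves
        self.min_size = min_size
        self.max_size = max_size
        self.avg = 0.0

    def sample(self, unique_writes):
        """Feed one finished transaction and return the (possibly new) size."""
        self.avg += self.alpha * (unique_writes - self.avg)
        load = self.avg / self.size
        if load > self.grow_at and self.size < self.max_size:
            self.size *= 2          # many collisions expected, enlarge
        elif load < self.shrink_at and self.size > self.min_size:
            self.size //= 2         # mostly empty, cheaper to reset when smaller
        return self.size


sizer = WriteHashSizer(size=64)
for _ in range(10):
    sizer.sample(200)               # write-heavy phase
grown = sizer.size
for _ in range(40):
    sizer.sample(2)                 # phase with tiny transactions
shrunk = sizer.size
```

Using an average instead of the last sample keeps a single unusually large transaction from resizing the table, which matches the slide's emphasis on following the trend.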


Adaptive Write Strategy
● Different costs depending on strategy
  ● Write-back: cheap abort, expensive commit
  ● Write-through: expensive abort, cheap commit
● Adapt strategy to per-thread workload
  ● Measure abort rate
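
A per-thread selector driven by the abort rate might look like the sketch below. The window length, the two thresholds, and the class name WriteStrategySelector are assumptions; the slides state only that the abort rate is measured and the strategy adapted per thread.

```python
class WriteStrategySelector:
    """Picks write-back or write-through from the recent abort rate (sketch)."""

    WRITE_BACK = "write-back"          # cheap abort, expensive commit
    WRITE_THROUGH = "write-through"    # expensive abort, cheap commit

    def __init__(self, window=100, high=0.20, low=0.05):
        self.window = window           # transactions per measurement window
        self.high = high               # abort rate above which aborts dominate
        self.low = low                 # abort rate below which commits dominate
        self.strategy = self.WRITE_BACK
        self.total = 0
        self.aborts = 0

    def record(self, aborted):
        """Record one transaction outcome and return the current strategy."""
        self.total += 1
        self.aborts += int(aborted)
        if self.total == self.window:
            rate = self.aborts / self.total
            if rate > self.high:
                self.strategy = self.WRITE_BACK
            elif rate < self.low:
                self.strategy = self.WRITE_THROUGH
            self.total = self.aborts = 0   # start a fresh window
        return self.strategy


selector = WriteStrategySelector(window=10)
for _ in range(10):
    calm = selector.record(aborted=False)        # no aborts in this window
for i in range(10):
    contended = selector.record(aborted=(i % 2 == 0))   # half of them abort
```

The gap between the two thresholds is a deliberate choice in this sketch: it avoids flipping the strategy back and forth when the abort rate hovers around a single cut-off.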


Adaptive Locality Tuning
● Different applications have different data access patterns
  ● No optimal hash function for all data accesses
● Measure number of hash collisions for thread-local hash tables
  ● Cycle through different hash functions
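
The idea of switching hash functions when collisions pile up can be sketched like this. The three hash functions, the collision threshold, and the names HASH_FUNCTIONS, count_collisions, and HashTuner are illustrative assumptions; the slides do not list the functions AdaptSTM actually uses.

```python
HASH_FUNCTIONS = [
    # Plain word index: drop the 3 low bits, keep `bits` bits.
    lambda addr, bits: (addr >> 3) & ((1 << bits) - 1),
    # Fold higher address bits into the index with XOR.
    lambda addr, bits: ((addr >> 3) ^ (addr >> (3 + bits))) & ((1 << bits) - 1),
    # Multiplicative hashing on 32 bits, keeping the top `bits` bits.
    lambda addr, bits: ((addr * 2654435761) & 0xFFFFFFFF) >> (32 - bits),
]


def count_collisions(addrs, hash_fn, bits):
    """Count addresses that land in a slot already owned by another address."""
    owners = {}
    collisions = 0
    for addr in addrs:
        slot = hash_fn(addr, bits)
        if owners.setdefault(slot, addr) != addr:
            collisions += 1
    return collisions


class HashTuner:
    """Moves to the next hash function when a transaction collides too often."""

    def __init__(self, bits=6, max_collisions=4):
        self.bits = bits
        self.max_collisions = max_collisions
        self.current = 0

    def after_transaction(self, write_addrs):
        fn = HASH_FUNCTIONS[self.current]
        if count_collisions(write_addrs, fn, self.bits) > self.max_collisions:
            self.current = (self.current + 1) % len(HASH_FUNCTIONS)
        return self.current


# Writes with a 512-byte stride all share the same low index bits.
strided = [0x10000 + i * 512 for i in range(16)]
tuner = HashTuner(bits=6)
after_first = tuner.after_transaction(strided)    # plain index collides, switch
after_second = tuner.after_transaction(strided)   # XOR-folded index fits, stay
```

The strided pattern is the kind of workload where a fixed hash function does badly: every write maps to slot 0 under the plain index, while folding in higher bits spreads the same addresses over 16 distinct slots.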


Adaptive Contention Management
● No single strategy works in all environments
● Measure contention and implement an adaptive back-off strategy
  ● Wait and retry
  ● Abort later
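
One common way to realize "wait and retry" is a randomized back-off window that widens on aborts and narrows on commits. The sketch below shows that general technique, not the policy used in AdaptSTM; the class AdaptiveBackoff and the parameters base and cap are assumptions, and the "abort later" option from the slide is not modelled.

```python
import random


class AdaptiveBackoff:
    """Randomized back-off whose window follows observed contention (sketch)."""

    def __init__(self, base=1, cap=1024, rng=None):
        self.base = base                  # smallest back-off window
        self.cap = cap                    # largest back-off window
        self.limit = base                 # current window
        self.rng = rng or random.Random()

    def on_abort(self):
        """Return how many spin iterations to wait before retrying."""
        wait = self.rng.randint(0, self.limit)     # inclusive on both ends
        self.limit = min(self.limit * 2, self.cap) # more contention, wait longer
        return wait

    def on_commit(self):
        """A successful commit suggests contention is dropping."""
        self.limit = max(self.limit // 2, self.base)


backoff = AdaptiveBackoff(base=1, cap=8, rng=random.Random(0))
waits = [backoff.on_abort() for _ in range(5)]   # windows used: 1, 2, 4, 8, 8
limit_after_aborts = backoff.limit
backoff.on_commit()
limit_after_commit = backoff.limit
```

Because the state lives in one object per thread, the policy needs no synchronization, in line with the thread-local adaptivity described earlier.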


Local Adaptive STM Parameters (for local hash-table)

[Figure: three decision diagrams. (1) Axis "# writes vs. hash-table space" with regions "enlarge write-hash", "no change", and "shrink write-hash". (2) Axis "# hash collisions" with regions "no change" and "change hash-function". (3) Both axes combined, adding the regions "enlarge write-hash & change hash-function" and "shrink write-hash & change hash-function".]
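
The regions of the combined diagram can be read as a small decision function over the two measurements. The function name decide and all threshold values are assumptions for illustration; the slide shows the regions but gives no numbers.

```python
def decide(unique_writes, table_size, collisions,
           grow_at=0.5, shrink_at=0.125, max_collisions=4):
    """Return the adaptation actions for one thread-local write hash (sketch)."""
    actions = []
    load = unique_writes / table_size      # "# writes vs. hash-table space"
    if load > grow_at:
        actions.append("enlarge write-hash")
    elif load < shrink_at:
        actions.append("shrink write-hash")
    if collisions > max_collisions:        # "# hash collisions"
        actions.append("change hash-function")
    return actions or ["no change"]


full_table = decide(unique_writes=40, table_size=64, collisions=0)
sparse_but_colliding = decide(unique_writes=4, table_size=64, collisions=10)
balanced = decide(unique_writes=20, table_size=64, collisions=1)
```

The two axes are evaluated independently here, which is why the combined diagram has corner regions where the table is resized and the hash function changed in the same step.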


AdaptSTM
● Adaptive STM system built on presented features
● Statically tuned competitive baseline
  – Static global hash function and hash table
● Mature and stable implementation
● Different local adaptive parameters
  – Write-set hash function and size of hash table
  – Write-through and write-back write strategy
  – Adaptive contention management


Evaluation
● Benchmark: STAMP 0.9.10
  ● ++ configuration (increased workload for kmeans)
● AdaptSTM version 0.5.1
● Intel 4-core Xeon E5520 CPU
  ● 8 cores @ 2.27GHz, 12GB RAM
  ● 64bit Ubuntu 9.04


Evaluation: Global Hash-Table

[Figure: two panels, Genome and kmeans, each with 4 threads. x-axis: # Shifts (0 to 10); y-axis: Time [s]; one series per legend entry 2^16, 2^18, 2^20, 2^22, 2^24, 2^26.]


Evaluation: Global Adaptivity
● Global optimizations have limited potential
  ● Small optimization potential
  ● High synchronization cost
  ● Reasonable baseline outperforms global optimization


Evaluation: Local Adaptivity
● Different configurations:
  ● naWB: no adaptivity, use write-back
  ● aWBT: adaptivity, adjust write-through / write-back
  ● aWWH: aWBT plus an adaptive hash-table for the write-set
  ● aWHH: aWWH plus different hash functions
  ● aALL: all adaptive parameters plus Bloom filter for write-entries
● Adaptation system starts with best 'average' parameters, improves from there
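
The aALL configuration adds a Bloom filter for write-entries. The slides do not say how it is used; one plausible use, sketched here as an assumption, is a cheap pre-check that lets a read skip the write-set lookup when the address has definitely not been written. The class WriteSetBloom, the single-word filter, and the two bit positions per address are hypothetical choices.

```python
class WriteSetBloom:
    """Tiny Bloom filter over written addresses, kept in one integer (sketch)."""

    def __init__(self, bits=64):
        self.bits = bits
        self.word = 0

    def _mask(self, addr):
        h = addr >> 3                                 # word index of the address
        first = h % self.bits                         # first bit position
        second = (h // self.bits) % self.bits         # second bit position
        return (1 << first) | (1 << second)

    def add(self, addr):
        self.word |= self._mask(addr)

    def may_contain(self, addr):
        """False means definitely not written; True means look in the write-set."""
        mask = self._mask(addr)
        return (self.word & mask) == mask

    def reset(self):
        self.word = 0                                 # cleared at transaction start


bloom = WriteSetBloom()
bloom.add(0x1000)
hit = bloom.may_contain(0x1000)       # True: this address was written
miss = bloom.may_contain(0x1008)      # False: the write-set lookup can be skipped
bloom.reset()
after_reset = bloom.may_contain(0x1000)
```

A Bloom filter never reports a written address as absent, so skipping the lookup on a negative answer is safe; false positives only cost an unnecessary lookup.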


Evaluation: Local Adaptivity

● aWBT: adaptive, write-back/-through

● aWWH: adaptive, write-back/-through, write-hash

● aWHH: adaptive, write-back/-through, write-hash, hash-function

● aALL: adaptive, write-back/-through, write-hash, hash-function, Bloom filter

[Figure: four panels (kmeans, Labyrinth, Genome, Vacation). x-axis: Threads (1, 2, 4, 8, 16); y-axis: speedup relative to the non-adaptive configuration, in percent; series: aWBT, aWWH, aWHH, aALL.]


Evaluation: Local Adaptivity
● No single optimization works for all benchmarks
● Combination of all options leads to best performance
● Impressive speed-ups for individual benchmarks compared to the globally optimized case


Related Work
● TL2 (Dice et al.): baseline STM system
● Different related work on static tuning of global parameters (Harris, Dice, Ennals, Felber)
  ● Crucial for efficient baseline
● TinySTM (Felber et al.): adapts size and hash function of global locking table
● ASTM (Marathe et al.): adapts lazy-eager locking strategies and different meta-formats


Conclusions
● Adaptivity in STM is important for good performance
  ● Speedups up to 10% possible
● Global optimizations are limited
  ● Low potential, high synchronization cost
● Local optimizations tune thread-local parameters
  ● High correlation with workload


Questions

● Contact: [email protected]
● Source: http://nebelwelt.net/projects/adaptSTM/



[Figure: eight panels (Bayes, Genome, Vacation, kmeans, Labyrinth, Intruder, SSCA2, YADA), each with 4 threads. x-axis: # Shifts (0 to 10); y-axis: Time [s]; one series per legend entry 2^16, 2^18, 2^20, 2^22, 2^24, 2^26.]


STM Comparison

[Figure: four panels (Genome, Vacation, Labyrinth, Intruder). x-axis: Threads (1, 2, 4, 8, 16); y-axis: relative run time; series: astm, tl2, tstm, tstm099.]



[Figure: three panels (Bayes, SSCA2, YADA). x-axis: Threads (1, 2, 4, 8, 16); y-axis: speedup relative to the non-adaptive configuration, in percent; series: aWBT, aWWH, aWHH, aALL.]
