An Update on Haskell H/STM¹
Ryan Yates and Michael L. Scott
University of Rochester
TRANSACT 10, 6-15-2015
¹ This work was funded in part by the National Science Foundation under grants CCR-0963759, CCF-1116055, CCF-1337224, and CCF-1422649, and by support from the IBM Canada Centres for Advanced Studies.
Outline
Haskell TM.
Our implementations using HTM.
Performance results.
Future work.
Slides: http://goo.gl/0ZFJXJ Paper: http://goo.gl/Er29ef
Haskell TM
At last year's TRANSACT we reported around 280 open source libraries in the Haskell ecosystem that depend on STM. Now there are over 400.
Reasons for using Haskell STM:
It is easy!
Expressive API with retry and orElse.
Trivial to build into libraries.
Haskell STM Example
transA = do
    v <- dequeue queue1
    return v
transB = do
    v <- dequeue queue2
    if someCondition v
        then return v
        else retry
...
mainLoop = do
    a <- atomically (transA `orElse` transB `orElse` ...)
    handleRequest a
    mainLoop
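The slide example elides its queue type and helper functions. A minimal runnable sketch of the same composition follows, assuming the `stm` package's `TQueue` in place of the slide's queue, `Int` requests, and `even` as a stand-in for `someCondition`; `nextRequest` is a hypothetical name for one step of the main loop.

```haskell
import Control.Concurrent.STM

-- Take the next request from the first queue; readTQueue retries when empty.
transA :: TQueue Int -> STM Int
transA = readTQueue

-- Take from the second queue, but accept only values passing the predicate.
-- On retry the dequeue is rolled back, so the rejected value stays queued.
transB :: (Int -> Bool) -> TQueue Int -> STM Int
transB ok q = do
  v <- readTQueue q
  if ok v then return v else retry

-- One step of the main loop: prefer transA, fall back to transB.
-- `even` is an assumed stand-in for the slide's someCondition.
nextRequest :: TQueue Int -> TQueue Int -> STM Int
nextRequest q1 q2 = transA q1 `orElse` transB even q2
```

Running `atomically (nextRequest q1 q2)` blocks only when both branches retry, which is the behaviour the slide relies on.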
Existing Haskell STM Implementations
Glasgow Haskell Compiler (GHC), 7.8.
Explicit transactional variables.
Object based.
Lazy value-based validation.
Existing Haskell STM Implementations
Coarse-grain Lock (STM-Coarse)
Serialize commits with a global lock.
Similar to NOrec [Dalessandro et al., 2010, Dalessandro et al., 2011, Riegel et al., 2011].
Fine-grain Locks (STM-Fine)
Lock for each TVar.
Two-phase commit.
Similar to OSTM [Fraser, 2004].
Haskell with HTM
Hybrid TM (Hybrid)
Three levels for transactions (cf. [Matveev and Shavit, 2013]).
Full transactions in hardware.
Software transaction, commit in hardware.
Full software fallback.
HLE Commit (HTM-Coarse, HTM-Fine)
Like coarse-grain and fine-grain lock STMs, but using hardware transactions to elide locks around commit.
Red-black tree performance
Last year
200 nodes, 84,000 tree operations per second on 4 threads.
Now
50,000 nodes, 24,000,000 tree operations per second on 72 threads.
Red-black tree performance
What changed?
Constant-space² metadata tracking for retry.
Lowered implementation level of TM runtime.
Avoid reevaluating expensive thunks.
Fixed PRNG.
Also improved benchmarking for more accurate measurement.
² Nearly.
Results (Intel© Xeon™ E5-2699 v3, two-socket, 36-core)
[Figure: tree operations per second (0 to 2.5·10^7) versus threads (1, 8, 18, 36, 54, 72) for HashMap, STM-Coarse, HTM-Coarse, Hybrid, STM-Fine, and HTM-Fine.]
Results (bad PRNG) (Intel© Xeon™ E5-2699 v3, two-socket, 36-core)
[Figure: tree operations per second (0 to 2.5·10^7) versus threads (1, 8, 18, 36, 54, 72) for HashMap, STM-Coarse, HTM-Coarse, Hybrid, STM-Fine, and HTM-Fine.]
Results (single socket) (Intel© Xeon™ E5-2699 v3, two-socket, 36-core)
[Figure: tree operations per second (0 to 1.5·10^7) versus threads (1 to 18) for HashMap, STM-Coarse, HTM-Coarse, Hybrid, STM-Fine, and HTM-Fine.]
Results for retry (Intel© Xeon™ E5-2699 v3, two-socket, 36-core)
[Figure: queue transfers per second (0 to 6·10^6) versus threads (axis ticks 0 to 60) for STM-Coarse, HTM-Coarse, Hybrid, STM-Fine, and HTM-Fine.]
Future Implementation Work (TStruct)
Flexible transactional variable granularity.
Unboxed mutable variables.
[Diagram: two tree-node layouts. With TStruct, each Node holds lock, hash, key, value, color, parent, left, and right in a single object. Without it, each Node holds key, value, parent, left, right, and color, and Nodes are linked through separate TVar objects holding hash and value.]
Future Implementation Work (orElse)
Supporting orElse in Hardware Transactions
t1 = atomically (a `orElse` b)
Atomically choose the second branch when the first retries.
No direct support in hardware for a partial rollback.
If the first transaction does not write to any TVars, there is nothing to roll back.
Keep a TRec while running the first transaction.
Or rewrite the first transaction to delay writes until after the choice to retry.
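The partial rollback that orElse requires can be observed from the programmer's side with the existing STM. The sketch below uses only the `stm` package; `writeThenRetry`, `readOnly`, and `choose` are hypothetical names chosen for illustration, and it shows the semantics any hardware-backed implementation would need to preserve, not how the authors implement it.

```haskell
import Control.Concurrent.STM

-- First branch: writes to the TVar, then gives up with retry.
writeThenRetry :: TVar Int -> STM Int
writeThenRetry tv = do
  writeTVar tv 99
  retry

-- Second branch: only reads.
readOnly :: TVar Int -> STM Int
readOnly = readTVar

-- The write of 99 must be invisible to the second branch and after commit.
choose :: TVar Int -> STM Int
choose tv = writeThenRetry tv `orElse` readOnly tv
```

Starting from a TVar holding 1, `atomically (choose tv)` returns 1 and leaves the TVar at 1, because the first branch's write is discarded when it retries.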
Summary
We have a much better understanding of the performance issues.
Performance on a concurrent set is competitive at scale.
Good performance for infrequent retry use cases.
HTM is a useful and flexible tool that helps performance.
We have a roadmap for future improvements.
Slides: http://goo.gl/0ZFJXJ Paper: http://goo.gl/Er29ef
Supporting retry
Existing retry Implementation
When retry is encountered, add the thread to the watch list of each TVar in the transaction’s TRec.
When a transaction commits, wake up all transactions in watch lists on TVars it writes.
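From the application's point of view, this block-and-wake behaviour looks like the following sketch. It uses only `stm` and `base`; `takeToken` and `spawnWaiter` are hypothetical names, and the watch-list mechanics described above happen inside the runtime, not in this code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.STM

-- Block (via retry) until the counter is positive, then decrement it.
takeToken :: TVar Int -> STM ()
takeToken tv = do
  n <- readTVar tv
  if n > 0 then writeTVar tv (n - 1) else retry

-- Fork a consumer that waits for a token and signals when it has one.
spawnWaiter :: TVar Int -> IO (MVar ())
spawnWaiter tv = do
  done <- newEmptyMVar
  _ <- forkIO (atomically (takeToken tv) >> putMVar done ())
  return done
```

A later `atomically (writeTVar tv 1)` from another thread commits a write to the TVar the waiter read, which is what causes the waiter to be woken and rerun.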
Supporting retry
Hardware Transactions
Replace watch lists with bloom filters for read sets.
Support read-only retry directly in HTM.
Record write-set during HTM then perform wake-ups after HTM commit.
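A Bloom filter summarises a read set or write set in a fixed number of bits, which is what makes the metadata (nearly) constant space. The sketch below is an illustration only: the 64-bit width, the two bit positions per hash, and all names (`Bloom`, `bitsFor`, `mayOverlap`) are assumptions, not details taken from the authors' runtime.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- A 64-bit Bloom filter summarising a set of TVar hashes (assumed width).
newtype Bloom = Bloom Word64 deriving (Eq, Show)

emptyBloom :: Bloom
emptyBloom = Bloom 0

-- Two bit positions derived from one hash (a hypothetical choice of k = 2).
bitsFor :: Word64 -> Word64
bitsFor h = bit1 .|. bit2
  where
    bit1 = 1 `shiftL` fromIntegral (h .&. 63)
    bit2 = 1 `shiftL` fromIntegral ((h `shiftR` 6) .&. 63)

-- Add one TVar hash to the filter.
insert :: Word64 -> Bloom -> Bloom
insert h (Bloom w) = Bloom (w .|. bitsFor h)

-- May report false positives, never false negatives.
mayContain :: Word64 -> Bloom -> Bool
mayContain h (Bloom w) = w .&. bitsFor h == bitsFor h

-- A writer should wake a blocked reader whenever the two sets may overlap.
mayOverlap :: Bloom -> Bloom -> Bool
mayOverlap (Bloom a) (Bloom b) = a .&. b /= 0
```

A false positive here costs only a spurious wake-up, while a shared element always sets common bits, so no required wake-up is missed.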
Wakeup Structure
Committed writer transactions search blocked thread read-sets in a short transaction eliding a global wakeup lock.
Committing HTM read-only retry transactions atomically insert themselves in the wakeup structure by writing the global wakeup lock inside the hardware transaction.
Releases lock when successfully blocked.
Aborts wakeup transaction (short and cheap).
Serializes HTM retry transactions (rare anyway).
Future Implementation Work (orElse)
Existing orElse Implementation
Atomic choice between transactions biased toward the first.
Nested TRecs allow for partial rollback.
If the first transaction encounters retry, throw away the writes, but merge the reads and move to the second transaction.
Haskell STM Metadata Structure
[Diagram: a TRec (prev, index, and a sequence of entries, each with tvar, old, and new), a TVar (value, watch), two Watch Queue entries (thread, next, prev), and a Node (key, value, parent, left, right, color).]
Haskell HTM Metadata Structure
[Diagram: Wakeup entries (read-set, thread, ...), an HTRec (read-set, write-set), a TVar (value, hash), and a Node (key, value, parent, left, right, color).]
Haskell Before TStruct
[Diagram: two Node objects (key, value, parent, left, right, color) with a TVar object (value, hash) between them.]
Haskell with TStruct
[Diagram: two Node objects, each holding lock, hash, key, value, color, parent, left, and right.]
Haskell STM TQueue Implementation
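{-# LANGUAGE LambdaCase #-}
import Control.Concurrent.STM (STM, TVar, modifyTVar, readTVar, retry, writeTVar)
-- A read end and a write end; the write end is kept in reverse order.
data TQueue a = TQueue (TVar [a]) (TVar [a])
-- Enqueue pushes onto the write end.
enqueue :: TQueue a -> a -> STM ()
enqueue (TQueue _ write) v = modifyTVar write (v:)
-- Dequeue pops from the read end, refilling it from the reversed write end
-- when it is empty, and blocks with retry when both ends are empty.
dequeue :: TQueue a -> STM a
dequeue (TQueue read write) =
    readTVar read >>= \case
        (v:vs) -> writeTVar read vs >> return v
        []     -> reverse <$> readTVar write >>= \case
            []     -> retry
            (v:vs) -> do writeTVar write []
                         writeTVar read vs
                         return v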
Haskell STM Implementation
Fairly standard commit protocol, but missing optimizations from more recent work.
Commit
Coarse grain: perform writes while holding the global lock.
Fine grain:
Acquire locks for writes while validating.
Check that read-only variables are still valid while holding the write locks.
Perform writes and release locks.
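The coarse-grain variant can be modelled in a few lines. This is a toy model under stated assumptions, not GHC's runtime code: variables are `IORef Int`, validation compares values with `==`, the global lock is an `MVar ()`, and `Entry`, `validate`, and `commitCoarse` are hypothetical names. The fine-grain protocol, with its per-variable locks, is not covered.

```haskell
import Control.Concurrent.MVar
import Data.IORef

-- One transactional-record entry: the variable, the value first read,
-- and the value to write at commit.
data Entry = Entry
  { var      :: IORef Int
  , expected :: Int
  , new      :: Int
  }

-- Value-based validation: every variable still holds the value we read.
validate :: [Entry] -> IO Bool
validate es = and <$> mapM (\e -> (== expected e) <$> readIORef (var e)) es

-- Coarse-grain commit: validate and write back under one global lock.
-- Returns False when validation fails and the transaction must rerun.
commitCoarse :: MVar () -> [Entry] -> IO Bool
commitCoarse lock es = withMVar lock $ \_ -> do
  ok <- validate es
  if ok
    then mapM_ (\e -> writeIORef (var e) (new e)) es >> return True
    else return False
```

Holding the single lock across both validation and write-back is what serializes commits in this model; a stale entry makes the whole commit fail without writing anything.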
Haskell STM
Broken code that we are not allowed to write!
transferBad :: TVar Int -> TVar Int -> Int -> STM ()
transferBad accountX accountY v = do
    x <- readTVar accountX
    y <- readTVar accountY
    writeTVar accountX (x + v)
    writeTVar accountY (y - v)
    if x < 0
        then launchMissiles  -- an IO action, which the STM type does not admit
        else return ()
Haskell STM
Broken code that we are not allowed to write!
thread :: IO ()
thread = do
    transfer a b 200
    transfer a c 300
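For contrast, here is a sketch of a version the type checker accepts, under one reading of the two slides: the transaction stays free of side effects and returns what the caller needs, and both transfers run inside a single `atomically`. The `Bool` result and the extra TVar parameters to `thread` are assumptions added to make the example self-contained.

```haskell
import Control.Concurrent.STM

-- Add v to accountX and subtract it from accountY.
-- Returns whether accountX was negative before the transfer.
transfer :: TVar Int -> TVar Int -> Int -> STM Bool
transfer accountX accountY v = do
  x <- readTVar accountX
  y <- readTVar accountY
  writeTVar accountX (x + v)
  writeTVar accountY (y - v)
  return (x < 0)

-- Both transfers commit as one atomic step.
thread :: TVar Int -> TVar Int -> TVar Int -> IO Bool
thread a b c = do
  wasNegative <- atomically $ do
    n1 <- transfer a b 200
    n2 <- transfer a c 300
    return (n1 || n2)
  -- Any irreversible IO action belongs here, after the commit.
  return wasNegative
```

Because the decision is returned from the transaction rather than acted on inside it, a rerun of the transaction cannot repeat a side effect.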
C ABI vs Cmm ABI
GHC’s runtime support for STM is written in C.
Code is generated in Cmm, and calls into the runtime are essentially foreign calls with significant extra overhead.
We avoid this by writing the HTM support in Cmm.
Typeclass machinery could allow deeper code specialization.
Lazy Evaluation
Lazy evaluation may lead to false conflicts due to the update step that writes back the fully evaluated value.
One solution could be to delay performing updates (to shared values) until after a transaction commits.
Races here are fine as any update must represent the same value.
References
[Dalessandro et al., 2011] Dalessandro, L., Carouge, F., White, S., Lev, Y., Moir, M., Scott, M. L., and Spear, M. F. (2011). Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In Proc. of the 16th Intl. Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 39–52, Newport Beach, CA.
[Dalessandro et al., 2010] Dalessandro, L., Spear, M. F., and Scott, M. L. (2010). NOrec: Streamlining STM by abolishing ownership records. In Proc. of the 15th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP), pages 67–78, Bangalore, India.
[Fraser, 2004] Fraser, K. (2004). Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory.
[Matveev and Shavit, 2013] Matveev, A. and Shavit, N. (2013). Reduced hardware NOrec. In 5th Workshop on the Theory of Transactional Memory (WTTM), Jerusalem, Israel.
[Riegel et al., 2011] Riegel, T., Marlier, P., Nowack, M., Felber, P., and Fetzer, C. (2011). In Proc. of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 53–64, San Jose, CA.