An Integrated Hardware-Software Approach to Transactional Memory
Sean Lie
6.895 Theory of Parallel Systems
Monday December 8th, 2003
Transactional Memory
Locks:
    if (i < j) {
        a = i; b = j;
    } else {
        a = j; b = i;
    }
    Lock(L[a]); Lock(L[b]);
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    Unlock(L[b]); Unlock(L[a]);
Transactional Memory:
    StartTransaction;
    Flow[i] = Flow[i] - X;
    Flow[j] = Flow[j] + X;
    EndTransaction;
Transactional memory provides atomicity without the problems associated with locks.
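The lock-based version of the slide's example can be sketched in runnable form. This is only an illustration: the network size, the initial flow values, and the name push_with_locks are assumptions, not part of the original talk. It shows why the index ordering is needed: both locks are always taken lowest index first, so two opposing pushes cannot deadlock.

```python
import threading

NUM_NODES = 4                          # hypothetical network size
flow = [100] * NUM_NODES               # Flow[] from the slide (values assumed)
locks = [threading.Lock() for _ in range(NUM_NODES)]   # L[] from the slide

def push_with_locks(i, j, x):
    """Move x units of flow from node i to node j atomically."""
    if i == j:
        return                         # nothing to move; avoids locking one lock twice
    # Always acquire the lower-indexed lock first to rule out deadlock.
    a, b = (i, j) if i < j else (j, i)
    with locks[a]:
        with locks[b]:
            flow[i] -= x
            flow[j] += x
```

With transactional memory, the ordering logic and the lock array disappear; only the two updates remain between StartTransaction and EndTransaction.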
I propose an integrated hardware-software approach to transactional memory.
Hardware Transactional Memory
HTM: Transactional memory can be implemented in hardware using the cache and cache coherency mechanism. [Herlihy & Moss]
Uncommitted transactional data is stored in the cache.
Transactional data is marked in the cache using an additional bit per cache line.
HTM has very low overhead but has size and length limitations.
[Figure: cache lines with Trans, Tag, Status, and Data columns; Trans bits shown: 1, 0, 1, 0.]
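A toy model may help make the per-line transactional bit concrete. It is a sketch under simplifying assumptions: the class and method names are hypothetical, there is no coherence protocol or conflict detection, and committed data is written back immediately, which real hardware need not do.

```python
class ToyTransactionalCache:
    """Toy model: one entry per cache line, each carrying a 'trans' bit."""

    def __init__(self, memory):
        self.memory = memory      # backing store: address -> value
        self.lines = {}           # address -> {"data": value, "trans": bool}

    def tx_store(self, addr, value):
        # Uncommitted data lives only in the cache, marked by the trans bit.
        self.lines[addr] = {"data": value, "trans": True}

    def load(self, addr):
        line = self.lines.get(addr)
        return line["data"] if line else self.memory[addr]

    def commit(self):
        # Clear the trans bits; this toy also writes the data back to memory.
        for addr, line in self.lines.items():
            if line["trans"]:
                self.memory[addr] = line["data"]
                line["trans"] = False

    def abort(self):
        # Dropping the marked lines discards every uncommitted update.
        self.lines = {a: l for a, l in self.lines.items() if not l["trans"]}
```

Because abort only has to drop marked lines, the overhead is low, but everything a transaction touches must fit in the cache at once.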
Software Transactional Memory
STM: Transactional memory can be implemented in software using compiler and library support.
Uncommitted transactional data is stored in a copy of the object.
Transactional data is marked by flagging the object field.
STM does not have size or length limitations but has high overhead.
[Figure: FLEX Software Transaction System. An original object (fields 248, ‘a’, FLAG, 77, “Mass. Ave.”) with a transactional version and a readers list.]
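The object layout can be sketched roughly as follows. This is a simplified illustration, not the FLEX implementation: the class name, method names, and field names are hypothetical, FLAG is modeled as a Python sentinel, and abort handling and conflict detection are omitted.

```python
FLAG = object()   # sentinel standing in for the reserved FLAG field value

class StmObject:
    def __init__(self, **fields):
        self.fields = dict(fields)    # the original object
        self.trans_version = None     # copy that holds uncommitted writes
        self.readers = []             # transactions that have read this object

    def tx_write(self, name, value):
        if self.trans_version is None:
            self.trans_version = dict(self.fields)   # copy the object on first write
        self.trans_version[name] = value
        self.fields[name] = FLAG      # flag the field in the original object

    def tx_read(self, tx_id, name):
        if tx_id not in self.readers:
            self.readers.append(tx_id)
        value = self.fields[name]
        # A flagged field means the live value is in the transactional version.
        return self.trans_version[name] if value is FLAG else value

    def commit(self):
        if self.trans_version is not None:
            self.fields = self.trans_version
            self.trans_version = None
        self.readers.clear()
```

Every access goes through extra indirection and bookkeeping, which is where the high overhead comes from, but nothing here depends on cache size.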
Results
An integrated approach gives the best of both worlds.
Common case: HTM mode - Small/short transactions run fast.
Uncommon case: STM mode - Large/long transactions are slower but possible.
An integrated hardware-software transactional memory system was implemented and evaluated.
HTM was implemented in the UVSIM software simulator.
A subset of STM functionality was implemented for the benchmark applications.
HTM was modified to be software-compatible.
Hardware vs. Software
HTM has much lower overhead than STM.
A network flow µbenchmark (node-push) was implemented for evaluating overhead.
However, HTM has 2 serious limitations.
Cycles (% of Base), 1 processor overheads:

Atomicity Mechanism    “Worst”    “More Realistic”
Locks                  505%       136%
HTM                    153%       104%
STM                    1879%      206%

“Worst” case: Back-to-back small transactions
“More realistic” case: Some processing between small transactions
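To put the table in relative terms, the ratios against HTM can be computed directly from the reported percentages. The dictionary and function names below are illustrative; the numbers are copied from the table.

```python
# Cycles as a percentage of the base run: ("worst", "more realistic").
overhead = {
    "Locks": (505, 136),
    "HTM": (153, 104),
    "STM": (1879, 206),
}

def slowdown_vs_htm(mechanism):
    """Return (worst, more_realistic) cycle ratios relative to HTM."""
    worst, realistic = overhead[mechanism]
    htm_worst, htm_realistic = overhead["HTM"]
    return worst / htm_worst, realistic / htm_realistic
```

By this measure STM costs roughly 12x the cycles of HTM in the “worst” case and roughly 2x in the “more realistic” case, while locks cost about 3.3x and 1.3x.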
Hardware Limitation: Cache Capacity
HTM uses the cache to hold all transactional data.
Therefore, HTM aborts transactions larger than the cache.
Restricting transaction size is awkward and not modular.
Size will depend on associativity, block size, etc. in addition to cache size.
Cache configurations change from processor to processor.
[Figure: cache lines with Trans, Tag, Status, and Data columns; every Trans bit is 1.]
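The dependence on associativity and block size can be illustrated with a small check. It assumes a conventional set-associative cache in which a block maps to set (block number mod number of sets); the function name and the parameters are hypothetical and do not describe the simulated cache.

```python
def fits_in_cache(addresses, num_sets, ways, block_size):
    """Return True if every distinct block touched can be cached at once."""
    blocks_per_set = {}
    for block in {addr // block_size for addr in addresses}:
        s = block % num_sets                       # set this block maps to
        blocks_per_set[s] = blocks_per_set.get(s, 0) + 1
        if blocks_per_set[s] > ways:               # the set overflows
            return False
    return True
```

For a 512-byte cache with 4 sets, 2 ways, and 64-byte blocks, eight contiguous blocks fit, but the three addresses 0, 256, and 512 all map to set 0 and overflow it. A transaction can therefore abort well below the nominal cache size, and the limit moves with each cache configuration.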
Hardware Limitation: Context Switches
The cache is the only transactional buffer for all threads.
Therefore, HTM aborts transactions on context switches.
Restricting context switches is awkward and not modular.
Context switches occur regularly in modern systems (e.g., TLB exceptions).
[Figure: a processor running Thread a and Thread b over a single cache that holds transactional data for Thread a.]
HSTM: An Integrated Approach
Transactions are switched from HTM to STM when necessary.
When a transaction aborts in HTM, it is restarted in STM.
HTM is modified to be software-compatible.
[Figure: HTM cache lines (Trans, Tag, Status, Data) alongside an STM object with a readers list and fields 248, ‘a’, FLAG, 77, “Mass. Ave.”.]
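The switch from HTM to STM can be sketched as a simple fallback. Everything here is a stand-in: HtmAbort, HTM_CAPACITY, and the three functions are hypothetical names, the "hardware" path is modeled only by a size limit, and the "software" path is just a plain update with no size limit.

```python
HTM_CAPACITY = 4   # hypothetical number of lines the toy "cache" can hold

class HtmAbort(Exception):
    """Raised when the toy hardware path cannot complete the transaction."""

def run_in_htm(writes, memory):
    if len(writes) > HTM_CAPACITY:
        raise HtmAbort("transaction does not fit in the cache")
    memory.update(writes)

def run_in_stm(writes, memory):
    memory.update(writes)          # stand-in for the slower software path

def run_hstm(writes, memory):
    """Try the transaction in HTM first; restart it in STM if HTM aborts."""
    try:
        run_in_htm(writes, memory)
        return "HTM"
    except HtmAbort:
        run_in_stm(writes, memory)
        return "STM"
```

The aborted HTM attempt leaves memory untouched in this sketch, so restarting in STM simply replays the whole transaction.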
Software-Compatible HTM
1 new instruction: xAbort
Additional checks are performed in software:
On loads, check if the memory location is set to FLAG.
On stores, check if there is a readers list.
Software checks are slow.
Performing the checks adds a 2.2x performance overhead over pure HTM in the “worst” case (1.1x in the “more realistic” case).
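[Figure: flowchart of the software checks on an object with a readers list. Load check: is the location set to FLAG? Store check: is the readers list null? Depending on the answer, the access proceeds as a normal memory operation or the transaction aborts.]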
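The two software checks can be sketched as wrappers around loads and stores. The sketch assumes that a flagged location (on a load) or an existing readers list (on a store) triggers the abort, which is how the text reads. The exception XAbort stands in for the xAbort instruction, and the dictionary-based object layout is a hypothetical simplification.

```python
FLAG = object()   # sentinel standing in for the reserved FLAG value

class XAbort(Exception):
    """Stands in for the xAbort instruction: abort the hardware transaction."""

def checked_load(obj, field):
    value = obj["fields"][field]
    if value is FLAG:                  # location is owned by a software transaction
        raise XAbort(field)
    return value                       # otherwise a normal memory operation

def checked_store(obj, field, value):
    if obj["readers"] is not None:     # a software transaction has read this object
        raise XAbort(field)
    obj["fields"][field] = value       # otherwise a normal memory operation
```

These extra comparisons run on every transactional access, which is consistent with the overhead reported for software-compatible HTM.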
Overcoming Size Limitations
The node-push benchmark was modified to touch more nodes to evaluate size limitations.
HSTM uses HTM when possible and STM when necessary.
[Figure: cycles per node vs. transaction size (number of nodes x 100) for HTM and STM, with the point where HTM transactions stop fitting marked.]
[Figure: cycles per node vs. transaction size (number of nodes x 100) for HTM, STM, and HSTM, with the point where HTM transactions stop fitting marked.]
Overcoming Context Switching Limitations
Context switches occur on TLB exceptions.
The node-push benchmark was modified to choose from a larger set of nodes. More nodes means a higher probability of a TLB miss (Pabort).
HSTM behaves like HTM when Pabort is low and like STM when Pabort is high.
[Figure: average cycles per iteration (left axis) and average TLB aborts per HTM transaction (right axis) vs. number of nodes (x1000), for STM and HTM.]
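One way to see why HSTM tracks HTM at low Pabort and STM at high Pabort is a simple expected-cost model. This model is an assumption made for illustration, not something taken from the talk: each transaction gets one HTM attempt, which aborts with probability p_abort, and the aborted attempt wastes some cycles before the transaction reruns in STM.

```python
def expected_hstm_cycles(p_abort, htm_cycles, stm_cycles, wasted_cycles):
    """Expected cycles per transaction under a one-attempt fallback model."""
    if not 0.0 <= p_abort <= 1.0:
        raise ValueError("p_abort must be a probability")
    success_cost = htm_cycles                    # HTM attempt commits
    fallback_cost = wasted_cycles + stm_cycles   # HTM aborts, rerun in STM
    return (1 - p_abort) * success_cost + p_abort * fallback_cost
```

At p_abort = 0 the cost equals the HTM cost, and at p_abort = 1 it equals the STM cost plus the wasted attempt, with a linear blend in between.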
[Figure: average cycles per iteration vs. number of nodes (x1000) for STM, HTM, and HSTM.]
Conclusions
An integrated approach gives the best of both worlds.
Common case: HTM mode - Small/short transactions run fast.
Uncommon case: STM mode - Large/long transactions are slower but possible.
Trade-offs:
“STM mode” is not as fast as pure STM. This is acceptable since it is uncommon.
“HTM mode” is not as fast as pure HTM. Is this acceptable?
Future Work
Full implementation of STM in UVSIM
Integration of software-compatible HTM into the FLEX compiler
Evaluate how software-compatible HTM performs for parallel applications
Should software-compatible modifications be moved into hardware?
Can a transaction be transferred from hardware to software during execution?