thread-level speculation on off-the-shelf hardware ... - ibm · © 2014 ibm corporation...
TRANSCRIPT
© 2014 IBM Corporation
Thread-Level Speculation onOff-the-Shelf HardwareTransactional Memory
Rei OdairaTakuya Nakaike
IBM Research – Tokyo
© 2014 IBM Corporation2
IBM Research - Tokyo
Thread-Level Speculation (TLS) [Franklin et al.,’92]or Speculative Multithreading (SpMT)
� Speculatively parallelize a sequential program into a multithreaded program.
� What is parallelization?
– To find data-independent tasks from a program.
� Why speculation?
– Because a compiler cannot detect every data dependence.
Sequentialexecution
TLSexecutionw/ 3 threads
Task Task Task
© 2014 IBM Corporation3
IBM Research - Tokyo
Runtime Requirements for TLS
� With TLS:
– Compiler finds probably data-independent tasks.
– Runtime guarantees data independence among tasks.
� (Minimum) runtime requirements for TLS
– Data dependence (= conflict) detection among tasks
– Execution rollback at a conflict
– Ordered commit of tasks
TLSexecutionw/ 3 threads
Conflict
Rollback
Ordered commit
© 2014 IBM Corporation4
IBM Research - Tokyo
Hardware Transactional Memory (HTM) Coming into the Market
� HTM supports…
– Conflict detection among transactions
– Execution rollback at a conflict
� HTM satisfies 2/3 of the runtime requirements for TLS!
– Task = transaction
Blue Gene/Q zEC12 POWER8
4th GenerationCore Processor
(Haswell)
© 2014 IBM Corporation5
IBM Research - Tokyo
Our Goal
� How well can TLS improve the performance on real HTM hardware?
– Used Intel 4th Generation Core Processor(Intel TSX).
– Manually modified and measured SPEC CPU2006.
© 2014 IBM Corporation6
IBM Research - Tokyo
Our True Goal
� How poorly can TLS improve the performance on real HTM hardware?
� Because proposed TLS systems had advanced hardware support.
– E.g. ordered transactions, data forwarding, etc.
� Blue Gene/Q is the only real system supporting advanced hardware for TLS.
– Ordered transactions
© 2014 IBM Corporation7
IBM Research - Tokyo
Our True Goal
� How poorly can TLS improve the performance on real HTM hardware?
� What kind of hardware support should be implemented next in the off-the-shelf HTM?
© 2014 IBM Corporation8
IBM Research - Tokyo
Transactional Memory
� At programming/compile time
– Enclose critical sections with transaction begin/end operations.
� At execution time
– Memory operations within a transaction observed as one step by other threads.
– Multiple transactions executed in parallel as long as their memory operations do not conflict. xbegin();
a->count++;xend();
xbegin();b->count++;xend();
xbegin();a->count++;xend();
xbegin();a->count++;xend();
Thread X
Thread Y
xbegin();a->count++;xend();
© 2014 IBM Corporation9
IBM Research - Tokyo
HTM� Instruction set (Intel TSX)
– XBEGIN: Begin a transaction
– XEND: End a transaction
– XABORT, etc.
� Micro-architecture
– Read and write sets held in CPU caches
– Conflict detection using CPU cache coherence protocol� Conflict detection by cache line granularity
– Rollback by discarding write set and restoring registers
� Abort reasons:
– Read set and write set conflict
– Read set and write set overflow
– External interruptions, etc.
XBEGIN abort_handler......XEND
abort_handler:...
© 2014 IBM Corporation10
IBM Research - Tokyo
TLS for Loops
� We focus on frequently executed loops.
– Task = iteration(s) = transaction
� Why not parallelize function calls?
– Difficult to implement TLS for function calls on HTM. (Refer to the paper for the details.)
Iteration 1
Iteration 2
Iteration 3
Iteration 1 Iteration 2 Iteration 3Sequentialexecution
TLSexecutionw/ 3 threads
© 2014 IBM Corporation11
IBM Research - Tokyo
TLS on HTM
� Enclose each iteration with XBEGIN and XEND.
� Re-execute iteration in case of abort.
Iteration 1
XB
EG
IN
XE
ND
Iteration 2
XB
EG
IN
XE
ND
Iteration 3
XB
EG
IN
Iteration 3re-execution
XB
EG
IN
XE
ND
Conflict
© 2014 IBM Corporation12
IBM Research - Tokyo
Ordered Transactions
� Must commit in the same order as sequential execution.
– Because data independence can be guaranteed only after all of the preceding iterations have committed.
Iteration 1
XB
EG
IN
XE
NDIteration 2
XB
EG
IN
XE
ND
Iteration 3
XB
EG
IN Commit orderinversion
© 2014 IBM Corporation13
IBM Research - Tokyo
Ordered Transactions by Software� Hardware support by proposed TLS systems
– Wait until the preceding iterations commit.
� Software implementation by checking commit order
– Use a global variable to indicate the next iteration to commit.
– Abort if cannot commit.
Iteration 1
XB
EG
IN
XE
ND
Iteration 2
XB
EG
IN
XE
ND
Iteration 3
XB
EG
IN
Cancommit?
Cancommit?
Cancommit?
Iteration 3re-executionX
BE
GIN
XE
ND
Cancommit?
© 2014 IBM Corporation14
IBM Research - Tokyo
Ordered Transactions by Software� Hardware support by proposed TLS systems
– Wait until the preceding iterations commit.
� Software implementation by checking commit order
– Use a global variable to indicate the next iteration to commit.
– Abort if cannot commit.
Iteration 1
XB
EG
IN
XE
ND
Iteration 2
XB
EG
IN
XE
ND
Iteration 3
XB
EG
IN
Cancommit?
Cancommit?
Cancommit?
Iteration 3re-executionX
BE
GIN
XE
ND
Cancommit?
Why not spin-wait?Refer to our paper….
© 2014 IBM Corporation15
IBM Research - Tokyo
Our Goal
� How poorly can TLS improve the performance on real HTM hardware?
� What kind of hardware support should be implemented next in the off-the-shelf HTM?
– Will hardware support for ordered transactions really help?
© 2014 IBM Corporation16
IBM Research - Tokyo
False Sharing due to Cache-Line Granularity Conflict Detection
double array[];...for (int i = ...; i < ...; i++) {
...array[i] = ...;...
}
Cache line = 64 bytes on x86
Writes by Thread 1
Writes by Thread 2
Writes by Thread 3
array[]
TLS
© 2014 IBM Corporation17
IBM Research - Tokyo
Transaction Coarsening to Avoid False Sharing
Iteration 1X
BE
GIN
XE
ND
Iteration 2 Iteration 8…
Iteration 9
XB
EG
IN
XE
ND
Iteration 10 Iteration 16…
Iteration 17
XB
EG
IN
XE
ND
Iteration 18 Iteration 24…
array[]
Writes by Thread 1Writes by Thread 3 Writes by Thread 2
© 2014 IBM Corporation18
IBM Research - Tokyo
Benchmarks and Methodology
� SPEC CPU2006
– 6 benchmarks showing more than 1.5-fold speedups with 4 threads in a previous TLS study [Packirisamy et al., 2009]
– 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3
� Manually modified frequently executed loops.
– Inserted XBEGIN, XEND, and commit order checks.
– Transformed a target loop into a doubly-nested loop for transaction coarsening
� Experimental environment
– Core i7-4770 processor (4 cores, 2-way SMT)
– 4-GB memory
– Linux 2.6.32-431 / GCC 4.9.0
© 2014 IBM Corporation19
IBM Research - Tokyo
Normalized Throughput Results
� Up to 11% speedups with 2 or 4 threads.
� But mostly degraded the throughput.
0
0.5
1
1.5
0 1 2 3 4 5 6 7 8 9Number of software threads
Thr
ough
put
(1 =
seq
uent
ial)
429.mcf 433.milc456.hmmer 464.h264ref470.lbm 482.sphinx3
SMT enabled
Higher is better
© 2014 IBM Corporation20
IBM Research - Tokyo
0
20
40
60
80
100
1 2 3 4 5 6 7 8 9
Number of software threads
Abo
rt r
atio
(%
)
Total Order inversionOverflow ConflictOther
433.milc
� Parallel program
– Loop coverage: 23%
� Commit order inversion is a dominant abort reason.� Hardware support for ordered transactions will help.
0
0.5
1
1.5
0 1 2 3 4 5 6 7 8 9Number of software threads
Thr
ough
put
(1 =
seq
uent
ial)
© 2014 IBM Corporation21
IBM Research - Tokyo
429.mcf
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Abo
rt r
atio
(%
)
Total
Order inversion
Buffer overflow
Conflict
Other
456.hmmer
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Ab
ort r
atio
(%
)
Total
Order inversion
Buffer overflow
Conflict
Other
Abort Statistics (1/2)433.milc
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Ab
ort r
atio
(%
)
Total
Order inv
Buffer ove
Conflict
Other
� Conflicts were a dominant abort reason in all of the benchmarks except 433.milc.
© 2014 IBM Corporation22
IBM Research - Tokyo
Abort Statistics (2/2)464.h264ref
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Ab
ort r
atio
(%
)
Total
Order inversion
Buffer overflow
Conflict
Other
482.sphinx3
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Ab
ort r
atio
(%
)
Total
Order inversion
Buffer overflow
Conflict
Other
470.lbm
0
20
40
60
80
100
120
1 2 3 4 5 6 7 8 9
Number of software threads
Ab
ort r
atio
(%
)
Total
Order inv
Buffer ove
Conflict
Other
� Conflicts were a dominant abort reason in all of the benchmarks except 433.milc.
© 2014 IBM Corporation23
IBM Research - Tokyo
Reasons for Conflicts and Possible Hardware Support
Multi-version cacheWAR dependence464.h264ref
(Fix in prefetcher)WAW dependence (false sharing by prefetching)
(Fix in prefetcher)WAW dependence (false sharing by prefetching)
482.sphinx3
Word-level conflict detection
WAW dependence (false sharing)
470.lbm
Data forwardingRAW dependence456.hmmer
No433.milc
Data forwardingRAW dependence429.mcf
Possible hardware support
Conflict reasonBenchmark
© 2014 IBM Corporation24
IBM Research - Tokyo
Examples of Read-After-Write Data Dependence
� Hardware support already proposed in TLS literatures.
– Data forwarding.
429.mcf 456.hmmerstatic int size;static DATA array[N];func() {
...for (...) {
...if (...) {
size++;array[size]->field = ...;
}}...
}
for (k = 1; k <= M; k++) {...dc[k] = dc[k-1] + ...;...
}
© 2014 IBM Corporation25
IBM Research - Tokyo
Example of Write-After-Read Data Dependence
� Difficult to analyze by a compiler.
– WAR dependence across different functions in different source files.
� Multi-version caches needed.
464.h264ref
for (...) {...line = func();... = line[0];...
}
static DATA line[N];
DATA *func() {...line[0] = ...;...return line;
}
© 2014 IBM Corporation26
IBM Research - Tokyo
Conflicts Precede Commit Order Inversion
� Commit order matters only when most of the transactions reach the committing points.
� With data dependence, most of the transactions cannot run to the end.
Iteration 1
XB
EG
IN
XE
ND
Iteration 2
XB
EG
IN
XE
ND
Iteration 3
XB
EG
IN
ConflictCommit orderinversion
© 2014 IBM Corporation27
IBM Research - Tokyo
Conflicts due to Prefetching
� Even with transaction coarsening, conflicts still happened.
– 464.h264ref and 482.sphinx3.
� Prefetched adjacent cache lines caused conflicts.
64 bytes 64 bytes 64 bytes
Writes by Thread 1
Writes by Thread 2
Prefetch
Prefetch
Conflict
© 2014 IBM Corporation28
IBM Research - Tokyo
Conclusion
� How well can TLS improve the performance on real HTM hardware?
– Up to 11% speedups with 4 threads in SPEC CPU2006 on 4th Generation Core Processor.
– But degraded throughput in most cases.
� What kind of hardware support should be implemented next in the off-the-shelf HTM?
– Hardware support for ordered transactions will help in parallel programs.
– However, many programs contain data dependence.� Not only ordered transactions, but also other hardware facilities to avoid conflicts should be implemented.
– (Intel should fix the adjacent cache line prefetcher!)