Page 1: Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory

Rei Odaira, Takuya Nakaike
IBM Research – Tokyo
© 2014 IBM Corporation

Page 2: Thread-Level Speculation (TLS) [Franklin et al., '92] or Speculative Multithreading (SpMT)

• Speculatively parallelize a sequential program into a multithreaded program.
• What is parallelization?
  – Finding data-independent tasks in a program.
• Why speculation?
  – Because a compiler cannot detect every data dependence.

[Figure: three tasks under sequential execution vs. TLS execution w/ 3 threads]

Page 3: Runtime Requirements for TLS

• With TLS:
  – Compiler finds tasks that are probably data-independent.
  – Runtime guarantees data independence among tasks.
• (Minimum) runtime requirements for TLS
  – Data dependence (= conflict) detection among tasks
  – Execution rollback at a conflict
  – Ordered commit of tasks

[Figure: TLS execution w/ 3 threads, showing a conflict, a rollback, and an ordered commit]

Page 4: Hardware Transactional Memory (HTM) Coming into the Market

• HTM supports…
  – Conflict detection among transactions
  – Execution rollback at a conflict
• HTM satisfies 2 of the 3 runtime requirements for TLS!
  – Task = transaction

[Figure: systems with HTM: Blue Gene/Q, zEC12, POWER8, 4th Generation Core Processor (Haswell)]

Page 5: Our Goal

• How well can TLS improve the performance on real HTM hardware?
  – Used Intel 4th Generation Core Processor (Intel TSX).
  – Manually modified and measured SPEC CPU2006.

Page 6: Our True Goal

• How poorly can TLS improve the performance on real HTM hardware?
• Because previously proposed TLS systems assumed advanced hardware support.
  – E.g., ordered transactions, data forwarding, etc.
• Blue Gene/Q is the only real system supporting advanced hardware for TLS.
  – Ordered transactions

Page 7: Our True Goal

• How poorly can TLS improve the performance on real HTM hardware?
• What kind of hardware support should be implemented next in the off-the-shelf HTM?

Page 8: Transactional Memory

• At programming/compile time
  – Enclose critical sections with transaction begin/end operations.
• At execution time
  – Memory operations within a transaction are observed as one step by other threads.
  – Multiple transactions are executed in parallel as long as their memory operations do not conflict.

[Figure: Thread X and Thread Y each run transactions such as xbegin(); a->count++; xend(); and xbegin(); b->count++; xend();]

Page 9: HTM

• Instruction set (Intel TSX)
  – XBEGIN: Begin a transaction
  – XEND: End a transaction
  – XABORT, etc.
• Micro-architecture
  – Read and write sets held in CPU caches
  – Conflict detection using CPU cache coherence protocol
    • Conflict detection by cache line granularity
  – Rollback by discarding write set and restoring registers
• Abort reasons:
  – Read set and write set conflict
  – Read set and write set overflow
  – External interruptions, etc.

    XBEGIN abort_handler
    ...
    XEND
    ...
abort_handler:
    ...

Page 10: TLS for Loops

• We focus on frequently executed loops.
  – Task = iteration(s) = transaction
• Why not parallelize function calls?
  – Difficult to implement TLS for function calls on HTM. (Refer to the paper for the details.)

[Figure: iterations 1–3 under sequential execution vs. TLS execution w/ 3 threads]

Page 11: TLS on HTM

• Enclose each iteration with XBEGIN and XEND.
• Re-execute the iteration in case of abort.

[Figure: iterations 1–3 each run between XBEGIN and XEND; iteration 3 hits a conflict and is re-executed]

Page 12: Ordered Transactions

• Must commit in the same order as sequential execution.
  – Because data independence can be guaranteed only after all of the preceding iterations have committed.

[Figure: iterations 1–3 run as transactions on 3 threads; iteration 3 shows a commit order inversion]

Page 14: Ordered Transactions by Software

• Hardware support by proposed TLS systems
  – Wait until the preceding iterations commit.
• Software implementation by checking commit order
  – Use a global variable to indicate the next iteration to commit.
  – Abort if the transaction cannot commit yet.
• Why not spin-wait? Refer to our paper.

[Figure: each iteration checks "Can commit?" before XEND; iteration 3 fails the check and is re-executed]
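The software commit-order check can be sketched without any HTM instructions. The block below is a single-threaded simulation, not the authors' implementation: `tls_state`, `try_commit`, and `run_schedule` are hypothetical names, a `false` return stands in for an abort, and `shared_sum` stands in for memory written by committed iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITERS 64  /* arbitrary limit for this sketch */

/* Hypothetical state; next_to_commit plays the role of the global variable
 * that each transaction reads just before it tries to commit. */
typedef struct {
    size_t next_to_commit;  /* next iteration allowed to commit */
    long   shared_sum;      /* stand-in for memory updated by committed work */
} tls_state;

/* Commit-order check for iteration `iter`. Returning false models an abort:
 * the buffered update is discarded and the iteration must run again. */
static bool try_commit(tls_state *s, size_t iter, long buffered_delta)
{
    if (s->next_to_commit != iter)
        return false;                /* a preceding iteration has not committed */
    s->shared_sum += buffered_delta; /* buffered writes become visible */
    s->next_to_commit = iter + 1;    /* the successor may now commit */
    return true;
}

/* Replays iterations reaching their commit point in `arrival` order,
 * re-executing aborted ones, and returns the number of aborts.
 * `arrival` must be a permutation of 0..n-1. */
static size_t run_schedule(tls_state *s, const size_t *arrival, size_t n)
{
    bool done[MAX_ITERS] = { false };
    size_t committed = 0, aborts = 0;

    assert(n <= MAX_ITERS && s->next_to_commit == 0);
    while (committed < n) {
        for (size_t k = 0; k < n; k++) {
            size_t iter = arrival[k];
            assert(iter < n);
            if (done[iter])
                continue;
            if (try_commit(s, iter, (long)iter + 1)) {
                done[iter] = true;
                committed++;
            } else {
                aborts++;            /* order inversion: retry on the next pass */
            }
        }
    }
    return aborts;
}
```

If iteration 2 reaches its commit point before iterations 0 and 1, it aborts once and commits on the retry, so every abort counted here is pure ordering overhead rather than a data conflict.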

Page 15: Our Goal

• How poorly can TLS improve the performance on real HTM hardware?
• What kind of hardware support should be implemented next in the off-the-shelf HTM?
  – Will hardware support for ordered transactions really help?

Page 16: False Sharing due to Cache-Line Granularity Conflict Detection

    double array[];
    ...
    for (int i = ...; i < ...; i++) {
        ...
        array[i] = ...;
        ...
    }

• Cache line = 64 bytes on x86

[Figure: under TLS, writes by Threads 1, 2, and 3 land on neighboring elements of array[]]
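The arithmetic behind the false sharing is small enough to check directly. This sketch assumes 8-byte doubles and, hypothetically, that `array[0]` starts on a cache-line boundary; `line_of` and `falsely_shared` are illustrative names, not part of the original work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* x86 cache line size from the slide */

/* Cache line index of array[i], counted from the start of the array.
 * Assumes (hypothetically) that array[0] sits on a cache-line boundary. */
static size_t line_of(size_t i)
{
    return (i * sizeof(double)) / CACHE_LINE_BYTES;
}

/* Two different iterations write different elements, yet HTM reports a
 * conflict whenever both elements live in the same cache line. */
static bool falsely_shared(size_t i, size_t j)
{
    return i != j && line_of(i) == line_of(j);
}

/* Usage: with 8-byte doubles, elements 0..7 share a line; 8 starts the next. */
static void check_false_sharing(void)
{
    assert(sizeof(double) == 8);
    assert(falsely_shared(0, 7));
    assert(!falsely_shared(7, 8));
}
```

With one iteration per transaction, up to 8 consecutive iterations write the same line, so transactions on different threads conflict even though no element is touched twice.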

Page 17: Transaction Coarsening to Avoid False Sharing

[Figure: iterations 1–8, 9–16, and 17–24 are each enclosed in a single XBEGIN/XEND pair, so the writes by Threads 1, 2, and 3 fall in separate regions of array[]]
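A minimal sketch of the coarsened loop shape, assuming 8 iterations per transaction (one 64-byte line of doubles). `TX_BEGIN` and `TX_END` are hypothetical placeholders for XBEGIN, XEND, and the commit-order check; they are no-ops here so the sketch runs on any machine, and the loop body is an arbitrary stand-in.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 8   /* iterations per transaction: 64-byte line / 8-byte double */

/* Placeholders for the real transaction boundaries. Only a counter is kept
 * so the number of transactions can be inspected. */
static size_t tx_count;
#define TX_BEGIN() (tx_count++)
#define TX_END()   ((void)0)

/* The original single loop over i becomes a doubly-nested loop: the outer
 * loop walks chunks (one transaction each), the inner loop runs the
 * original body for the iterations of that chunk. */
static void fill_coarsened(double *array, size_t n)
{
    assert(array != NULL || n == 0);
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (base + CHUNK < n) ? base + CHUNK : n;
        TX_BEGIN();
        for (size_t i = base; i < end; i++) {
            array[i] = 2.0 * (double)i;   /* stand-in for the loop body */
        }
        TX_END();
    }
}
```

In a TLS run the outer iterations would be handed to different threads; this sequential version only shows the transformation and that it preserves the loop's result.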

Page 18: Benchmarks and Methodology

• SPEC CPU2006
  – 6 benchmarks showing more than 1.5-fold speedups with 4 threads in a previous TLS study [Packirisamy et al., 2009]
  – 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3
• Manually modified frequently executed loops.
  – Inserted XBEGIN, XEND, and commit order checks.
  – Transformed a target loop into a doubly-nested loop for transaction coarsening.
• Experimental environment
  – Core i7-4770 processor (4 cores, 2-way SMT)
  – 4-GB memory
  – Linux 2.6.32-431 / GCC 4.9.0

Page 19: Normalized Throughput Results

• Up to 11% speedups with 2 or 4 threads.
• But mostly degraded the throughput.

[Figure: throughput (1 = sequential, higher is better) vs. number of software threads for 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3, with an "SMT enabled" region marked]

Page 20: 433.milc

• Parallel program
  – Loop coverage: 23%
• Commit order inversion is a dominant abort reason.
  • Hardware support for ordered transactions will help.

[Figure: abort ratio (%) vs. number of software threads, broken down into Total, Order inversion, Overflow, Conflict, and Other]
[Figure: throughput (1 = sequential) vs. number of software threads]
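The 23% loop coverage caps what any TLS scheme can gain on 433.milc. As a rough check, the sketch below applies Amdahl's law, under the assumption (not stated on the slide) that "loop coverage" means the share of sequential run time spent in the parallelized loops; `amdahl_bound` is a hypothetical helper name.

```c
#include <assert.h>

/* Amdahl's law: upper bound on whole-program speedup when a fraction
 * `coverage` of sequential run time is spread perfectly over `threads`
 * and the rest stays sequential. */
static double amdahl_bound(double coverage, int threads)
{
    assert(coverage >= 0.0 && coverage <= 1.0 && threads >= 1);
    return 1.0 / ((1.0 - coverage) + coverage / (double)threads);
}
```

Under that assumption, `amdahl_bound(0.23, 4)` is about 1.21, and no thread count can push the bound past 1/0.77, roughly 1.30, even with zero aborts.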

Page 21: Abort Statistics (1/2)

[Figures: abort ratio (%) vs. number of software threads for 429.mcf, 456.hmmer, and 433.milc, each broken down into Total, Order inversion, Buffer overflow, Conflict, and Other]

• Conflicts were a dominant abort reason in all of the benchmarks except 433.milc.

Page 22: Abort Statistics (2/2)

[Figures: abort ratio (%) vs. number of software threads for 464.h264ref, 482.sphinx3, and 470.lbm, each broken down into Total, Order inversion, Buffer overflow, Conflict, and Other]

• Conflicts were a dominant abort reason in all of the benchmarks except 433.milc.

Page 23: Reasons for Conflicts and Possible Hardware Support

Benchmark    | Conflict reason                                | Possible hardware support
429.mcf      | RAW dependence                                 | Data forwarding
433.milc     | No                                             |
456.hmmer    | RAW dependence                                 | Data forwarding
464.h264ref  | WAR dependence                                 | Multi-version cache
464.h264ref  | WAW dependence (false sharing by prefetching)  | (Fix in prefetcher)
470.lbm      | WAW dependence (false sharing)                 | Word-level conflict detection
482.sphinx3  | WAW dependence (false sharing by prefetching)  | (Fix in prefetcher)
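RAW, WAR, and WAW name which access of the earlier iteration meets which access of the later one. The sketch below classifies a pair of iterations from their read and write sets; the bitmask encoding (one bit per memory location) and the names `access_set` and `classify` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Hypothetical encoding: bit k set means "memory location k is accessed". */
typedef struct {
    uint64_t reads;
    uint64_t writes;
} access_set;

/* Dependences from an earlier iteration to a later one, in sequential order. */
static int classify(access_set earlier, access_set later)
{
    int deps = 0;
    if (earlier.writes & later.reads)  deps |= DEP_RAW;  /* read-after-write  */
    if (earlier.reads  & later.writes) deps |= DEP_WAR;  /* write-after-read  */
    if (earlier.writes & later.writes) deps |= DEP_WAW;  /* write-after-write */
    return deps;
}

/* Usage: two consecutive iterations of a loop shaped like
 * dc[k] = dc[k-1] + ...; iteration k reads location k-1 and writes k. */
static void check_loop_carried_raw(void)
{
    access_set k1 = { .reads = UINT64_C(1) << 0, .writes = UINT64_C(1) << 1 };
    access_set k2 = { .reads = UINT64_C(1) << 1, .writes = UINT64_C(1) << 2 };
    assert(classify(k1, k2) == DEP_RAW);
}
```

HTM tracks these sets per cache line rather than per location, which is why false sharing shows up in the table as a WAW dependence even when the written elements differ.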

Page 24: Examples of Read-After-Write Data Dependence

• Hardware support already proposed in the TLS literature.
  – Data forwarding.

429.mcf

    static int size;
    static DATA array[N];
    func() {
        ...
        for (...) {
            ...
            if (...) {
                size++;
                array[size]->field = ...;
            }
        }
        ...
    }

456.hmmer

    for (k = 1; k <= M; k++) {
        ...
        dc[k] = dc[k-1] + ...;
        ...
    }

Page 25: Example of Write-After-Read Data Dependence

• Difficult to analyze by a compiler.
  – WAR dependence across different functions in different source files.
• Multi-version caches needed.

464.h264ref

    for (...) {
        ...
        line = func();
        ... = line[0];
        ...
    }

    static DATA line[N];

    DATA *func() {
        ...
        line[0] = ...;
        ...
        return line;
    }

Page 26: Conflicts Precede Commit Order Inversion

• Commit order matters only when most of the transactions reach their commit points.
• With data dependence, most of the transactions cannot run to the end.

[Figure: iterations 1–3 run as transactions on 3 threads; a conflict occurs before the point where a commit order inversion would be detected]

Page 27: Conflicts due to Prefetching

• Even with transaction coarsening, conflicts still happened.
  – 464.h264ref and 482.sphinx3.
• Prefetched adjacent cache lines caused conflicts.

[Figure: three consecutive 64-byte cache lines; Threads 1 and 2 write different lines, and a prefetch of the adjacent line causes a conflict]
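Why coarsening to one cache line per transaction is not enough can be illustrated with a simple pairing rule. The rule below is an assumption, not something stated on the slide: it models an adjacent-line prefetcher that completes a 128-byte aligned pair, so lines 2n and 2n+1 are buddies. `buddy_line` and `prefetch_may_touch` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed pairing rule: cache lines 2n and 2n+1 form one 128-byte aligned
 * block, so an access to either line may prefetch the other. */
static size_t buddy_line(size_t line)
{
    return line ^ (size_t)1;
}

/* True if a thread that touches only `line_a` could still pull `line_b`
 * into its cache through the assumed prefetcher. */
static bool prefetch_may_touch(size_t line_a, size_t line_b)
{
    return buddy_line(line_a) == line_b;
}

/* Usage: Thread 1 writes line 0 and Thread 2 writes line 1. Their writes
 * never overlap, yet each line is the other's buddy. */
static void check_adjacent_lines(void)
{
    assert(prefetch_may_touch(0, 1));
    assert(prefetch_may_touch(1, 0));
    assert(!prefetch_may_touch(1, 2));  /* lines 1 and 2 are in different pairs */
}
```

Under this model, two transactions that own neighboring lines of the same pair can still be reported as conflicting, which matches the residual conflicts seen after coarsening.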

Page 28: Conclusion

• How well can TLS improve the performance on real HTM hardware?
  – Up to 11% speedups with 4 threads in SPEC CPU2006 on 4th Generation Core Processor.
  – But degraded throughput in most cases.
• What kind of hardware support should be implemented next in the off-the-shelf HTM?
  – Hardware support for ordered transactions will help in parallel programs.
  – However, many programs contain data dependence.
    • Not only ordered transactions, but also other hardware facilities to avoid conflicts should be implemented.
  – (Intel should fix the adjacent cache line prefetcher!)