A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University

Post on 19-Dec-2015

A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan,

Antonia Zhai, and Todd C. Mowry

Carnegie Mellon University

Outline
- Motivation
- Thread-level speculation (TLS)
- Coherence scheme
- Optimizations
- Methodology
- Results
- Conclusion

Motivation
- Leading chip manufacturers are moving to multi-core architectures
- Multiple cores are usually used to increase throughput
- To exploit these parallel resources to improve performance, programs need to be parallelized
- Integer programs are hard to parallelize
- Use speculation: thread-level speculation (TLS)

Thread level speculation (TLS)

Scalable Approach
- The paper aims to design a scalable approach that applies to a wide variety of multiprocessor-like architectures
- The only limitation is that the architecture must be shared-memory based
- TLS is implemented on top of an invalidation-based cache coherence protocol

Example
- Each cache line has special bits
  - SL: a speculative load has accessed the line
  - SM: the line has been speculatively modified
- When an invalidation arrives, the thread is squashed if
  - the line is present,
  - SL is set, and
  - the epoch number indicates an earlier thread
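
The squash condition on this slide can be sketched as a check that runs when an invalidation arrives. This is only an illustration under stated assumptions, not the paper's hardware: `CacheLine`, `Invalidation`, and `should_squash` are hypothetical names, and epochs are reduced to plain integers where a smaller number means a logically earlier epoch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheLine:
    sl: bool = False  # SL: a speculative load has accessed the line
    sm: bool = False  # SM: the line has been speculatively modified


@dataclass
class Invalidation:
    address: int
    epoch: int  # epoch of the writer (assumption: smaller means earlier)


def should_squash(cache: Dict[int, CacheLine], my_epoch: int,
                  inv: Invalidation) -> bool:
    """Return True when the incoming invalidation signals a violation."""
    line: Optional[CacheLine] = cache.get(inv.address)
    if line is None:   # the line is not present: nothing was read
        return False
    if not line.sl:    # the line was never speculatively loaded
        return False
    return inv.epoch < my_epoch  # the write came from an earlier epoch
```

All three conditions must hold; a write from a later epoch, or to a line that was never speculatively loaded, does not trigger a squash in this sketch.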

Speculation level
- We are concerned only with the speculation level: the level in the cache hierarchy where the cache protocol begins
- We can ignore all the other levels

Cache line states
- Apart from the cache state bits, each line needs SL and SM bits
- A cache line with speculative bits set cannot be replaced
- If such a line would have to be replaced, the thread is either squashed or the operation is delayed
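
A rough sketch of the replacement rule, assuming a set-associative cache modeled as a list of lines. `Line` and `pick_victim` are hypothetical names introduced for illustration; the choice between squashing and delaying is left to the caller because the slide does not say how it is made.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    tag: int
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified

    @property
    def speculative(self) -> bool:
        return self.sl or self.sm


def pick_victim(cache_set: List[Line]) -> Optional[Line]:
    """Return a replaceable line, or None if every line holds speculative state."""
    for line in cache_set:
        if not line.speculative:
            return line
    # No safe victim: the caller must squash the thread or delay the access.
    return None
```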

Basic cache coherence protocol
- When a processor wants to load a value, it needs at least shared access to the line
- When it wants to write, it needs exclusive access
- The coherence mechanism issues invalidation messages when it receives a request for exclusive access
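
The baseline protocol can be illustrated with a toy directory that records which processors hold each line. This is a simplified sketch: `Directory` and its methods are hypothetical names, and write-back of dirty data and state downgrades are deliberately ignored.

```python
from typing import Dict, List, Set


class Directory:
    """Tracks, per line address, which processors hold a copy."""

    def __init__(self) -> None:
        self.sharers: Dict[int, Set[int]] = {}

    def read(self, cpu: int, addr: int) -> None:
        # A load needs at least shared access to the line.
        self.sharers.setdefault(addr, set()).add(cpu)

    def write(self, cpu: int, addr: int) -> List[int]:
        # A store needs exclusive access: every other copy is invalidated.
        others = sorted(self.sharers.get(addr, set()) - {cpu})
        self.sharers[addr] = {cpu}
        return others  # processors that receive an invalidation message
```

In the TLS scheme, these invalidation messages are what carry the information a receiving cache needs to detect a violation.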

Coherence mechanism

Commit
- When the homefree token arrives, there is no possibility of further squashes
- SpE is changed to E and SpS to S
- Lines with the SM bit set must have the D (dirty) bit set
- If a line is speculatively modified and shared, exclusive access must be obtained for that line
- The ownership required buffer (ORB) is used to track such lines
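
The commit steps can be sketched as follows. This is a minimal illustration with hypothetical names (`Line`, `commit`); the ORB is modeled as a plain list of addresses, the upgrade request to the coherence mechanism is reduced to a state change, and clearing the SL and SM bits at the end is an assumption not stated on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    state: str            # one of "E", "S", "SpE", "SpS"
    sl: bool = False
    sm: bool = False
    dirty: bool = False   # D bit


def commit(cache: Dict[int, Line], orb: List[int]) -> List[int]:
    """Commit once homefree; returns the addresses upgraded via the ORB."""
    upgraded = []
    for addr in orb:
        # Speculatively modified and shared: obtain exclusive access first.
        cache[addr].state = "SpE"
        upgraded.append(addr)
    orb.clear()
    for line in cache.values():
        if line.sm:
            line.dirty = True  # speculative modifications become dirty data
        line.state = {"SpE": "E", "SpS": "S"}.get(line.state, line.state)
        line.sl = line.sm = False  # assumption: speculative bits are reset
    return upgraded
```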

Squash
- All speculatively modified lines have to be invalidated
- SpE is changed to E and SpS to S
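
A squash can be sketched in the same style. `Line` and `squash` are hypothetical names, invalidation is modeled by deleting the entry from a dictionary, and clearing the SL bit on surviving lines is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    state: str        # one of "E", "S", "SpE", "SpS"
    sl: bool = False
    sm: bool = False


def squash(cache: Dict[int, Line]) -> List[int]:
    """Discard speculative state; returns the invalidated addresses."""
    invalidated = []
    for addr in list(cache):
        line = cache[addr]
        if line.sm:
            del cache[addr]  # speculatively modified data is thrown away
            invalidated.append(addr)
        else:
            line.state = {"SpE": "E", "SpS": "S"}.get(line.state, line.state)
            line.sl = False  # assumption: the SL bit is reset
    return invalidated
```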

Performance Optimizations

- Forwarding data between epochs: predictable data dependences are synchronized
- Dirty and speculatively loaded state: usually, if a dirty line is speculatively loaded, it is flushed; this can be avoided
- Suspending violations: when a speculative line has to be evicted, the epoch can be suspended rather than squashed

Multiple writers
- If two epochs write to the same line, one of them has to be squashed to avoid the multiple-writer problem
- This can be avoided by maintaining fine-grained disambiguation bits
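
One way to picture fine-grained disambiguation is a per-word modified mask that lets two versions of a line be combined instead of squashing an epoch. The sketch below is an assumption-laden illustration: `merge_line` is a hypothetical helper, the word granularity is chosen arbitrarily, and lines are modeled as Python lists.

```python
from typing import List


def merge_line(committed: List[int], epoch_copy: List[int],
               sm_mask: int) -> List[int]:
    """Take each word from the epoch's copy only where its SM bit is set."""
    return [
        epoch_copy[i] if sm_mask & (1 << i) else committed[i]
        for i in range(len(committed))
    ]


# Two epochs modify different words of the same four-word line.
base = [0, 0, 0, 0]
after_first = merge_line(base, [11, 0, 0, 0], 0b0001)          # wrote word 0
after_second = merge_line(after_first, [0, 0, 22, 0], 0b0100)  # wrote word 2
```

Merging in epoch order keeps both writes, so neither epoch has to be squashed in this example.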

Implementation

Epoch numbers
- An epoch number has two parts: a TID and a sequence number
- To avoid costly comparisons on every access, the difference is precomputed and a logically-later mask is formed
- Epoch numbers are maintained in one place per chip
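
The idea of precomputing a logically-later mask can be sketched as below. The details are assumptions made for illustration: `Epoch`, `is_logically_later`, and `logically_later_mask` are hypothetical names, only epochs with the same TID are treated as ordered, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Epoch:
    tid: int  # TID part of the epoch number
    seq: int  # sequence number part


def is_logically_later(a: Epoch, b: Epoch) -> bool:
    """True if a is logically later than b (assumption: same TID required)."""
    return a.tid == b.tid and a.seq > b.seq


def logically_later_mask(mine: Epoch, others: List[Epoch]) -> int:
    """Bit i is set when others[i] is logically later than mine."""
    mask = 0
    for i, other in enumerate(others):
        if is_logically_later(other, mine):
            mask |= 1 << i
    return mask
```

Once the mask is formed, a per-access check can be a single bitwise AND against it rather than a full epoch comparison.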

Speculative state implementation

Multiple writers - implementation
- False violations are also handled in the same way

Correctness considerations
- Speculation fails if the speculative state is lost
- Exceptions are handled only once the homefree token has been received
- System calls are also postponed until then
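
Postponing exceptions and system calls can be modeled as a queue of deferred actions that runs only once the epoch is homefree. This is a software analogy under assumptions, not the paper's mechanism; the `Epoch` class and its methods are hypothetical.

```python
from typing import Callable, List


class Epoch:
    """Defers irreversible actions until the homefree token arrives."""

    def __init__(self) -> None:
        self.homefree = False
        self.deferred: List[Callable[[], None]] = []

    def request(self, action: Callable[[], None]) -> None:
        if self.homefree:
            action()  # no longer speculative: safe to perform immediately
        else:
            self.deferred.append(action)  # postpone while speculative

    def receive_homefree(self) -> None:
        self.homefree = True
        for action in self.deferred:
            action()
        self.deferred.clear()
```

If the epoch is squashed before the token arrives, the deferred actions are simply dropped, so nothing irreversible happens on a mis-speculated path.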

Methodology
- Detailed out-of-order simulation based on the MIPS R10000
- Fork and other synchronization overheads are 10 cycles

Results
- Figure: normalized execution cycles

Results
- Buk and equake: memory performance is a bottleneck
- With more than 4 processors, ijpeg performance degrades
  - The number of available threads is limited
  - There are some conflicts in the cache

Overheads
- Violations
- Cache locality is important
- ORB size can be further reduced by early release of the ORB

Communication overhead
- Buk is insensitive to communication overhead

Multiprocessor performance
- Advantage: more cache storage
- Disadvantage: increased communication latency

Conclusion
- With TLS, even integer programs can be parallelized to obtain speedup
- The approach is scalable and can be applied to various other architectures that support multiple threads
- Some applications are insensitive to communication latency, so large-scale parallel architectures using TLS are possible

Thanks!