A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia...

27
A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    218
  • download

    0

Transcript of A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia...

Page 1: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan,

Antonia Zhai, and Todd C. Mowry

Carnegie Mellon University

Page 2: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Outline Motivation Thread level speculation (TLS) Coherence scheme Optimizations Methodology Results Conclusion

Page 3: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Motivation Leading chip manufactures going for multi-

core architectures Usually used to increase throughput To exploit these parallel resources to increase

performance – need to parallelize programs Integer programs hard to parallelize Use speculation – thread level speculation

(TLS)!

Page 4: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Thread level speculation (TLS)

Page 5: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Scalable Approach The paper aims to design a scalable approach

which applies to wide variety of multi-processor like architectures

Only limitation is that the architecture should be shared memory based

The TLS is implemented over the invalidation based cache coherence protocol

Page 6: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Example Each cache line has special bits

SL – speculative load has accessed the line SM – the line is speculatively modified

Thread is squashed if Line is present SL is set If epoch number indicates an earlier thread

Page 7: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Speculation level We are concerned only

with the speculation level – level in the cache hierarchy where the cache protocol begins

We can ignore all the other levels

Page 8: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Cache line states Apart from the cache

state bits we need SL and SM bits

A cache line with speculative bits set cannot be replaced

The thread is either squashed or the operation is delayed

Page 9: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Basic cache coherence protocol When a processor wants to load a value, it

atleast needs shared access to the line When it wants to write, it needs exclusive

access Coherence mechanism issues invalidation

message when it receives request for exclusive access

Page 10: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Coherence mechanism

Page 11: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Commit When the homefree token arrives there is no

possibility of further squashes SpE is changed to E and SpS to S Lines with SM bit set has to have D bit set If a line is speculatively modified and shared,

we have to get exclusive access for that line Ownership required buffer (ORB) is used to track

such lines

Page 12: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Squash All speculatively modified lines have to be

invalidated SpE is changed to E and SpS to S

Page 13: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Performance Optimizations

Forwarding Data Between Epochs: Predictable data dependences are synchronized

Dirty and Speculatively Loaded State: Usually if a dirty line is speculatively loaded, it is

flushed – this can be avoided Suspending Violations:

When we have to evict a speculative line, we don’t need to squash

Page 14: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Multiple writers If two epochs write to the same line – we

have to squash one to avoid multiple writer problem

Possible to avoid this by maintaining fine grained disambiguation bits

Page 15: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Implementation

Page 16: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Epoch numbers Has two parts – TID and sequence number To avoid costly comparisons during every

access – the difference is precomputed and a logically later mask is formed

Epoch numbers are maintained at one place for one chip

Page 17: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Speculative state implementation

Page 18: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Multiple writers - implementation False violations are also handled in the same

way

Page 19: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Correctness considerations Speculation fails if the speculative state is lost Exceptions are handled only when the

homefree token is got System calls are also postponed

Page 20: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Methodology Detailed out-of-order simulation based on

MIPS R10000 is done Fork and other synchronization overhead is

10 cycles

Page 21: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Results Normalized execution cycles

Page 22: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Results Buk and equake – memory performance is a

bottleneck When increased more than 4 processors ijpeg

performance degrades Number of threads available is less Some conflicts in cache

Page 23: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Overheads Violations

Cache locality is important ORB size can be further reduced – early release of

ORB

Page 24: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Communication overhead Buk is insensitive

Page 25: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Multiprocessor performance Advantages

More cache storage Disadvantage

Increased communication latency

Page 26: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Conclusion By using TLS even integer programs can be

parallelized to get speedup The approach is scalable and can be applied

to various other architectures which support multiple threads

There are applications that are insensitive to communication latency – so large scale parallel architectures using TLS are possible

Page 27: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Thanks!