Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 3, 2000
A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia...
-
date post
19-Dec-2015 -
Category
Documents
-
view
218 -
download
0
Transcript of A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia...
A Scalable Approach to Thread-Level Speculation
J. Gregory Steffan, Christopher B. Colohan,
Antonia Zhai, and Todd C. Mowry
Carnegie Mellon University
Outline Motivation Thread level speculation (TLS) Coherence scheme Optimizations Methodology Results Conclusion
Motivation Leading chip manufactures going for multi-
core architectures Usually used to increase throughput To exploit these parallel resources to increase
performance – need to parallelize programs Integer programs hard to parallelize Use speculation – thread level speculation
(TLS)!
Thread level speculation (TLS)
Scalable Approach The paper aims to design a scalable approach
which applies to wide variety of multi-processor like architectures
Only limitation is that the architecture should be shared memory based
The TLS is implemented over the invalidation based cache coherence protocol
Example Each cache line has special bits
SL – speculative load has accessed the line SM – the line is speculatively modified
Thread is squashed if Line is present SL is set If epoch number indicates an earlier thread
Speculation level We are concerned only
with the speculation level – level in the cache hierarchy where the cache protocol begins
We can ignore all the other levels
Cache line states Apart from the cache
state bits we need SL and SM bits
A cache line with speculative bits set cannot be replaced
The thread is either squashed or the operation is delayed
Basic cache coherence protocol When a processor wants to load a value, it
atleast needs shared access to the line When it wants to write, it needs exclusive
access Coherence mechanism issues invalidation
message when it receives request for exclusive access
Coherence mechanism
Commit When the homefree token arrives there is no
possibility of further squashes SpE is changed to E and SpS to S Lines with SM bit set has to have D bit set If a line is speculatively modified and shared,
we have to get exclusive access for that line Ownership required buffer (ORB) is used to track
such lines
Squash All speculatively modified lines have to be
invalidated SpE is changed to E and SpS to S
Performance Optimizations
Forwarding Data Between Epochs: Predictable data dependences are synchronized
Dirty and Speculatively Loaded State: Usually if a dirty line is speculatively loaded, it is
flushed – this can be avoided Suspending Violations:
When we have to evict a speculative line, we don’t need to squash
Multiple writers If two epochs write to the same line – we
have to squash one to avoid multiple writer problem
Possible to avoid this by maintaining fine grained disambiguation bits
Implementation
Epoch numbers Has two parts – TID and sequence number To avoid costly comparisons during every
access – the difference is precomputed and a logically later mask is formed
Epoch numbers are maintained at one place for one chip
Speculative state implementation
Multiple writers - implementation False violations are also handled in the same
way
Correctness considerations Speculation fails if the speculative state is lost Exceptions are handled only when the
homefree token is got System calls are also postponed
Methodology Detailed out-of-order simulation based on
MIPS R10000 is done Fork and other synchronization overhead is
10 cycles
Results Normalized execution cycles
Results Buk and equake – memory performance is a
bottleneck When increased more than 4 processors ijpeg
performance degrades Number of threads available is less Some conflicts in cache
Overheads Violations
Cache locality is important ORB size can be further reduced – early release of
ORB
Communication overhead Buk is insensitive
Multiprocessor performance Advantages
More cache storage Disadvantage
Increased communication latency
Conclusion By using TLS even integer programs can be
parallelized to get speedup The approach is scalable and can be applied
to various other architectures which support multiple threads
There are applications that are insensitive to communication latency – so large scale parallel architectures using TLS are possible
Thanks!