CS 7810 Lecture 18
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
J.G. Steffan and T.C. Mowry
Proceedings of HPCA-4
February 1998
Multi-Threading
• CMPs advocate low complexity and static approaches to parallelism extraction
• Resolving memory dependences for integer codes is not easy!
Large window: 100 in-flight instrs
Compiler-generated threads: 4 windows of 25 instrs each
Probable Conflicts
[Figure: labels p and q]
Example: Compress
Example Execution
Compiler Optimizations
• Induction variables: in_count
• Reduction: out_count
• Parallel I/O: getchar() and putchar()
• Scalar forwarding: free_entries
• Ambiguous loads and stores: hash[…]
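The first two optimizations above can be illustrated with a minimal sketch. It is not the compress source: the functions `sequential`, `run_epoch`, and `run_epochs`, the epoch size, and the "non-space character produces output" rule are all assumptions made for illustration. The point is only that an induction variable such as in_count can be recomputed from the epoch index, and a reduction such as out_count can be accumulated privately and summed at the end, so neither forces one epoch to wait for the previous one.

```python
def sequential(data):
    """Original loop shape: both counters are carried across iterations."""
    in_count = 0
    out_count = 0
    for ch in data:
        in_count += 1          # induction variable
        if ch != " ":          # stand-in for "this input produces output"
            out_count += 1     # reduction variable
    return in_count, out_count


def run_epoch(data, epoch_index, epoch_size):
    """One epoch, written so that it needs no value from earlier epochs."""
    # Induction: the starting in_count is a function of the epoch index.
    start = epoch_index * epoch_size
    chunk = data[start:start + epoch_size]
    # Reduction: accumulate into a private partial sum.
    local_out = sum(1 for ch in chunk if ch != " ")
    return len(chunk), local_out


def run_epochs(data, epoch_size):
    """Run every epoch independently, then combine the partial results."""
    n_epochs = (len(data) + epoch_size - 1) // epoch_size
    results = [run_epoch(data, i, epoch_size) for i in range(n_epochs)]
    in_count = sum(r[0] for r in results)
    out_count = sum(r[1] for r in results)
    return in_count, out_count
```

Here the epochs run one after another in a list comprehension, but since no epoch reads a counter produced by another, they could run in any order.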
Methodology
• Threads (epochs) were constructed by hand
• The procs are in-order and instrs are unit latency
Ambiguous Loads and Stores
Average Run Lengths
Forwarding Registers and Scalars
Average Run Lengths
Realistic Models
• 10-cycle forwarding latency
• Sharing at cache line granularity
• Recovery from misspeculation
• Results are not sensitive to forwarding latency or cache line size
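Tracking sharing at cache line granularity means that a store and a load to different addresses can still appear to conflict when they fall in the same line. The sketch below only illustrates that effect; the helper names `same_line` and `false_conflicts` and the example addresses and line sizes are assumptions, not values from the paper.

```python
def same_line(addr_a, addr_b, line_size):
    """True if two byte addresses map to the same cache line."""
    return addr_a // line_size == addr_b // line_size


def false_conflicts(store_addrs, load_addrs, line_size):
    """Count (store, load) pairs that share a line but not an address."""
    return sum(
        1
        for s in store_addrs
        for l in load_addrs
        if s != l and same_line(s, l, line_size)
    )
```

For example, with a store to address 100 and loads from 104 and 200, a 32-byte line reports one false conflict, while a 4-byte granularity reports none.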
Hardware Support
• Cache coherence protocol for the L1 caches
• For each cache line, keep track of whether the line has been read/modified
• When the oldest thread writes to a cache line, an invalidate is sent to the other caches
• A younger thread sets a violation flag if it has speculatively loaded the line; s/w recovery is initiated when that thread commits
• Cache line evicts cause violations (not common)
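The detection scheme in the bullets above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the protocol from the paper: the class `EpochCache`, the functions `store` and `try_commit`, and the 32-byte `LINE_SIZE` are hypothetical, every load is treated as speculative, and evictions are not modeled.

```python
LINE_SIZE = 32  # bytes per cache line (assumed value)


class EpochCache:
    """Per-thread L1 state: which lines were speculatively loaded or modified."""

    def __init__(self, epoch_id):
        self.epoch_id = epoch_id      # lower id = older epoch
        self.spec_loaded = set()      # lines read speculatively
        self.modified = set()         # lines written by this epoch
        self.violation = False

    def load(self, addr):
        self.spec_loaded.add(addr // LINE_SIZE)

    def on_invalidate(self, line, writer_id):
        # Only an epoch younger than the writer can have read a stale value.
        if writer_id < self.epoch_id and line in self.spec_loaded:
            self.violation = True


def store(caches, writer, addr):
    """A write marks the line modified and invalidates it in the other caches."""
    line = addr // LINE_SIZE
    writer.modified.add(line)
    for cache in caches:
        if cache is not writer:
            cache.on_invalidate(line, writer.epoch_id)


def try_commit(cache):
    """At commit time, a flagged epoch starts recovery instead of committing."""
    if cache.violation:
        cache.spec_loaded.clear()
        cache.modified.clear()
        cache.violation = False
        return "recover"
    return "commit"
```

In this model the flag is only checked in `try_commit`, which mirrors the slide's point that recovery starts when the thread commits rather than at the moment the conflicting store arrives.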
Role of the Compiler
• Profiling to identify epochs large enough to offset thread management and communication cost, yet small enough to keep speculative state low
• Estimate probability of violation (static/dynamic)
• Optimizations (induction, reduction, parallel I/O)
• Scalar forwarding and rescheduling
• Insertion of register recovery code
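One way a dynamic (profile-based) estimate of violation probability could look is sketched below. It is an illustration only: the function `violation_fraction`, the representation of a profile as per-epoch (reads, writes) address sets, and the rule that an epoch overlaps with the `n_procs - 1` epochs before it are all assumptions. The estimate is pessimistic, since it counts any read of an address written by a concurrently running older epoch, regardless of the actual timing of the load and the store.

```python
def violation_fraction(epochs, n_procs):
    """Fraction of epochs that read an address written by an overlapping older epoch.

    epochs  -- list of (reads, writes) pairs, each a set of addresses
    n_procs -- number of epochs assumed to be in flight at once
    """
    if not epochs:
        return 0.0
    violated = 0
    for i, (reads, _) in enumerate(epochs):
        first_overlapping = max(0, i - (n_procs - 1))
        earlier_writes = set()
        for _, writes in epochs[first_overlapping:i]:
            earlier_writes |= writes
        if reads & earlier_writes:
            violated += 1
    return violated / len(epochs)
```

A compiler could compare such a fraction across candidate loops to decide which ones are worth speculating on; any threshold for that decision would itself be a tuning assumption.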
Conclusions
• Hardware catches violations; compiler can parallelize aggressively
• Competitive implementation: large window with store sets prediction