CS 7810 Lecture 18: The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization. J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998

CS 7810 Lecture 18

The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

J.G. Steffan and T.C. Mowry

Proceedings of HPCA-4

February 1998

Multi-Threading

• CMPs advocate low complexity and static approaches to parallelism extraction

• Resolving memory dependences for integer codes is not easy!

Large window: 100 in-flight instrs

Compiler-generated threads: 4 windows of 25 instrs each

Probable Conflicts

[Figure: labels p and q]

Example: Compress

Example Execution

Compiler Optimizations

• Induction variables: in_count

• Reduction: out_count

• Parallel I/O: getchar() and putchar()

• Scalar forwarding: free_entries

• Ambiguous loads and stores: hash[…]
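The first two items can be illustrated with a minimal sketch. The loop body below is a stand-in and is not the compress code; only the names in_count and out_count come from the slide, while epoch, by_epochs, and the space-skipping rule are hypothetical. The point is that an induction variable can be recomputed from the epoch's starting position, and a reduction can be accumulated privately and combined afterwards, so neither needs to be passed from one epoch to the next.

```python
def sequential(data):
    """Reference loop: both counters are carried from iteration to iteration."""
    in_count = 0
    out_count = 0
    for ch in data:
        in_count += 1           # induction variable
        if ch != " ":           # stand-in for "this input produced output"
            out_count += 1      # reduction
    return in_count, out_count


def epoch(data, start, end):
    """One epoch over data[start:end], with no values received from older epochs."""
    in_count = start            # induction: entry value derived from position
    local_out = 0               # reduction: private partial sum
    for i in range(start, end):
        in_count += 1
        if data[i] != " ":
            local_out += 1
    return in_count, local_out


def by_epochs(data, epoch_size):
    """Run the epochs one after another; they are independent, so order is irrelevant."""
    in_count = 0
    partial_sums = []
    for start in range(0, len(data), epoch_size):
        end = min(start + epoch_size, len(data))
        in_count, local_out = epoch(data, start, end)
        partial_sums.append(local_out)
    return in_count, sum(partial_sums)
```

Here by_epochs("a b c d", 3) and sequential("a b c d") return the same pair, which is the property the transformation has to preserve.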

Methodology

• Threads (epochs) were constructed by hand

• The procs are in-order and instrs are unit latency

Ambiguous Loads and Stores

Average Run Lengths

Forwarding Registers and Scalars

Average Run Lengths

Realistic Models

• 10-cycle forwarding latency

• Sharing at cache line granularity

• Recovery from misspeculation

• Results are not sensitive to forwarding latency or cache line size
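Tracking sharing at cache line granularity means that two accesses to different words can still be treated as a conflict when they fall in the same line. A minimal sketch of that check follows; the 32-byte default line size and the function names are assumptions, since the slide gives neither.

```python
def line_of(addr, line_bytes=32):
    """Index of the cache line holding byte address addr (line size is an assumption)."""
    return addr // line_bytes


def conflicts(store_addr, load_addr, line_bytes=32):
    """True if a store and a load would be seen as touching the same tracked unit."""
    return line_of(store_addr, line_bytes) == line_of(load_addr, line_bytes)
```

With these definitions, conflicts(0x100, 0x104) is True although the two words differ, while conflicts(0x100, 0x104, line_bytes=4) is False.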

Hardware Support

• Cache coherence protocol for the L1 caches

• For each cache line, keep track of whether the line has been read/modified

• When the oldest thread writes to a cache line, an invalidate is sent to the other caches

• A younger thread sets a violation flag if it has speculatively loaded the line; s/w recovery is initiated when that thread commits

• Cache line evicts cause violations (not common)
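The bullets above can be sketched as a small software model. This is an illustration under assumptions and not the hardware design: the class and function names are hypothetical, cache lines are plain integers, and only stores by the oldest epoch are modeled, as on the slide.

```python
class EpochCache:
    """Per-processor L1 tracking state for one speculative epoch (hypothetical model)."""

    def __init__(self):
        self.spec_loaded = set()   # lines this epoch has speculatively read
        self.modified = set()      # lines this epoch has written
        self.violation = False

    def load(self, line):
        self.spec_loaded.add(line)

    def store(self, line):
        self.modified.add(line)

    def on_invalidate(self, line):
        # An invalidate for a line already read speculatively means the
        # earlier load may have returned a stale value.
        if line in self.spec_loaded:
            self.violation = True

    def evict(self, line):
        # Evicting a tracked line loses the read/modified information,
        # so this model conservatively flags a violation.
        if line in self.spec_loaded or line in self.modified:
            self.violation = True

    def try_commit(self):
        # Recovery is deferred until the epoch reaches its commit point.
        return "recover" if self.violation else "commit"


def oldest_store(caches, line):
    """caches[0] is the oldest epoch; every other entry is a younger epoch."""
    caches[0].store(line)
    for younger in caches[1:]:
        younger.on_invalidate(line)
```

In this sketch a younger epoch that loads the line only after the invalidate has arrived does not set its flag, since no stale value was read.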

Role of the Compiler

• Profiling to identify epochs large enough to offset thread management and communication cost, yet small enough to keep speculative state low

• Estimate probability of violation (static/dynamic)

• Optimizations (induction, reduction, parallel I/O)

• Scalar forwarding and rescheduling

• Insertion of register recovery code

Conclusions

• Hardware catches violations; compiler can parallelize aggressively

• Competitive implementation: large window with store sets prediction
