Beyond Auto-Parallelization: Compilers for Many-Core Systems
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Moore for Less Keynote - September 2008
Compilers for Parallel Computers (Today)

Auto-parallelizing compilers
– "Holy grail": convert sequential programs into parallel programs with little or no user intervention
– Only partial success, despite decades of work
– No performance debugging tools

Compilers for explicitly parallel languages/annotations (e.g., OpenMP, Java Threads)
– Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
– Secondary goal: perform simple optimizations specific to parallel execution
– Simple correctness and performance debugging tools
Compilers for Parallel Computers (Future)

Data flow/dependence analysis tools – unsafe/speculative
– Probabilistic approaches
– Profile-based approaches

Multithreading-specific optimization toolbox
– Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))

Auto-parallelizing compilers – with speculation
– Thread-level speculation (TLS)
– Helper threads

A holistic parallelizing tool chain.
Why Be Speculative?

Performance of programs is ultimately limited by control and data flows

Most compiler optimizations exploit knowledge of control and data flows

Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
– True for both sequential and parallel optimizations

Future compiler optimizations must rely on incomplete knowledge: speculative execution
Compilers for Parallel Computers (Future)

[Tool-chain diagram: a Dependence/Flow Analysis Tool (possibly unsafe) feeds a Parallelizing Compiler; sequential code proven parallel becomes P-way parallel code, while code that is sequential or less than P-way parallel is handed to an Auto-TLS Compiler targeting TLS/TM support.]
Outline

Context and Motivation

History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs

Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT'04, TACO'07)
Data Dependence Analysis for Arrays

Based on mathematical evaluation of array index expressions within loop nests

Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions

Coupled with a mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
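To make the GCD test concrete, here is a minimal sketch in C (illustrative only; the function names are invented and not taken from any particular compiler). For two affine accesses A[a*i + b] and A[c*j + d] in a loop nest, a dependence requires an integer solution of a*i - c*j = d - b, which exists only if gcd(a, c) divides d - b. This is a necessary condition, not a sufficient one.

```c
/* Minimal GCD-test sketch (illustrative; names are hypothetical).
 * For accesses A[a*i + b] and A[c*j + d], a dependence requires an
 * integer solution of a*i - c*j = d - b, which exists only if
 * gcd(a, c) divides (d - b). */
static int gcd(int x, int y) {
    if (x < 0) x = -x;
    if (y < 0) y = -y;
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* Returns 1 if a dependence may exist (the test cannot rule it out),
 * 0 if the two accesses are proven independent. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0)              /* both coefficients are zero:      */
        return b == d;       /* dependence iff constants match   */
    return (d - b) % g == 0;
}
```

For example, A[2*i] and A[2*i + 1] touch only even and odd elements respectively: gcd(2, 2) = 2 does not divide 1, so the pair is proven independent.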
Moore for Less Keynote - September 2008 8
Data Dependence Analysis for Arrays

What's wrong with traditional data dependence analysis?
– Not all index expressions are affine or even statically defined (e.g., subscripted subscripts)
– Not all loops are well structured (e.g., conditional exits, complex control flow)
– Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access)
– Not all applications make intense use of arrays and loop nests (e.g., many use trees, hash tables, linked lists, etc.)
Data Dependence Analysis for Irregular Programs

Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)

There is no comprehensive data dependence analysis framework for irregular applications
Outline

Context and Motivation

History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs

Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT'04, TACO'07)
Thread Level Speculation (TLS)

– Assume no dependences and execute threads in parallel
– While speculating, buffer speculative data separately
– Track data accesses and monitor cross-thread violations
– Squash offending threads and restart them

All of this can be done in hardware, software, or a combination

    for (i = 0; i < 100; i++) {
        … = A[L[i]] + …
        A[K[i]] = …
    }

Example: iteration J reads A[4] and writes A[5]; iteration J+1 reads A[2] and writes A[2]; iteration J+2 reads A[5] and writes A[6]. The write of A[5] in iteration J followed by the read of A[5] in iteration J+2 is a cross-thread RAW dependence: if the speculative iteration J+2 reads A[5] before iteration J writes it, a violation is detected and J+2 must be squashed and restarted.
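The bookkeeping behind violation detection can be sketched in a few lines of C. This toy version hard-codes the read and write addresses of the three iterations from the example above and merely checks whether a more speculative iteration read an element written by a less speculative one; the names and structure are illustrative, not a real TLS runtime.

```c
/* Toy TLS violation check (illustrative sketch, not a real runtime).
 * Per-iteration read/write addresses from the slide example:
 * J reads A[4], writes A[5]; J+1 reads A[2], writes A[2];
 * J+2 reads A[5], writes A[6]. */
#define ITERS 3
static const int reads[ITERS]  = {4, 2, 5};
static const int writes[ITERS] = {5, 2, 6};

/* A more speculative (later) iteration that read an element written
 * by a less speculative (earlier) one may have consumed a stale
 * value: flag it for squash & restart.  Returns the index of the
 * first offending iteration, or -1 if there is no violation. */
int find_squash(void) {
    for (int later = 1; later < ITERS; later++)
        for (int earlier = 0; earlier < later; earlier++)
            if (reads[later] == writes[earlier])
                return later;   /* cross-thread RAW violation */
    return -1;
}
```

Here find_squash() flags iteration J+2 (index 2), which read A[5] while iteration J had a buffered write to it; a real system would squash it and restart it with the committed value.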
TLS Overheads

– Squash & restart: re-executing the threads
– Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
– Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
– Load imbalance: a processor waiting for its thread to become non-speculative in order to commit
Coping with Overheads: a Cost Model!

Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS

Speculative parallelization can deliver significant speedup or slowdown
– Several speculation overheads
– Overheads are hard to estimate (e.g., will a thread be squashed?)

A prediction of the value of the speedup can be useful
– e.g., in a multi-tasking environment: program A wants to run speculatively in parallel on 4 cores (predicted speedup 1.8); with other programs waiting to be scheduled, the OS decides it does not pay off
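Such an OS decision can be sketched in C. The slide does not specify an actual policy, so the function name and the 0.5 per-core efficiency threshold below are invented purely for illustration.

```c
/* Hypothetical scheduling decision (illustrative only): is a
 * predicted TLS speedup worth occupying 'cores' cores while other
 * work is runnable?  The 0.5 efficiency threshold is an invented,
 * illustrative choice, not a policy from the talk. */
int worth_parallel_run(double predicted_speedup, int cores,
                       int runnable_waiting) {
    double efficiency = predicted_speedup / cores; /* per-core gain */
    if (runnable_waiting == 0)          /* cores would sit idle:    */
        return predicted_speedup > 1.0; /* any real speedup pays    */
    return efficiency > 0.5;            /* contended: demand more   */
}
```

With the slide's numbers, a predicted speedup of 1.8 on 4 cores gives an efficiency of 0.45, so with other programs waiting the OS would decline the speculative parallel run.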
TLS Overheads

– Squash & restart: re-executing the threads
  – Hard to estimate because violations are highly unpredictable
– Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
  – Hard to estimate because write-sets are somewhat unpredictable
– Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
  – Hard to estimate because write-sets are somewhat unpredictable
– Load imbalance: a processor waiting for its thread to become non-speculative in order to commit
  – Hard to estimate because workloads are very unpredictable and order matters due to the in-order commit requirement
Our Compiler Cost Model: Highlights

– First fully static compiler cost model for TLS
– Can handle all TLS overheads in a single framework, including load imbalance, which is not handled by any other cost model
– Produces not just a qualitative ("good" or "bad") assessment of the TLS benefits but a quantitative value (i.e., expected speedup/slowdown)
– Can be easily integrated into most compilers at the intermediate representation level
– Simple and fast to compute
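The flavor of such a quantitative estimate can be sketched in C. The slides do not give the model's actual equations, so the sketch below is only an illustration of the idea: combine per-thread work, per-thread overheads, and the in-order commit constraint into a predicted speedup.

```c
/* Illustrative static speedup estimate (NOT the published model's
 * equations).  Threads start together; each finishes after its own
 * work plus its overheads (squash, overflow, dispatch & commit);
 * the in-order commit requirement means the slowest thread
 * determines the parallel completion time (load imbalance).
 * Assumes threads >= 1 and positive times. */
double predict_speedup(const double *work, const double *overhead,
                       int threads) {
    double seq = 0.0, parallel = 0.0;
    for (int t = 0; t < threads; t++) {
        seq += work[t];                    /* sequential time      */
        double finish = work[t] + overhead[t];
        if (finish > parallel)             /* commit waits for the */
            parallel = finish;             /* slowest thread       */
    }
    return seq / parallel;                 /* >1 predicts speedup  */
}
```

For instance, four threads of 10 units each where one thread pays 2 units of overhead yield a predicted speedup of 40/12, about 3.3 rather than the ideal 4.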
Speedup Distribution

Very varied speedup/slowdown behavior

[Bar chart: fraction of loops (%) per benchmark (mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, average), broken down by speedup bins 0.5<S≤1, 1<S≤2, 2<S≤3, 3<S≤4.]
Model Accuracy (I): Outcomes

[Bar chart: fraction of loops (%) per benchmark (mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, average), broken down by outcome: predicted speedup/actual speedup, predicted slowdown/actual slowdown, predicted speedup/actual slowdown, predicted slowdown/actual speedup.]

– Only 17% false positives (performance degradation)
– Negligible false negatives (missed opportunities)
– Most speedups/slowdowns correctly predicted by the model
Current Developments

Done:
– Completed implementation of a TLS code generator in GCC

Doing:
– Implementing the cost model in this TLS GCC
– Profiling TLS program behavior (with IBM and U. of Manchester)

To Do:
– Develop hybrid cost models based on static and profile information
– Develop "intelligent" cost models based on Machine Learning (with U. of Manchester)
Summary

Paraphrasing M. Snir† (UIUC): "parallel programming will have to become synonymous with programming"

Getting there will require:
– Better (and unsafe) data dependence analysis tools
– Explicit (and speculative) parallel models
– Auto-parallelizing (speculative) compilers

Much work still needs to be done. At U. of Edinburgh:
– Auto-parallelizing TLS compilers
– TLS hardware
– STM (software TM)

† Director of Intel+Microsoft's UPCRC
Acknowledgments

Research Team and Collaborators
– Jialin Dou
– Salman Khan
– Polychronis Xekalakis
– Nikolas Ioannou
– Fabricio Goes
– Constantino Ribeiro
– Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
– Prof. Diego Llanos (U. of Valladolid)

Funding
– UK EPSRC: GR/R65169/01, EP/G000697/1