Adapting Convergent Scheduling Using Machine
Learning
Diego Puppin*, Mark Stephenson†, Una-May O’Reilly†, Martin Martin†, and
Saman Amarasinghe†
*Institute for Information Science and Technologies, Italy; †Massachusetts Institute of Technology, USA
Outline
This talk shows how one can apply machine learning techniques to find good phase orderings for an instruction scheduler
First, I’ll introduce the scheduler that we are interested in improving
Then, I’ll discuss genetic programming
Then, I’ll present experimental results
[Diagram: R4000-like processor core with operand network]
Clustered Architectures
Memory and registers are separated into clusters
Examples: RAW, clustered VLIWs
When scheduling, we try to co-locate data with computation
Convergent Scheduling
Convergent scheduling passes are symmetric
Each pass takes as input a preference map and outputs a preference map
Passes are modular and can be applied in any order
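As an illustration of this interface, here is a minimal Python sketch, assuming each pass is simply a function from a preference map to a preference map. The pass names and bodies below are invented stand-ins; the real passes operate on full space-time preference maps.

```python
# Hypothetical sketch of the convergent-scheduling pass interface:
# every pass maps a preference map to a new preference map, so passes
# compose in any order. Names and transforms are illustrative only.

def noise_introduction(prefs):
    """Slightly flatten all preferences (illustrative pass)."""
    return {c: 0.9 * w + 0.025 for c, w in prefs.items()}

def load_balance(prefs):
    """Penalize the most-preferred cluster a little (illustrative pass)."""
    if not prefs:
        return prefs
    hot = max(prefs, key=prefs.get)
    out = dict(prefs)
    out[hot] *= 0.8
    return out

def run_pipeline(prefs, passes):
    """Apply passes in the given order; any order is legal."""
    for p in passes:
        prefs = p(prefs)
    return prefs

# prefs: weight of assigning one instruction to each of four clusters
prefs = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}
a = run_pipeline(prefs, [noise_introduction, load_balance])
b = run_pipeline(prefs, [load_balance, noise_introduction])
# Both orders yield valid preference maps (the weights differ).
```

Because every pass shares the same signature, reordering the pipeline is just reordering the list — which is exactly what makes the phase-ordering search space so large.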
Convergent Scheduling: Preference Maps
[Figure: preference map grid — instructions 0–7 against four clusters over time slots 0–3, shaded from low confidence to high confidence]
Each entry is a weight
The weights correspond to the “confidence” of a space-time assignment for a given instruction
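A preference map can be pictured as a 3-D weight table indexed by (instruction, cluster, time). The sketch below, with made-up sizes and random weights, shows how a final schedule could be read off by taking the highest-confidence slot per instruction.

```python
import random

random.seed(0)
N_INSTS, N_CLUSTERS, N_SLOTS = 8, 4, 4  # illustrative sizes
# prefs[i][c][t] = confidence of placing instruction i on cluster c at time t
prefs = [[[random.random() for _ in range(N_SLOTS)]
          for _ in range(N_CLUSTERS)]
         for _ in range(N_INSTS)]

def finalize(prefs):
    """Pick the highest-confidence (cluster, time) slot per instruction."""
    schedule = []
    for inst in prefs:
        best = max((w, c, t)
                   for c, row in enumerate(inst)
                   for t, w in enumerate(row))
        schedule.append((best[1], best[2]))
    return schedule

schedule = finalize(prefs)  # one (cluster, slot) pair per instruction
```

In the real scheduler, passes repeatedly reshape these weights before anything is finalized — this snippet only shows the shape of the data and the last step.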
Example Dependence Graph
Critical Path Strengthening
Path Propagation
Parallelism Distribution
Path Propagation
Final Schedule
Convergent Scheduling
“Classical” scheduling passes make absolute decisions that can’t be undone
Convergent scheduling passes make soft decisions in the form of preferences
Mistakes made early on can be undone
Passes don’t impose order!
Double-Edged Sword
The good news: convergent scheduling does not constrain phase order
A nice interface makes writing and integrating passes easy
The bad news: convergent scheduling does not constrain phase order
There is a limitless number of phase orders to consider, some of which are much better than others
Our Proposal
Use genetic programming to automatically search for a phase ordering that’s catered to a given architecture and compiler
Our inspiration comes from Cooper’s work [Cooper et al., LCTES 1999]
Genetic Programming
Searching algorithm analogous to Darwinian evolution
Maintain a population of expressions, e.g.:
(sequence INITTIME (sequence PLACE (if imbalanced LOAD COMM)))
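One way to picture how such an expression drives the scheduler is as a small interpreter that flattens the tree into a list of passes to run. This is a hedged sketch: the leaf names come from the example above, but the `imbalanced` predicate and the tuple encoding are made up for illustration.

```python
# Interpret a phase-ordering expression like
#   (sequence INITTIME (sequence PLACE (if imbalanced LOAD COMM)))
# Leaves name passes, `sequence` concatenates, and `if` tests a
# property of the partial schedule (here a made-up boolean flag).

def interpret(expr, state):
    """Flatten an expression tree into a list of pass names."""
    if isinstance(expr, str):                 # leaf: a pass name
        return [expr]
    op = expr[0]
    if op == "sequence":                      # ("sequence", left, right)
        return interpret(expr[1], state) + interpret(expr[2], state)
    if op == "if":                            # ("if", predicate, then, else)
        branch = expr[2] if state[expr[1]] else expr[3]
        return interpret(branch, state)
    raise ValueError(f"unknown operator: {op}")

expr = ("sequence", "INITTIME",
        ("sequence", "PLACE", ("if", "imbalanced", "LOAD", "COMM")))

print(interpret(expr, {"imbalanced": True}))   # ['INITTIME', 'PLACE', 'LOAD']
print(interpret(expr, {"imbalanced": False}))  # ['INITTIME', 'PLACE', 'COMM']
```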
Genetic Programming
Searching algorithm analogous to Darwinian evolution
Maintain a population of expressions
Selection
The fittest expressions in the population are more likely to reproduce
Reproduction
Crossing over subexpressions of two expressions
Mutation
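Crossover and mutation on expression trees can be sketched as follows. This toy version handles only `sequence` trees encoded as nested tuples; the pass names are taken from the earlier example, and the tree-editing helpers are invented for illustration.

```python
# Illustrative GP crossover and mutation on phase-ordering trees,
# encoded as ("sequence", left, right) tuples with pass-name leaves.
import random

random.seed(1)
PASSES = ["INITTIME", "PLACE", "LOAD", "COMM", "DEP"]

def nodes(expr, path=()):
    """Enumerate (path, subtree) pairs for every node in the tree."""
    yield path, expr
    if isinstance(expr, tuple):
        yield from nodes(expr[1], path + (1,))
        yield from nodes(expr[2], path + (2,))

def replace(expr, path, sub):
    """Return a copy of expr with the subtree at `path` swapped for `sub`."""
    if not path:
        return sub
    children = list(expr)
    children[path[0]] = replace(expr[path[0]], path[1:], sub)
    return tuple(children)

def crossover(a, b):
    """Swap a random subtree of `a` for a random subtree of `b`."""
    pa, _ = random.choice(list(nodes(a)))
    _, sb = random.choice(list(nodes(b)))
    return replace(a, pa, sb)

def mutate(expr):
    """Replace a random subtree with a fresh random pass leaf."""
    p, _ = random.choice(list(nodes(expr)))
    return replace(expr, p, random.choice(PASSES))

a = ("sequence", "INITTIME", ("sequence", "PLACE", "COMM"))
b = ("sequence", "DEP", "LOAD")
child = crossover(a, b)    # mixes subexpressions of a and b
mutant = mutate(a)         # a with one subtree randomly replaced
```

Both operators always produce a syntactically valid phase ordering, since any subtree is itself a valid expression.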
General Flow
[Flow diagram: Create initial population (initial solutions) → Evaluation → Selection → Create Variants → done? (loop until done)]
Randomly generated initial population
General Flow: Evaluation
Compiler is modified to use the given expression as the phase ordering
Each expression is evaluated by compiling and running the benchmark(s)
Fitness is the relative speedup over our original phase ordering on the benchmark(s)
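The evaluation step can be sketched as below. `compile_and_run` is a hypothetical stand-in for invoking the modified compiler and the simulator and returning cycle counts; the baseline numbers are made up. Fitness is the speedup relative to the original phase ordering, averaged over the benchmarks.

```python
# Hedged sketch of the fitness computation. All names and numbers
# here are invented placeholders for the real compile-and-simulate loop.

BASELINE_CYCLES = {"vvmul": 1000, "yuv": 2400}   # made-up cycle counts

def compile_and_run(expression, benchmark):
    """Placeholder: pretend to compile with `expression` and simulate."""
    # Fake effect: longer orderings run marginally slower here.
    return BASELINE_CYCLES[benchmark] * (1.0 + 0.01 * len(expression))

def fitness(expression, benchmarks):
    """Average speedup over the baseline phase ordering."""
    speedups = [BASELINE_CYCLES[b] / compile_and_run(expression, b)
                for b in benchmarks]
    return sum(speedups) / len(speedups)

f = fitness(["inittime", "place"], ["vvmul", "yuv"])
# f > 1 would mean the candidate ordering beats the baseline.
```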
General Flow: Selection
Just as with Natural Selection, the fittest individuals are more likely to survive
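One common way to realize this is fitness-proportional (roulette-wheel) selection, sketched here with made-up individuals whose fitness values stand in for the relative speedups from the evaluation step.

```python
# Fitness-proportional selection: probability of being picked is
# proportional to fitness. Individuals and fitness values are invented.
import random

random.seed(0)
population = [("seq-a", 1.4), ("seq-b", 1.1), ("seq-c", 0.7)]  # (expr, fitness)

def select(population):
    """Pick one individual with probability proportional to fitness."""
    exprs, fits = zip(*population)
    return random.choices(exprs, weights=fits, k=1)[0]

picks = [select(population) for _ in range(1000)]
# The fittest sequence ("seq-a") is picked most often, but the
# weaker ones still survive occasionally, preserving diversity.
```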
General Flow: Create Variants
Use crossover and mutation to generate new expressions
And thus, generate new and hopefully improved phase orderings
Experimental Setup
We use an in-house VLIW compiler (SUIF, MachSUIF) and simulator
Compiler and simulator are parameterized so we can easily change VLIW configurations
Experiments presented here are for clustered architectures
Details of the architectures are in the paper
Convergent Scheduling Heuristics
Noise Introduction, Initial Time Assignment, Preplacement, Critical Path Strengthening, Communication Minimization, Parallelism Distribution, Load Balance, Dependence Enforcement, Assignment Strengthening, Functional Unit Distribution, Push to First Cluster, Critical Path Distance, Cluster Creation, Register Pressure Reduction in Time, Register Pressure Reduction in Space
Hand-Tuned Results: 4-cluster VLIW, Rich Interconnect
[Chart: speedup (0–4) of PCC, UAS, and convergent scheduling on vvmul, rbsorf, yuv, tomcatv, mxm, fir, cholesky]
Results: 4-cluster VLIW, Limited Interconnect
Training an Improved Sequence
Goal: find a sequence that works well for all the benchmarks in the last graph (vvmul, rbsorf, yuv, etc.)
Train a sequence using these benchmarks, then…
For each expression in the population, compile and run all the benchmarks, and take the average speedup as fitness
The Schedule
Evolved sequence is much more conservative in communication
inittime func dep func load func dep func comm dep func comm place
func reduces the weights of instructions on overloaded clusters
dep increases the probability that a dependent instruction is scheduled “nearby”
comm tries to keep neighboring instructions in the same cluster
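The evolved sequence can be pictured as a straight-line pass pipeline. In the toy sketch below, only `func` does anything; the other passes are identity stand-ins, since their real behavior depends on the dependence graph and the partial schedule. Everything except the pass names is invented.

```python
# Run the evolved sequence as a pipeline over a per-cluster weight map
# for one instruction. Pass bodies are illustrative stand-ins only.

EVOLVED = ("inittime func dep func load func dep func "
           "comm dep func comm place").split()

def make_passes():
    """Map pass names to toy transforms on a per-cluster weight dict."""
    def func(p):
        # Stand-in for "reduce weights on overloaded clusters":
        # shrink the most-preferred cluster's weight by 10%.
        hot = max(p, key=p.get)
        return {c: (w * 0.9 if c == hot else w) for c, w in p.items()}
    def identity(p):
        return dict(p)
    # dep, comm, inittime, load, place would inspect the dependence
    # graph and the schedule; here they are identity stand-ins.
    return {"func": func, "dep": identity, "comm": identity,
            "inittime": identity, "load": identity, "place": identity}

def run(sequence, prefs):
    passes = make_passes()
    for name in sequence:
        prefs = passes[name](prefs)
    return prefs

# func appears five times in EVOLVED, so the hot cluster's weight
# is scaled by 0.9 five times.
result = run(EVOLVED, {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.5})
```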
Results: 4-cluster VLIW, Limited Interconnect
Results: Leave-One-Out Cross Validation
Summary of Results
When we changed the architecture, the hand-tuned sequence failed
UAS and PCC outperform convergent scheduling
Our GP system found a sequence that usually outperforms UAS and PCC
Cross validation suggests that it is possible to find a “general-purpose” sequence
Running Time
Using about 20 machines in a small cluster of workstations, it takes about 2 days to evolve a sequence
This is a one-time process!
It is performed by the compiler vendor
Disappointing Result
Unfortunately, sequences with conditionals are weeded out of the GP selection process
Our system rewards parsimony
Convergent scheduling passes make soft decisions, so running an extra pass may not be detrimental
We’d like to get to the bottom of this unexpected result
Conclusions
Using GP we’re able to find architecture-specific, application-independent sequences
We can quickly retune the compiler when:
The architecture changes
The compiler itself changes
Implemented Tests