Rational Apex 4.0 Optimization
“Beware the benchmark!”
Do not use Gradient or transparent fills for slides to be used on PlaceWare.com
Presentation Outline
Outline Rational Apex optimization behaviour
Demonstrate some of the optimization techniques being used by modern compilers
Show how these techniques defeat many of the assumptions made by traditional benchmarking suites
Rational Apex Optimization
Optimization with Apex has 3 levels, controlled by the OPTIMIZATION_LEVEL switch
  Level 0 – No optimization, maximize debuggability
    • This is the default
  Level 1 – Many optimizations performed, some debuggability maintained
  Level 2 – All optimizations performed, debugging may be very limited in some code
Optimization with Apex can have one of two objectives
  Time – try to generate code that will execute in minimal time
  Space – try to generate code that is as compact as possible
  These two objectives are not mutually exclusive!
Rational Apex Optimization
Apex performs optimization in several different places
  Front End – post semantics
    • Common sub-expression elimination
    • Code in-lining
    • Loop unrolling
    • Remove unused code from local scope
  Machine independent instruction stream optimizer “optim”
    • Loop invariant hoisting
    • Range propagation
    • Constraint check elimination
    • Reduce memory movement
  Machine specific code generator
    • Peep-hole optimization
All optimization consumes extra CPU during compilation
  The default is off – OPTIMIZATION_LEVEL: 0
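As a hand-written illustration of one technique named on this slide, loop invariant hoisting, here is a minimal Python sketch. It is not Apex output: Apex applies the transformation to its own instruction stream, and the function names `scale_naive` and `scale_hoisted` are hypothetical. Both functions compute the same result; the second is the shape an optimizer aims for.

```python
import math


def scale_naive(values, angle):
    """Recomputes a loop-invariant expression on every iteration."""
    out = []
    for v in values:
        factor = math.cos(angle) * 2.0  # invariant: does not depend on v
        out.append(v * factor)
    return out


def scale_hoisted(values, angle):
    """Same computation with the invariant moved out of the loop."""
    factor = math.cos(angle) * 2.0  # computed once, before the loop
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

The transformation is only valid because `math.cos(angle) * 2.0` has no side effects and its inputs do not change inside the loop.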
Example Code – Summation of SQRT
Simple routine that sums up square roots and prints the result
Example Code – Body of G_E_F.Sqrt
Example Code – Body of G_E_F.Hardware
Optimization Level 0
No inlining, no code elimination, no check elimination
Disassembly of sum_sqrt.2.ada is 15845 lines long
  No unused code has been eliminated – all the code for generic_elementary_functions remains
Optimization Level 0 – Disassembly of “for” loop
Optimization Level 0 – Disassembly of sqrt
163 lines of assembly
Slightly abridged
Optimization Level 0 – Disassembly of hardware
56 lines of disassembly for SQRT
10 instructions for SQRT_32
Optimization Level 0 – Summary
Total of over 220 instructions generated for the code that we are interested in
  Lots of it will be unused
  Not to mention the rest of the code for the instantiation
Code maps back to source easily
  Code layout follows source
Lots of overhead for this straightforward code
  Subprogram prolog/epilog code
    • Stack checks
    • Register management
  Subprogram call/return code (3 levels deep)
  No delayed branch slots being filled
Optimization Level 2 – Disassembly of “for” loop
Optimization Level 2 – Observations
Disassembly of sum_sqrt.2.ada is 85 lines long
Entire loop and all the called subprogram code is now 12 instructions long
  5 instructions for “for” loop management
    • Includes 2 instructions for branching
  4 instructions for integer to float conversion
    • 2 are identical, as one copy is used to fill a delayed branch slot at the bottom of the loop
  1 instruction for the Text_Io code is used to fill a branch delay slot
  2 instructions to perform the actual Sqrt and summation
Optimization Level 2 – Observations
The optimization objective was Time
  Time is certainly optimized, but Space also benefited enormously
Different optimization techniques combined effectively to produce very effective code
  Inlining of 3 levels of subprogram call eliminated a significant amount of subprogram prolog/epilog
  Range propagation determined that the argument to SQRT could never be less than zero, which allowed the argument check to be removed
  Evaluation of compile-time static expressions resulted in a lot of code not being generated
    • Kind of floating point type – no case statement needed
    • Availability of Hardware SQRT – no call needed to Has_Sqrt
  Register lifetime analysis on the resulting code meant that the loop control variable and the summation variable could live in registers
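The range propagation step can be pictured with a small interval sketch in Python. This is an illustration only, not how Apex represents ranges internally: the `Interval` type, both helper functions, and the loop bounds 1 .. 1000 are assumptions, since the transcript does not include the example source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed range [lo, hi] of values a variable can take."""
    lo: float
    hi: float


def int_to_float(r):
    # Integer-to-float conversion preserves ordering, so the bounds carry over.
    return Interval(float(r.lo), float(r.hi))


def negative_check_needed(arg):
    # An "argument < 0" guard can only fire if the range reaches below zero.
    return arg.lo < 0.0


loop_index = Interval(1, 1000)      # assumed bounds of the "for" loop
sqrt_arg = int_to_float(loop_index)
keep_check = negative_check_needed(sqrt_arg)  # False: the guard can be dropped
```

Once the guard is known never to fire, the code that raises the error is unreachable and can be removed together with the comparison.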
Performing Benchmarks
Benchmarks usually consist of two distinct loops
  A “Null Timing” loop to determine the overhead of the loop code itself
  The Code Under Test loop, which has the same structure as the Null Timing loop with the inside of the loop replaced with the C.U.T.
Timing equation looks like
  T_CUT = (T_CUT_loop - T_null_loop) / n
    • Where n is the number of iterations
    • Usually n has to be very high so that the resolution of the system clock is not significant in the result
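The two-loop structure and the timing equation can be sketched as a small harness. This Python version only illustrates the shape of such a benchmark; it is not taken from any particular suite, and the names `time_loop`, `per_iteration_time` and `benchmark` are hypothetical.

```python
import time


def per_iteration_time(t_cut_loop, t_null_loop, n):
    """Timing equation: T_CUT = (T_CUT_loop - T_null_loop) / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (t_cut_loop - t_null_loop) / n


def time_loop(body, n):
    """Run body() n times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(n):
        body()
    return time.perf_counter() - start


def benchmark(code_under_test, n=100_000):
    """Time a null loop and a code-under-test loop of identical structure."""
    t_null_loop = time_loop(lambda: None, n)      # loop overhead only
    t_cut_loop = time_loop(code_under_test, n)    # overhead plus the C.U.T.
    return per_iteration_time(t_cut_loop, t_null_loop, n)
```

A call such as `benchmark(lambda: 2.0 ** 0.5)` returns an estimated time per iteration in seconds. The estimate is only as good as the assumption that both loops carry the same overhead.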
Performing Benchmarks
One effect we notice is that sometimes a benchmark suite reports slower times for code even though we know we have improved our optimizations!
What’s happening?
  The Null Timing loops of benchmark suites attempt to defeat compiler optimizations that skew their results
  Compilers are better at getting rid of unnecessary code, often defeating the smart null loop
  So now the equation looks like: T_CUT = (T_CUT_loop - 0) / n
  So the remaining loop overhead time gets included in the time of the Code Under Test, making it look worse than before
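A worked example with made-up numbers shows how the reported time can get worse while the code gets better. All figures are hypothetical (nanoseconds per iteration), and `reported_time` is an illustrative helper, not part of any benchmark suite.

```python
def reported_time(t_code, t_overhead, n, null_loop_survives):
    """Per-iteration time a two-loop benchmark would print.

    t_code and t_overhead are true per-iteration costs. When the compiler
    removes the null loop entirely, the harness measures 0 for it.
    """
    t_cut_loop = n * (t_code + t_overhead)
    t_null_loop = n * t_overhead if null_loop_survives else 0
    return (t_cut_loop - t_null_loop) / n


n = 1_000_000
# Older compiler: code costs 10 ns, null loop is kept and measures 2 ns overhead.
old = reported_time(t_code=10, t_overhead=2, n=n, null_loop_survives=True)
# Newer compiler: code improved to 9 ns, but the null loop is optimized away.
new = reported_time(t_code=9, t_overhead=2, n=n, null_loop_survives=False)
print(old, new)  # 10.0 11.0
```

The code under test became faster (9 ns instead of 10 ns), yet the printed figure rose from 10 to 11 because the 2 ns of loop overhead is no longer subtracted.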
Performing Benchmarks
One other effect we observe is that benchmarks often don’t do anything with the results they calculate
Compilers can detect this and conclude that running the code has no effect and (very importantly) no side-effects
  Range propagation concludes that overflow cannot be raised
  Result is never used
  Code is thrown away
A good example is the Hennessy Benchmark in the PIWG suite
  Large matrix multiplications, using a range of values that will not result in overflow
  Apex 4.0 reports zero time for that test
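The "result is never used, code is thrown away" reasoning can be sketched as a toy dead-code elimination pass over straight-line code. This is a simplified illustration in Python, not the Apex algorithm: the instruction format and the name `eliminate_dead_code` are assumptions, and every instruction is assumed to be free of side effects (the role that range propagation plays above by ruling out overflow).

```python
def eliminate_dead_code(instructions, live_out):
    """Drop assignments whose results are never observed.

    instructions: list of (target, operands) pairs in program order.
    live_out: names that are used after the block (for example, printed).
    Assumes no instruction has side effects.
    """
    live = set(live_out)
    kept = []
    for target, operands in reversed(instructions):
        if target in live:
            live.discard(target)    # defined here, so not live above this point
            live.update(operands)   # its inputs become live
            kept.append((target, operands))
    kept.reverse()
    return kept


block = [
    ("a", ()),                         # fill matrix a
    ("b", ()),                         # fill matrix b
    ("c", ("a", "b")),                 # c = a * b, the code under test
    ("elapsed", ("t_start", "t_end")),
]
# Only the elapsed time is printed, so the whole multiplication disappears.
print(eliminate_dead_code(block, live_out={"elapsed"}))
# Consuming c (for example, printing a checksum) keeps the work alive.
print(eliminate_dead_code(block, live_out={"elapsed", "c"}))
```

The second call shows the usual remedy: a benchmark that wants to measure real work has to make its result observable.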
Performing Benchmarks
When trying to compare different compiler technologies you need to look beyond the results printed by a benchmark program
  Printed numbers can be very misleading
  Look at absolute times and iteration counts
  Benchmarks don’t translate well between processor variants and processor types
The best benchmark is your application
  Or a sizable portion of it