Rational Apex 4.0 Optimization


“Beware the benchmark!”

Do not use gradient or transparent fills for slides to be used on PlaceWare.com

Presentation Outline

Outline Rational Apex optimization behaviour
Demonstrate some of the optimization techniques being used by modern compilers
Show how these techniques defeat many of the assumptions made by traditional benchmarking suites


Rational Apex Optimization

Optimization with Apex has 3 levels, controlled by the OPTIMIZATION_LEVEL switch
  Level 0 – No optimization, maximize debuggability
    • This is the default
  Level 1 – Many optimizations performed, some debuggability maintained
  Level 2 – All optimizations performed, debugging may be very limited in some code
Optimization with Apex can have one of two objectives
  Time – try to generate code that will execute in minimal time
  Space – try to generate code that is as compact as possible
  These two objectives are not mutually exclusive!


Rational Apex Optimization

Apex performs optimization in several different places
  Front End – post semantics
    • Common sub-expression elimination
    • Code in-lining
    • Loop unrolling
    • Remove unused code from local scope
  Machine independent instruction stream optimizer “optim”
    • Loop invariant hoisting
    • Range propagation
    • Constraint check elimination
    • Reduce memory movement
  Machine specific code generator
    • Peep-hole optimization
All optimization consumes extra CPU during compilation
  The default is off – OPTIMIZATION_LEVEL: 0


Example Code – Summation of SQRT

Simple routine that sums up square roots and prints the result


Example Code – Body of G_E_F.Sqrt

Example Code – Body of G_E_F.Hardware

Optimization Level 0

No inlining, no code elimination, no check elimination
Disassembly of sum_sqrt.2.ada is 15845 lines long
No unused code has been eliminated – all the code for generic_elementary_functions remains


Optimization Level 0 – Disassembly of “for” loop

Optimization Level 0 – Disassembly of sqrt

163 lines of assembly
Slightly abridged


Optimization Level 0 – Disassembly of hardware

56 lines of disassembly for SQRT
10 instructions for SQRT_32


Optimization Level 0 – Summary

Total of over 220 instructions generated for the code that we are interested in
  Lots of it will be unused
  Not to mention the rest of the code for the instantiation
Code maps back to source easily
  Code layout follows source
Lots of overhead for this straightforward code
  Subprogram prolog/epilog code
    • Stack checks
    • Register management
  Subprogram call/return code (3 levels deep)
  No delayed branch slots being filled


Optimization Level 2 – Disassembly of “for” loop

Optimization Level 2 – Observations

Disassembly of sum_sqrt.2.ada is 85 lines long
Entire loop and all the called subprogram code is now 12 instructions long
  5 instructions for “for” loop management
    • Includes 2 instructions for branching
  4 instructions for integer to float conversion
    • 2 are identical, as one copy is used to fill a delayed branch slot at the bottom of the loop
  1 instruction for the Text_Io code is used to fill a branch delay slot
  2 instructions to perform the actual Sqrt and summation


Optimization Level 2 – Observations

The optimization objective was Time
  Time is certainly optimized, but Space also benefited enormously
Different optimization techniques combined effectively to produce very effective code
  Inlining of 3 levels of subprogram call eliminated a significant amount of subprogram prolog/epilog
  Range propagation determined that the argument to SQRT could never be less than zero, which allowed the argument check to be removed
  Evaluation of compile-time static expressions resulted in a lot of code not being generated
    • Kind of floating point type – no case statement needed
    • Availability of Hardware SQRT – no call needed to Has_Sqrt
  Register lifetime analysis on the resulting code meant that the loop control variable and the summation variable could live in registers
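
The range propagation observation above can be modelled in a few lines. This is not how Apex implements it; it is a toy interval analysis in Python, and the names `Interval`, `int_to_float` and `check_is_redundant` are hypothetical. The loop bounds 1 .. 1000 are also an assumption made for illustration. The sketch shows why a "raise if the argument is negative" check in front of Sqrt can be dropped once the argument is known to come from a loop index with non-negative bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] of values a variable can take."""
    lo: float
    hi: float


def int_to_float(i: Interval) -> Interval:
    # Integer-to-float conversion is monotonic, so the bounds carry over.
    return Interval(float(i.lo), float(i.hi))


def check_is_redundant(arg: Interval, minimum: float = 0.0) -> bool:
    # A check "raise an error if arg < minimum" can never fire when the
    # whole interval lies at or above the minimum.
    return arg.lo >= minimum


# Assumed loop shape: for I in 1 .. 1000 loop ... Sqrt(Float(I)) ...
loop_index = Interval(1, 1000)
sqrt_argument = int_to_float(loop_index)

# The negative-argument check is provably dead for the loop index ...
print(check_is_redundant(sqrt_argument))
# ... but must stay when the range may dip below zero.
print(check_is_redundant(Interval(-5, 10)))
```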


Performing Benchmarks

Benchmarks usually consist of two distinct loops
  A “Null Timing” loop to determine the overhead of the loop code itself
  The Code Under Test loop, which has the same structure as the Null Timing loop with the inside of the loop replaced with the C.U.T.
Timing equation looks like
  T_CUT = (T_CUT_loop – T_null_loop) / n
    • Where n is the number of iterations
    • Usually n has to be very high so that the resolution of the system clock is not significant in the result
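
The two-loop structure and the timing equation above might be sketched as follows. This is an illustrative harness in Python, not code from any benchmark suite; `time_loop`, `null_body`, `code_under_test` and `per_iteration_time` are hypothetical names, and `math.sqrt(2.0)` is only a placeholder workload.

```python
import math
import time


def per_iteration_time(t_cut_loop: float, t_null_loop: float, n: int) -> float:
    """T_CUT = (T_CUT_loop - T_null_loop) / n"""
    return (t_cut_loop - t_null_loop) / n


def time_loop(body, n: int) -> float:
    """Return the wall-clock seconds taken to call body() n times."""
    start = time.perf_counter()
    for _ in range(n):
        body()
    return time.perf_counter() - start


def null_body() -> None:
    pass  # stands in for the empty "null timing" loop body


def code_under_test() -> float:
    return math.sqrt(2.0)  # placeholder workload


n = 100_000  # large enough that clock resolution is not significant
t_null = time_loop(null_body, n)
t_cut = time_loop(code_under_test, n)
print(f"estimated time per call: {per_iteration_time(t_cut, t_null, n):.3e} s")
```

The estimate is only as good as the assumption that both loops carry the same overhead, which is the assumption an optimizing compiler can break.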



One effect we notice is that sometimes a benchmark suite reports slower times for code even though we know we have improved our optimizations!
What’s happening?
  The Null Timing loops of benchmark suites attempt to defeat compiler optimizations that skew their results
  Compilers are better at getting rid of unnecessary code, often defeating the smart null loop
  So now the equation looks like: T_CUT = (T_CUT_loop – 0) / n
  So the remaining loop overhead time gets included in the time of the Code Under Test, making it look worse than before
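
A small worked example makes the skew concrete. All numbers below are invented for illustration (they are not Apex measurements): the older compiler needs 10 ns for the code under test and keeps the null loop, while the newer compiler needs only 9 ns but eliminates the null loop entirely. Loop overhead is assumed to be 2 ns per iteration in both cases, and `reported_ns` is a hypothetical helper.

```python
def reported_ns(cut_ns: int, overhead_ns: int, n: int,
                null_loop_survives: bool) -> float:
    """Per-iteration time a benchmark would print, in nanoseconds."""
    t_cut_loop = n * (cut_ns + overhead_ns)
    # If the compiler removes the null loop, it measures as zero time.
    t_null_loop = n * overhead_ns if null_loop_survives else 0
    return (t_cut_loop - t_null_loop) / n


N = 1_000_000
old = reported_ns(cut_ns=10, overhead_ns=2, n=N, null_loop_survives=True)
new = reported_ns(cut_ns=9, overhead_ns=2, n=N, null_loop_survives=False)

print(old, new)  # the faster code is reported as the slower one
```

With these assumed figures the benchmark reports 10 ns for the old compiler and 11 ns for the new one, even though the code under test became 1 ns faster.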



One other effect we observe is that benchmarks often don’t do anything with the results they calculate
Compilers can detect this and conclude that running the code has no effect and (very importantly) no side-effects
  Range propagation concludes that overflow cannot be raised
  Result is never used
  Code is thrown away
A good example is the Hennessy benchmark in the PIWG suite
  Large matrix multiplications, using a range of values that will not result in overflow
  Apex 4.0 reports zero time for that test
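
One way a benchmark can avoid this, offered here as an assumption rather than something the slides prescribe, is to consume what it computes, for example by printing a checksum of the result. The Python sketch below only illustrates the shape of such a test; CPython does not itself discard unused computations the way an optimizing Ada compiler can, and `matmul` and `checksum` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def checksum(m):
    """Reduce a matrix to one value that depends on every element."""
    return sum(sum(row) for row in m)


size = 4
a = [[i + j for j in range(size)] for i in range(size)]
identity = [[1 if i == j else 0 for j in range(size)] for i in range(size)]

product = matmul(a, identity)

# Printing the checksum makes the result observable, so the
# multiplication can no longer be proven to have no effect.
print("checksum:", checksum(product))
```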



When trying to compare different compiler technologies you need to look beyond the results printed by a benchmark program
  Printed numbers can be very misleading
  Look at absolute times and iteration counts
  Benchmarks don’t translate well between processor variants and processor types
The best benchmark is your application
  Or a sizable portion of it
