Post on 03-Feb-2022
Groupe de travail PEQUAN (LIP6, UPMC)Paris, France, July 7, 2011
Automatic Generation of Fast and Certified Codefor Polynomial Evaluation
Guillaume RevyDALI project-team
Universite de Perpignan Via DomitiaLIRMM (CNRS: UMR 5506 - UM2)
Joint work with Christophe Mouilleron (LIP, ENS Lyon)
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 1/27
Motivation
Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost
Some embedded systems do not have any FPU (floating-point unit)
Highly used in audio and video applicationsI demanding on floating-point computations
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 2/27
Motivation
Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost
Some embedded systems do not have any FPU (floating-point unit)
Embedded systems
No FPU
Highly used in audio and video applicationsI demanding on floating-point computations
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 2/27
Motivation
Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost
Some embedded systems do not have any FPU (floating-point unit)
Applications
FP computations
Embedded systems
No FPU
Highly used in audio and video applicationsI demanding on floating-point computations
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 2/27
Motivation
Embedded systems are ubiquitousI microprocessors dedicated to one or a few specific tasksI satisfy constraints: area, energy consumption, conception cost
Some embedded systems do not have any FPU (floating-point unit)
floating-point arithmeticSoftware implementing
Applications
FP computations
Embedded systems
No FPU
Highly used in audio and video applicationsI demanding on floating-point computations
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 2/27
On the one side: the IEEE 754-2008 standard, ...
Definition of IEEE floating-point arithmetic
I floating-point formats: single precision, double precision, ...I special values: ±0, ±∞, NaNI 4 rounding modes: to nearest even, upward, downward, and toward zeroI mathematical function behavior
→ special input (ex:√−0 =−0)
→ requires / recommends correct rounding
Motivation:
I make computations reproducibleI and make results architecture-independent
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 3/27
... on the other side: the ST231 processor
ICache
DTLB
Mul
Registerfile (64
registers8 read
4 write)
Load
(LSU)
IU IU IU IU
Trapcontroller
4 x SDI
STBus
SDI ports
61 interrupts Debuglink
Peripherals
DebugTimers
3 xcontroller support unit 32-bit
I-sidememorysubsystem
Interrupt
registerPC andbranch
unit
Branch
file
DCache
bufferWrite
Prefetchbuffer
SCU
CMC
STBus
64-bit
registersControl
UTLBMul
D-sidememorysubsystem
StoreUnit
ITLB
Instructionbuffer
ST231 core 4-issue VLIW 32-bit integer processor
→ no FPU
Parallel execution unit
I 4 integer ALUsI 2 pipelined multipliers 32 × 32→ 32
Latencies: ALU = 1 cycle / Mul = 3 cycles
VLIW (Very Long Instruction Word)
→ instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 4/27
... on the other side: the ST231 processor
ICache
DTLB
Mul
Registerfile (64
registers8 read
4 write)
Load
(LSU)
IU IU IU IU
Trapcontroller
4 x SDI
STBus
SDI ports
61 interrupts Debuglink
Peripherals
DebugTimers
3 xcontroller support unit 32-bit
I-sidememorysubsystem
Interrupt
registerPC andbranch
unit
Branch
file
DCache
bufferWrite
Prefetchbuffer
SCU
CMC
STBus
64-bit
registersControl
UTLBMul
D-sidememorysubsystem
StoreUnit
ITLB
Instructionbuffer
ST231 core 4-issue VLIW 32-bit integer processor
→ no FPU
Parallel execution unit
I 4 integer ALUsI 2 pipelined multipliers 32 × 32→ 32
Latencies: ALU = 1 cycle / Mul = 3 cycles
VLIW (Very Long Instruction Word)
→ instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 4/27
Towards the generation of fast and certified codes
This work takes mainly part in the context of the development of FLIP
↪→ software support for binary32 floating-point arithmetic on integer processors
Underlying problem: development “by hand”
I long and tedious, error proneI new target ? new floating-point format ?
Current challenge: tools and methodologies for the automatic generationof efficient and certified programs
I optimized for a given format, for the target architecture
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 5/27
Towards the generation of fast and certified codes
This work takes mainly part in the context of the development of FLIP
↪→ software support for binary32 floating-point arithmetic on integer processors
Underlying problem: development “by hand”
I long and tedious, error proneI new target ? new floating-point format ?
}automation and certification
Current challenge: tools and methodologies for the automatic generationof efficient and certified programs
I optimized for a given format, for the target architecture
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 5/27
Towards the generation of fast and certified codes
This work takes mainly part in the context of the development of FLIP
↪→ software support for binary32 floating-point arithmetic on integer processors
Underlying problem: development “by hand”
I long and tedious, error proneI new target ? new floating-point format ?
}automation and certification
Current challenge: tools and methodologies for the automatic generationof efficient and certified programs
I optimized for a given format, for the target architecture
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 5/27
Towards the generation of fast and certified codes
Some works on code generation and transformation:
↪→ code generators: hardware (FloPoCo) and software (Sollya, Metalibm)
↪→ code transformation for increasing numerical accuracy [Martel, 2009]
LEMA project [Lefevre et al., 2010]: language and library↪→ design easily a generation toolchain
Spiral project: hardware and software code generation for DSP algorithms
Can we teach computers to write fast libraries?
Our tool: CGPE (Code Generation for Polynomial Evaluation)
In the particular case of polynomial evaluation, can we teach computers to write fastand certified codes, for a given target and optimized for a given format?
↪→ adding a systematic certification step
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 6/27
Towards the generation of fast and certified codes
Some works on code generation and transformation:
↪→ code generators: hardware (FloPoCo) and software (Sollya, Metalibm)
↪→ code transformation for increasing numerical accuracy [Martel, 2009]
LEMA project [Lefevre et al., 2010]: language and library↪→ design easily a generation toolchain
Spiral project: hardware and software code generation for DSP algorithms
Can we teach computers to write fast libraries?
Our tool: CGPE (Code Generation for Polynomial Evaluation)
In the particular case of polynomial evaluation, can we teach computers to write fastand certified codes, for a given target and optimized for a given format?
↪→ adding a systematic certification step
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 6/27
Towards the generation of fast and certified codes
Some works on code generation and transformation:
↪→ code generators: hardware (FloPoCo) and software (Sollya, Metalibm)
↪→ code transformation for increasing numerical accuracy [Martel, 2009]
LEMA project [Lefevre et al., 2010]: language and library↪→ design easily a generation toolchain
Spiral project: hardware and software code generation for DSP algorithms
Can we teach computers to write fast libraries?
Our tool: CGPE (Code Generation for Polynomial Evaluation)
In the particular case of polynomial evaluation, can we teach computers to write fastand certified codes, for a given target and optimized for a given format?
↪→ adding a systematic certification step
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 6/27
Basic blocks for implementing correctly-rounded operators
no
Floating-point number unpacking
Normalization
Result significand approximation
Rounding condition decision
Correct rounding computation
Result reconstruction
Range reduction
Result sign/exponentcomputation
Special output selection
yes
Special input detection
R {±0, ±∞, NaN}
X
function independent
function dependent
Objectives
→ Low latency, correctly-roundedimplementations
→ ILP exposure
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 7/27
Basic blocks for implementing correctly-rounded operators
no
Floating-point number unpacking
Normalization
Result significand approximation
Rounding condition decision
Correct rounding computation
Result reconstruction
Range reduction
Result sign/exponentcomputation
Special output selection
yes
Special input detection
R {±0, ±∞, NaN}
X
yes
Special input detection
Special output selection
{±0, ±∞, NaN}
automatedFully
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximant
Uniform approach for nth rootsand their reciprocals→ polynomial evaluation
Extension to division
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 7/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximant
Constraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximantConstraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximantConstraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximantConstraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximantConstraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Flowchart for generating efficient and certified C codes
generation
Problem: function to be evaluated
Fast and certified C code
ST231 features
C code Certificate
Computation of polynomial approximant
CGPECGPE
Constraints
Accuracy of approximant andC code
I Sollya
I interval arithmetic (MPFI),Gappa
Low evaluation latency onST231, ILP exposure
I ?
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 8/27
Outline of the talk
1. Background on polynomial evaluation
2. The CGPE tool
3. Experimental results
4. Conclusion
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 9/27
Background on polynomial evaluation
Outline of the talk
1. Background on polynomial evaluation
2. The CGPE tool
3. Experimental results
4. Conclusion
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 10/27
Background on polynomial evaluation
Our objective
Compute fast and certified schemes for evaluating a polynomialP(x ,y) = α+ y ·a(x)
→ using only additions and multiplications
→ reducing the evaluation latency on unbounded parallelism
Evaluation program = main part of the full software implementation
→ dominates the cost
↪→ make it as efficient as possible
Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and
Stockmeyer (1973), ...→ ill-suited in the context of fixed-point arithmetic
I algorithms without coefficient adaptation
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 11/27
Background on polynomial evaluation
Our objective
Compute fast and certified schemes for evaluating a polynomialP(x ,y) = α+ y ·a(x)
→ using only additions and multiplications
→ reducing the evaluation latency on unbounded parallelism
Evaluation program = main part of the full software implementation
→ dominates the cost
↪→ make it as efficient as possible
Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and
Stockmeyer (1973), ...→ ill-suited in the context of fixed-point arithmetic
I algorithms without coefficient adaptation
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 11/27
Background on polynomial evaluation
Our objective
Compute fast and certified schemes for evaluating a polynomialP(x ,y) = α+ y ·a(x)
→ using only additions and multiplications
→ reducing the evaluation latency on unbounded parallelism
Evaluation program = main part of the full software implementation
→ dominates the cost
↪→ make it as efficient as possible
Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and
Stockmeyer (1973), ...→ ill-suited in the context of fixed-point arithmetic
I algorithms without coefficient adaptation
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 11/27
Background on polynomial evaluation
Our objective
Compute fast and certified schemes for evaluating a polynomialP(x ,y) = α+ y ·a(x)
→ using only additions and multiplications
→ reducing the evaluation latency on unbounded parallelism
Evaluation program = main part of the full software implementation
→ dominates the cost
↪→ make it as efficient as possible
Two families of algorithmsI algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and
Stockmeyer (1973), ...→ ill-suited in the context of fixed-point arithmetic
I algorithms without coefficient adaptation
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 11/27
Background on polynomial evaluation
Classical evaluation schemes
Naive evaluation
↪→ 3 additions5 multiplications
↪→ latency: 12 cycles
+
a0 +
×
a1 x
+
×
a2 ×
x x
×
a3 ×
x ×
x x
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 12/27
Background on polynomial evaluation
Classical evaluation schemes
Horner’s rule
↪→ 3 additions3 multiplications
↪→ latency: 12 cycles
⊕ optimal in terms of multiplication number[Pan, 1966], [Borodin, 1971],
fully sequential
+
a0 ×
x +
a1 ×
x +
a2 ×
x a3
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 12/27
Background on polynomial evaluation
Classical evaluation schemes
Second-order Horner’s rule
↪→ 3 additions4 multiplications
↪→ latency: 11 cycles
⊕ some ILP exposure
subparts evaluated in a fully sequentialway: at most 2 ways used
+
+
a0 ×
×
x x
a2
×
x +
a1 ×
×
x x
a3
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 12/27
Background on polynomial evaluation
Classical evaluation schemes
Estrin’s rule
↪→ 3 additions4 multiplications
↪→ latency: 8 cycles
⊕ “divide and conquer” strategy
⊕ more ILP exposure
+
×
×
x x
+
a2 ×
x a3
+
a0 ×
x a1
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 12/27
Background on polynomial evaluation
Definition of evaluation schemes
Mathematical expression a0 +a1 · x +a2 · x2 1yCommon parenthesizations (a0 +(a1 · x))+(a2 · (x · x)) 1
y commutativityassociativitydistributivity
Feasible parenthesizations [. . . ] 160y + and × are only commutativein fixed-point arithmetic
Evaluation schemes(a0 +(a1 ·x))+(a2 ·(x ·x)) (a0 +(a1 ·x))+((a2 ·x)·x))
(a0 +(a2 ·(x ·x)))+(a1 ·x) (a0 +((a2 ·x)·x))+(a1 ·x)
a0 +((a1 ·x)+(a2 ·(x ·x))) a0 +((a1 ·x)+((a2 ·x)·x)))
a0 +(a1 +(a2 ·x))·x
7
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 13/27
Background on polynomial evaluation
Definition of evaluation schemes
Mathematical expression a0 +a1 · x +a2 · x2 1yCommon parenthesizations (a0 +(a1 · x))+(a2 · (x · x)) 1
y commutativityassociativitydistributivity
Feasible parenthesizations (a0 +(a1 · x))+(a2 · (x · x)) 160
(a0 +(a1 · x))+((a2 · x) · x)
a0 +((a1 · x))+((a2 · x) · x))
a0 +((a1 +(a2 · x)) · x)
((a1 +(a2 · x)) · x)+a0 . . .
Feasible parenthesizations [. . . ] 160y + and × are only commutativein fixed-point arithmetic
Evaluation schemes(a0 +(a1 ·x))+(a2 ·(x ·x)) (a0 +(a1 ·x))+((a2 ·x)·x))
(a0 +(a2 ·(x ·x)))+(a1 ·x) (a0 +((a2 ·x)·x))+(a1 ·x)
a0 +((a1 ·x)+(a2 ·(x ·x))) a0 +((a1 ·x)+((a2 ·x)·x)))
a0 +(a1 +(a2 ·x))·x
7
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 13/27
Background on polynomial evaluation
Definition of evaluation schemes
Mathematical expression a0 +a1 · x +a2 · x2 1yCommon parenthesizations (a0 +(a1 · x))+(a2 · (x · x)) 1
y commutativityassociativitydistributivity
Feasible parenthesizations [. . . ] 160y + and × are only commutativein fixed-point arithmetic
Evaluation schemes(a0 +(a1 ·x))+(a2 ·(x ·x)) (a0 +(a1 ·x))+((a2 ·x)·x))
(a0 +(a2 ·(x ·x)))+(a1 ·x) (a0 +((a2 ·x)·x))+(a1 ·x)
a0 +((a1 ·x)+(a2 ·(x ·x))) a0 +((a1 ·x)+((a2 ·x)·x)))
a0 +(a1 +(a2 ·x))·x
7
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 13/27
Background on polynomial evaluation
Remarks on polynomial evaluation
There are several other schemes for evaluating a polynomial a(x)
I can be adapted for bivariate polynomial P(x ,y) = α+ y ·a(x)
Constant number of +, while number of × is non-constant
I reducing the latency⇔ increasing the number of × to expose ILPI trade-off latency / number of multiplications
Evaluation error
I different theoretical error boundsI difference between numerical quality in practice [Revy, 2006]
We need a tool for exploring the space of evaluation schemes.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 14/27
Background on polynomial evaluation
Remarks on polynomial evaluation
There are several other schemes for evaluating a polynomial a(x)
I can be adapted for bivariate polynomial P(x ,y) = α+ y ·a(x)
Constant number of +, while number of × is non-constant
I reducing the latency⇔ increasing the number of × to expose ILPI trade-off latency / number of multiplications
Evaluation error
I different theoretical error boundsI difference between numerical quality in practice [Revy, 2006]
We need a tool for exploring the space of evaluation schemes.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 14/27
Background on polynomial evaluation
Remarks on polynomial evaluation
There are several other schemes for evaluating a polynomial a(x)
I can be adapted for bivariate polynomial P(x ,y) = α+ y ·a(x)
Constant number of +, while number of × is non-constant
I reducing the latency⇔ increasing the number of × to expose ILPI trade-off latency / number of multiplications
Evaluation error
I different theoretical error boundsI difference between numerical quality in practice [Revy, 2006]
We need a tool for exploring the space of evaluation schemes.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 14/27
Background on polynomial evaluation
How many schemes for evaluating a polynomial?n µn → a(x) µ′n → α+ y ·a(x)
wn (2n−1)!!
1 1 10
1 1
2 7 481
1 3
3 163 88384
1 15
4 11602 57363910
2 105
5 2334244 122657263474
3 945
6 1304066578 829129658616013
6 10395
7 1972869433837 17125741272619781635
11 135135
8 8012682343669366 1055157310305502607244946
23 2027025
9 86298937651093314877 190070917121184028045719056344
46 34459425
10 2449381767217281163362301 98543690848554380947490522591191672
98 654729075
Two well-known special casesI the number of evaluation schemes for xn [Wedderburn, Etherington]
wn ∼ηξn
n3/2or
{ξ≈ 2.48325
η≈ 0.31877[Otter, 1948],
I the number of evaluation schemes forn
∑i=1
ai est (2n−1)!!∼√
2(
2ne
)n.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 15/27
Background on polynomial evaluation
How many schemes for evaluating a polynomial?n µn → a(x) µ′n → α+ y ·a(x) wn
(2n−1)!!
1 1 10 1
1
2 7 481 1
3
3 163 88384 1
15
4 11602 57363910 2
105
5 2334244 122657263474 3
945
6 1304066578 829129658616013 6
10395
7 1972869433837 17125741272619781635 11
135135
8 8012682343669366 1055157310305502607244946 23
2027025
9 86298937651093314877 190070917121184028045719056344 46
34459425
10 2449381767217281163362301 98543690848554380947490522591191672 98
654729075
Two well-known special casesI the number of evaluation schemes for xn [Wedderburn, Etherington]
wn ∼ηξn
n3/2or
{ξ≈ 2.48325
η≈ 0.31877[Otter, 1948],
I the number of evaluation schemes forn
∑i=1
ai est (2n−1)!!∼√
2(
2ne
)n.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 15/27
Background on polynomial evaluation
How many schemes for evaluating a polynomial?n µn → a(x) µ′n → α+ y ·a(x) wn (2n−1)!!
1 1 10 1 1
2 7 481 1 3
3 163 88384 1 15
4 11602 57363910 2 105
5 2334244 122657263474 3 945
6 1304066578 829129658616013 6 10395
7 1972869433837 17125741272619781635 11 135135
8 8012682343669366 1055157310305502607244946 23 2027025
9 86298937651093314877 190070917121184028045719056344 46 34459425
10 2449381767217281163362301 98543690848554380947490522591191672 98 654729075
Two well-known special casesI the number of evaluation schemes for xn [Wedderburn, Etherington]
wn ∼ηξn
n3/2or
{ξ≈ 2.48325
η≈ 0.31877[Otter, 1948],
I the number of evaluation schemes forn
∑i=1
ai est (2n−1)!!∼√
2(
2ne
)n.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 15/27
Background on polynomial evaluation
Schemes of low evaluation latency
What is the latency of degree-5 evaluation schemes?
10
100
1000
10000
100000
1e+06
10 11 12 13 14 15 16 17 18 19 20
Num
ber
ofdeg
ree-
5ev
aluat
ion
schem
es
Latency on unbounded parallelism (# cycles)
Total number of schemes: 2334244
minimal latency for degree-5 univariate polynomial: 10 cycles
number of schemes of minimal latency: 36
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 16/27
The CGPE tool
Outline of the talk
1. Background on polynomial evaluation
2. The CGPE tool
3. Experimental results
4. Conclusion
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 17/27
The CGPE tool
Overview of CGPE and related works
Goal of CGPE [Mouilleron and Revy, 2011]: automate the design of fastand certified C codes for evaluating univariate/bivariate polynomials
I in fixed-point arithmeticI by using the target architecture features (as much as possible)
Remarks:
fast = that reduces the evaluation latency on a given target
certified = we can bound the error entailed by the evaluation within thegiven target’s arithmetic
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 18/27
The CGPE tool
Overview of CGPE and related works
Some related worksI [Cheung et al., 2005] and [Lee and Villasenor, 2009]: methodology for
implementing automatically mathematical function in a given precision
based on small degree polynomial evaluation using Horner’s→ no ILP
I [Harrison et al., 1999]: method for generating optimal evaluation scheme toevaluate univariate polynomials on Itanium R© using fma
ST231 has only addition and multiplication, but no fma
I [Green, 2002]: brute force method for generating polynomial evaluationschemes using at best SIMD instructions of the processor ofPlayStation R© 2
objective: generation at compile-time→ brute force method is unfeasible
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 19/27
The CGPE tool
Global architecture of CGPE
Input of CGPE
1. polynomial coefficients and variables: value intervals, fixed-point format, ...
2. set of criteria: maximum error bound and bound on latency (or the lowest)
3. delay of one of the variable
4. some architectural constraints: operator cost, parallelism, ...
CGPE works in two steps:
1. computation of evaluation schemes: reducing evaluation latency onunbounded parallelism and exposing as much ILP as possible
2. selection among the generated schemes, according to different criteria:
• evaluation using only unsigned fixed-point arithmetic
• scheduling feasible on ST231
• evaluation error bound satisfying the required error bound
At the end: CGPE automatically writes C codes with accuracy certificates
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 20/27
The CGPE tool
Global architecture of CGPE
Input of CGPE
1. polynomial coefficients and variables: value intervals, fixed-point format, ...
2. set of criteria: maximum error bound and bound on latency (or the lowest)
3. delay of one of the variable
4. some architectural constraints: operator cost, parallelism, ...
CGPE works in two steps:
1. computation of evaluation schemes: reducing evaluation latency onunbounded parallelism and exposing as much ILP as possible
2. selection among the generated schemes, according to different criteria:
• evaluation using only unsigned fixed-point arithmetic
• scheduling feasible on ST231
• evaluation error bound satisfying the required error bound
At the end: CGPE automatically writes C codes with accuracy certificates
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 20/27
The CGPE tool
Heuristics in DAG set computation
Determination of the minimal target latency on unbounded parallelism
I gives a good estimation of the best evaluation latency of the polynomial onthe target architecture
I takes some problem parameters (operator costs, delay, ...) into account
Non exhaustive computation of evaluation schemes
I elimination of the schemes that do not satisfy latency constraintI limitation to some splittings: evaluation of high and low parts separatelyI restriction to N schemes at each step of the computation
At the end of the computation: set of DAG evaluating the input polynomialand satisfying the latency constraint on unbounded parallelism.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 21/27
The CGPE tool
Heuristics in DAG set computation
Determination of the minimal target latency on unbounded parallelism
I gives a good estimation of the best evaluation latency of the polynomial onthe target architecture
I takes some problem parameters (operator costs, delay, ...) into account
Non exhaustive computation of evaluation schemes
I elimination of the schemes that do not satisfy latency constraintI limitation to some splittings: evaluation of high and low parts separatelyI restriction to N schemes at each step of the computation
At the end of the computation: set of DAG evaluating the input polynomialand satisfying the latency constraint on unbounded parallelism.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 21/27
The CGPE tool
Heuristics in DAG set computation
Determination of the minimal target latency on unbounded parallelism
I gives a good estimation of the best evaluation latency of the polynomial onthe target architecture
I takes some problem parameters (operator costs, delay, ...) into account
Non exhaustive computation of evaluation schemes
I elimination of the schemes that do not satisfy latency constraintI limitation to some splittings: evaluation of high and low parts separatelyI restriction to N schemes at each step of the computation
At the end of the computation: set of DAG evaluating the input polynomialand satisfying the latency constraint on unbounded parallelism.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 21/27
The CGPE tool
Filters for adding numerical constraints
1. Arithmetic operator choice
I ensure that all intermediate variables are of constant signI avoid an extra cost due to sign handling / gain 1 bit of accuracy
2. Scheduling on a simplified model of the target (like the ST231)
I constraints of architecture: cost of operators, instruction bundling, ...I delays on variables
3. Evaluation error bound checking
I straightline polynomial evaluation programI “C code certification” using Gappa we can bound the evaluation error in integer arithmetic
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 22/27
The CGPE tool
Filters for adding numerical constraints
1. Arithmetic operator choice
I ensure that all intermediate variables are of constant signI avoid an extra cost due to sign handling / gain 1 bit of accuracy
2. Scheduling on a simplified model of the target (like the ST231)
I constraints of architecture: cost of operators, instruction bundling, ...I delays on variables
3. Evaluation error bound checking
I straightline polynomial evaluation programI “C code certification” using Gappa we can bound the evaluation error in integer arithmetic
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 22/27
The CGPE tool
Filters for adding numerical constraints
1. Arithmetic operator choice
I ensure that all intermediate variables are of constant signI avoid an extra cost due to sign handling / gain 1 bit of accuracy
2. Scheduling on a simplified model of the target (like the ST231)
I constraints of architecture: cost of operators, instruction bundling, ...I delays on variables
3. Evaluation error bound checking
I straightline polynomial evaluation programI “C code certification” using Gappa we can bound the evaluation error in integer arithmetic
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 22/27
Experimental results
Outline of the talk
1. Background on polynomial evaluation
2. The CGPE tool
3. Experimental results
4. Conclusion
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 23/27
Experimental results
Timings for certified code generation
x1/2 x−1/2 x1/3 x−1/3 log2(1+ x) 1√1+x2
exp(1+x)1+x
Degree (dx ,dy ) (8,1) (9,1) (8,1) (9,1) (6,0) (7,0) (10,0)
Target / Minimal latency 13 / 13 13 / ? 16 / 16 16 / 16 10 / 11 10 / 11 13 / 13
Achieved latency 13 14 16 16 11 11 13
Scheme computation 195ms 73ms 26s 25s 17ms 10ms 40ms
[50] [50] [50] [50] [50] [50] [50]
Arithmetic operator choice 3ms 3ms 7ms 11ms 1ms 2ms 3ms
[35] [29] [30] [26] [2] [12] [27]
Scheduling checking 16s 1m33s 43ms 439ms 2ms 64ms 49s
[11] [1] [30] [24] [1] [5] [5]
Certification (Gappa) 10s 1s 27s 27s 230ms 1s 7s
[11] [1] [30] [24] [1] [5] [4]
Total time (≈) 27s 1m35s 55s 53s 1s 2s 57s
1. Impact of the target latency on the first step of the generation
2. What may dominate the cost: scheduling and certification using Gappa
3. Optimality of some generated codes, in terms of evaluation latency
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 24/27
Experimental results
Timings for certified code generation
x1/2 x−1/2 x1/3 x−1/3 log2(1+ x) 1√1+x2
exp(1+x)1+x
Degree (dx ,dy ) (8,1) (9,1) (8,1) (9,1) (6,0) (7,0) (10,0)
Target / Minimal latency 13 / 13 13 / ? 16 / 16 16 / 16 10 / 11 10 / 11 13 / 13
Achieved latency 13 14 16 16 11 11 13
Scheme computation 195ms 73ms 26s 25s 17ms 10ms 40ms
[50] [50] [50] [50] [50] [50] [50]
Arithmetic operator choice 3ms 3ms 7ms 11ms 1ms 2ms 3ms
[35] [29] [30] [26] [2] [12] [27]
Scheduling checking 16s 1m33s 43ms 439ms 2ms 64ms 49s
[11] [1] [30] [24] [1] [5] [5]
Certification (Gappa) 10s 1s 27s 27s 230ms 1s 7s
[11] [1] [30] [24] [1] [5] [4]
Total time (≈) 27s 1m35s 55s 53s 1s 2s 57s
1. Impact of the target latency on the first step of the generation
2. What may dominate the cost: scheduling and certification using Gappa
3. Optimality of some generated codes, in terms of evaluation latency
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 24/27
Experimental results
Timings for certified code generation
x1/2 x−1/2 x1/3 x−1/3 log2(1+ x) 1√1+x2
exp(1+x)1+x
Degree (dx ,dy ) (8,1) (9,1) (8,1) (9,1) (6,0) (7,0) (10,0)
Target / Minimal latency 13 / 13 13 / ? 16 / 16 16 / 16 10 / 11 10 / 11 13 / 13
Achieved latency 13 14 16 16 11 11 13
Scheme computation 195ms 73ms 26s 25s 17ms 10ms 40ms
[50] [50] [50] [50] [50] [50] [50]
Arithmetic operator choice 3ms 3ms 7ms 11ms 1ms 2ms 3ms
[35] [29] [30] [26] [2] [12] [27]
Scheduling checking 16s 1m33s 43ms 439ms 2ms 64ms 49s
[11] [1] [30] [24] [1] [5] [5]
Certification (Gappa) 10s 1s 27s 27s 230ms 1s 7s
[11] [1] [30] [24] [1] [5] [4]
Total time (≈) 27s 1m35s 55s 53s 1s 2s 57s
1. Impact of the target latency on the first step of the generation
2. What may dominate the cost: scheduling and certification using Gappa
3. Optimality of some generated codes, in terms of evaluation latency
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 24/27
Experimental results
Timings for certified code generation
x1/2 x−1/2 x1/3 x−1/3 log2(1+ x) 1√1+x2
exp(1+x)1+x
Degree (dx ,dy ) (8,1) (9,1) (8,1) (9,1) (6,0) (7,0) (10,0)
Target / Minimal latency 13 / 13 13 / ? 16 / 16 16 / 16 10 / 11 10 / 11 13 / 13
Achieved latency 13 14 16 16 11 11 13
Scheme computation 195ms 73ms 26s 25s 17ms 10ms 40ms
[50] [50] [50] [50] [50] [50] [50]
Arithmetic operator choice 3ms 3ms 7ms 11ms 1ms 2ms 3ms
[35] [29] [30] [26] [2] [12] [27]
Scheduling checking 16s 1m33s 43ms 439ms 2ms 64ms 49s
[11] [1] [30] [24] [1] [5] [5]
Certification (Gappa) 10s 1s 27s 27s 230ms 1s 7s
[11] [1] [30] [24] [1] [5] [4]
Total time (≈) 27s 1m35s 55s 53s 1s 2s 57s
1. Impact of the target latency on the first step of the generation
2. What may dominate the cost: scheduling and certification using Gappa
3. Optimality of some generated codes, in terms of evaluation latency
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 24/27
Conclusion
Outline of the talk
1. Background on polynomial evaluation
2. The CGPE tool
3. Experimental results
4. Conclusion
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 25/27
Conclusion
Conclusions
Code generation for fast and certified polynomial evaluationI in fixed-point arithmeticI methodologies and tools to automate polynomial evaluation implementationI heuristics and techniques for generating quickly fast and certified C codes
I implemented in the tool CGPE (Code Generation for Polynomial Evaluation)
http://cgpe.gforge.inria.fr/
Speed-up significantly the development time of mathematical library
I CGPE: allows to write and certify automatically ≈ 50 % of the codes of FLIP
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 26/27
Conclusion
Current work and perspectives
Current work
I precomputation in order to help the DAG set computation in choosing theappropriate splittings: the ones leading to DAGs with optimal latency onunbounded parallelism
I earlier DAG elimination, by checking accuracy during generation step
Perspectives
I extend CGPE to handle floating-point arithmetic,
I make CGPE more general to tackle other problems, like evaluation of apolynomial at a matrix point.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 27/27
Conclusion
Current work and perspectives
Current work
I precomputation in order to help the DAG set computation in choosing theappropriate splittings: the ones leading to DAGs with optimal latency onunbounded parallelism
I earlier DAG elimination, by checking accuracy during generation step
Perspectives
I extend CGPE to handle floating-point arithmetic,
I make CGPE more general to tackle other problems, like evaluation of apolynomial at a matrix point.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 27/27
Conclusion
Borodin, A. (1971).
Horner’s rule is uniquely optimal.
Theory of machines and computations, pages 45–58.
Cheung, R. C. C., Lee, D.-U., Mencer, O., Luk, W., and Cheung, P. Y. K. (2005).
Automating custom-precision function evaluation for embedded processors.
In CASES’05: Proceedings of the 2005 international conference on Compilers,architectures and synthesis for embedded systems, pages 22–31, New York, NY,USA. ACM.
Green, R. (2002).
Faster Math Functions.
Tutorial at Game Developers Conference.
Harrison, J., Kubaska, T., Story, S., and Tang, P. (1999).
The computation of transcendental functions on the IA-64 architecture.
Intel Technology Journal, 1999-Q4:1–7.
Lee, D.-U. and Villasenor, J. D. (2009).
Optimized Custom Precision Function Evaluation for Embedded Processors.Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 27/27
Conclusion
IEEE Transactions on Computers, 58(1):46–59.
Lefevre, V., Theveny, P., de Dinechin, F., Jeannerod, C.-P., Mouilleron, C.,Pfannholzer, D., and Revol, N. (2010).
LEMA: towards a language for reliable arithmetic.
ACM Communications in Computer Algebra, 44:41–52.
Martel, M. (2009).
Enhancing the Implementation of Mathematical Formulas for Fixed-Point andFloating-Point Arithmetics.
In Journal of Formal Methods in System Design, volume 35, pages 265–278.Springer.
Mouilleron, C. and Revy, G. (2011).
Automatic Generation of Fast and Certified Code for Polynomial Evaluation.
In Proc. of the 20th IEEE Symposium on Computer Arithmetic (ARITH’20),Tuebingen, Germany.
Otter, R. (1948).
The number of trees.
The Annals of Mathematics, 49(3):pp. 583–599.Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 27/27
Conclusion
Pan, V. Y. (1966).
Methods of Computing Values of Polynomials.
Russian Mathematical Surveys, 21(1):105–136.
Revy, G. (2006).
Analyse et implantation d’algorithmes rapides pour l’evaluation polynomiale surles nombres flottants.
Master’s thesis, Ecole normale superieure de Lyon, 46 allee d’Italie, F-69364Lyon cedex 07, France.
Guillaume Revy (Groupe de travail PEQUAN – July 7, 2011) Automatic Generation of Fast and Certified Code for Polynomial Evaluation 27/27