Simple Interval Calculation bi-linear modelling method.
SIC-method
Rodionova Oxana [email protected]
Semenov Institute of Chemical Physics RAS & Russian Chemometric Society
Stages of Multivariate Data Analysis
Experimental design (DOE)
1. minimize the total number of experiments
2. obtain as much “information” as possible
1. validation method
2. proper validation set
Validation
Prediction
accuracy of prediction ?
Modelling
Maximally informative model
Simple Interval Calculation (SIC)
gives the result of the prediction directly in an interval form
Simple:
1. a simple idea lies behind the method
2. well-known mathematical methods are used for its implementation
Exact (errorless) model: $y = Xa$
Inexact (real) model: $y = Xa + \varepsilon$
$X$ is an $n \times p$ matrix
$n$: samples; $p$: variables
Main Assumption of SIC-method
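$\mathrm{Prob}\{|\varepsilon| > \beta\} = 0$, $\beta > 0$
All errors are limited.
[Figure: an unbounded normal error distribution compared with finite distributions bounded by $-\beta$ and $+\beta$.]
The value $\beta$ is the Maximum Error Deviation (MED).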
The Region of Possible Values (RPV)
$A = \bigcap_{i=1}^{n} S(x_i, y_i)$
$A = \{a \in R^p : y - \beta \le Xa \le y + \beta\}$
[Figure: the RPV in the (a1, a2) parameter plane.]
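Let $(x_i, y_i)$, $i = 1, \dots, n$, be a calibration sample (an object).
Because every error is bounded by $\beta$, the measured response $y_i$ differs from its true value by at most $\beta$ (1), so the parameter vector $a$ must satisfy
$y_i - \beta \le x_i^t a \le y_i + \beta$ (2)
All vectors $a$ that agree with (2) form a strip $S(x_i, y_i) \subset R^p$.
$\beta$ is known.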
[Figure: strips in the (a1, a2) parameter plane, added one at a time: Strip 1, Strip 2, Strip 3.]
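The strip construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `in_strip` and `in_rpv` and the small data set are hypothetical, not part of the original slides.

```python
def in_strip(a, x, y, beta):
    """True if the parameter vector a lies in the strip |y - x^t a| <= beta."""
    fitted = sum(xj * aj for xj, aj in zip(x, a))
    return abs(y - fitted) <= beta


def in_rpv(a, X, y, beta):
    """True if a lies in every strip, i.e. in the intersection that forms the RPV."""
    return all(in_strip(a, xi, yi, beta) for xi, yi in zip(X, y))


# Hypothetical calibration data: three samples, two variables.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.2]
beta = 0.5

inside = in_rpv([1.1, 2.0], X, y, beta)
outside = in_rpv([2.0, 2.0], X, y, beta)
print(inside)   # True: all three strips contain (1.1, 2.0)
print(outside)  # False: the first strip excludes (2.0, 2.0)
```

Each added sample can only shrink the region, which is why the figure shows the polygon tightening as strips are added.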
The RPV A Properties
An example of an RPV (a heptagon) with vertices 1, 2, ..., 7
1. The region A is unbiased: $\mathrm{Prob}\{a \in A\} = 1$
2. The region A is consistent: $\mathrm{Prob}\{A \to a\} = 1$ as $n \to \infty$
3. The region A is limited if and only if $\operatorname{rank} X = p$
$y = TP^t a = Tq$
SIC Prediction
[Figure: SIC prediction for test samples 1 to 4, showing the V-prediction interval and the U-test interval of each sample.]
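Consider a response prediction for a vector $x$. If the parameter $a$ varies within the RPV $A$, the predicted value $v = x^t a$ belongs to the interval $V = [v^-, v^+]$, where
$v^- = \min_{a \in A} (x^t a)$
$v^+ = \max_{a \in A} (x^t a)$
The true response value belongs to $V$ with probability 1.
If the reference value $y$ for the predictor $x$ is known, we can consider another interval $U = [u^-, u^+]$, $u^- = y - \beta$, $u^+ = y + \beta$, which also contains the true response value with probability 1.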
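The minimum and maximum of $x^t a$ over the RPV are linear programming problems. As a minimal sketch for the two-parameter case only, the extremes can be found by enumerating the polygon's vertices, since a linear function on a bounded polygon attains its extremes at vertices. The names `rpv_vertices` and `sic_interval` and the data are hypothetical; for larger p a general LP solver would be the usual choice. The sketch assumes the RPV is non-empty and bounded (rank X = 2).

```python
from itertools import combinations


def rpv_vertices(X, y, beta, tol=1e-9):
    """Vertices of the 2-D RPV polygon {a : |y_i - x_i^t a| <= beta for all i}."""
    # Every strip has two boundary lines: x_i^t a = y_i - beta and x_i^t a = y_i + beta.
    lines = [(xi, yi + sign * beta) for xi, yi in zip(X, y) for sign in (-1.0, 1.0)]
    vertices = []
    for (p, c), (q, d) in combinations(lines, 2):
        det = p[0] * q[1] - p[1] * q[0]
        if abs(det) < tol:  # parallel boundary lines never intersect
            continue
        # Cramer's rule for the intersection of the two lines.
        a = ((c * q[1] - p[1] * d) / det, (p[0] * d - c * q[0]) / det)
        inside = all(abs(yi - (xi[0] * a[0] + xi[1] * a[1])) <= beta + tol
                     for xi, yi in zip(X, y))
        if inside:
            vertices.append(a)
    return vertices


def sic_interval(x, vertices):
    """Prediction interval [v-, v+]: extremes of x^t a over the RPV vertices."""
    values = [x[0] * a[0] + x[1] * a[1] for a in vertices]
    return min(values), max(values)


# Hypothetical calibration data and a new predictor vector.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.2]
beta = 0.5
vertices = rpv_vertices(X, y, beta)
v_minus, v_plus = sic_interval([2.0, 1.0], vertices)
print(v_minus, v_plus)  # approximately 3.2 and 5.2
```

Vertex enumeration costs on the order of n squared line intersections, each checked against n strips, which is acceptable for small calibration sets but is not how one would scale to many variables.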
What Can Go Wrong?
[Figures: response vs. test samples 1 to 4, comparing SIC intervals, PCR estimates and test values; three panels: prediction with 1 PC, 2 PCs and 3 PCs.]
“True” values lie outside of the prediction intervals
Prediction intervals are far less than test intervals
Very large prediction intervals
Quality of Prediction
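WIDTH: the (half-)width of the prediction interval, $d = (v^+ - v^-)/2$
INCLUDE: whether the reference value lies in the prediction interval, $t = 1$ if $y \in V$, $t = 0$ if $y \notin V$
SEPI: Standard Error of Interval Prediction, $s^2 = 0.5\,[(v^+ - u^+)^2 + (v^- - u^-)^2]$
OVERLAP: the fraction of the test interval that lies within the prediction interval, $g = \mathrm{Length}(U \cap V) / \mathrm{Length}(U)$
[Figure: the test interval $[u^-, u^+]$ of width $2\beta$ around $y$, and the prediction interval $[v^-, v^+]$ of width $2d$.]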
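The four characteristics are simple arithmetic on the two intervals. The sketch below assumes the intervals are given by their end points; the function names and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def width(v_lo, v_hi):
    """WIDTH: half-width d of the prediction interval V."""
    return (v_hi - v_lo) / 2.0


def include(y, v_lo, v_hi):
    """INCLUDE: 1 if the reference value y lies in V, otherwise 0."""
    return 1 if v_lo <= y <= v_hi else 0


def sepi(v_lo, v_hi, u_lo, u_hi):
    """SEPI: root of half the summed squared distances between matching interval ends."""
    return math.sqrt(0.5 * ((v_hi - u_hi) ** 2 + (v_lo - u_lo) ** 2))


def overlap(v_lo, v_hi, u_lo, u_hi):
    """OVERLAP: fraction of the test interval U that is covered by V."""
    common = max(0.0, min(v_hi, u_hi) - max(v_lo, u_lo))
    return common / (u_hi - u_lo)


# Hypothetical test sample: V = [2, 4], reference y = 3.5, beta = 1, so U = [2.5, 4.5].
v_lo, v_hi, y_ref, beta = 2.0, 4.0, 3.5, 1.0
u_lo, u_hi = y_ref - beta, y_ref + beta

print(width(v_lo, v_hi))              # 1.0
print(include(y_ref, v_lo, v_hi))     # 1
print(sepi(v_lo, v_hi, u_lo, u_hi))   # 0.5
print(overlap(v_lo, v_hi, u_lo, u_hi))  # 0.75
```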
[Figure: Mean Values. RMSEP, RMSEPI, Width and Error (left axis) and Include, Overlap (right axis) vs. number of PCs, 1 to 5.]
For a test set $x_1, \dots, x_k$, the mean characteristics for WIDTH, INCLUDE and OVERLAP are calculated as
$\bar z = \frac{1}{k} \sum_{i=1}^{k} z_i$
The mean characteristic for SEPI is calculated as
$\mathrm{RMSEPI} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} s_i^2}$
and is called RMSEPI, the Root Mean Square Error of Prediction Interval.
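A minimal sketch of the two averaging rules, with hypothetical per-sample values: WIDTH, INCLUDE and OVERLAP are averaged arithmetically, while SEPI is averaged in the root-mean-square sense.

```python
import math


def mean_characteristic(z):
    """Arithmetic mean, used for WIDTH, INCLUDE and OVERLAP over a test set."""
    return sum(z) / len(z)


def rmsepi(s):
    """Root mean square of the per-sample SEPI values."""
    return math.sqrt(sum(si ** 2 for si in s) / len(s))


# Hypothetical per-sample values for a test set of k = 2 samples.
widths = [1.0, 3.0]
sepi_values = [1.0, 7.0]

mean_width = mean_characteristic(widths)
rms = rmsepi(sepi_values)
print(mean_width)  # 2.0
print(rms)         # 5.0
```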
Unknown $\beta$. How to Find It?
The Region of Possible Values (RPV) $A(\beta)$ expands continuously as $\beta$ increases: $A(0) = \emptyset$, while $A(\beta) \ne \emptyset$ for sufficiently large $\beta$.
There exists a minimum $\beta$ such that $A(\beta) \ne \emptyset$. This minimum value may be taken as an estimator for the parameter $\beta$:
$\hat b = \min\{b : A(b) \ne \emptyset\}$
Unbiased estimator
$b_{SIC} = \frac{n+1}{n-1}\,\hat b$
$b_{SIC}$ distribution (uniform error)
$f(b) = \frac{n (n-1)^2}{(n+1)\,\beta} \left( \frac{n-1}{n+1} \frac{b}{\beta} \right)^{n-2} \left( 1 - \frac{n-1}{n+1} \frac{b}{\beta} \right)$, $0 \le b \le \frac{n+1}{n-1}\,\beta$
[Figure: distribution of the $b_{SIC}$ estimator for n = 11, experiment vs. theory.]
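The idea of taking the smallest error bound for which the RPV is non-empty can be sketched for the simplest case of a one-parameter model, y = a*x + error, where the RPV is just an interval for a. The sketch below finds that smallest bound by bisection. Everything here is an assumption for illustration: the function names, the data, and the use of (n + 1)/(n - 1) as the unbiasing factor, which is the standard correction for a half-range under uniform errors.

```python
def rpv_interval(x, y, b):
    """RPV for y = a*x + error with bound b: an interval (lo, hi), or None if empty."""
    lo, hi = float("-inf"), float("inf")
    for xi, yi in zip(x, y):
        if xi == 0:
            if abs(yi) > b:  # this sample can never be fitted within b
                return None
            continue
        ends = sorted(((yi - b) / xi, (yi + b) / xi))
        lo, hi = max(lo, ends[0]), min(hi, ends[1])
    return (lo, hi) if lo <= hi else None


def estimate_med(x, y, tol=1e-10):
    """Return (b_min, b_sic): the smallest b with a non-empty RPV and its corrected value."""
    # With b = max|y_i| the value a = 0 fits every sample, so the RPV is non-empty.
    low, high = 0.0, max(abs(yi) for yi in y)
    while high - low > tol:
        mid = 0.5 * (low + high)
        if rpv_interval(x, y, mid) is None:
            low = mid
        else:
            high = mid
    n = len(y)
    return high, high * (n + 1) / (n - 1)


# Hypothetical data: three samples, one variable.
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 3.0]
b_min, b_sic = estimate_med(x, y)
print(round(b_min, 6), round(b_sic, 6))  # 0.6 1.2
```

Bisection works here because the RPV only grows as b grows, so non-emptiness is a monotone property of b. For p > 1 the same minimum can be posed as a single linear program in (a, b).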
Octane Rating Example
[Figure: absorbance vs. wavelength, 1100 to 1550. Short training set (samples 1-24), long training set (samples 1-26); short test set (samples A-I), long test set (samples A-M).]
The X predictors are NIR measurements (absorbance spectra) over 226 wavelengths;
the Y response is the reference measurement of the octane number.
Training set =26 samples
Test set =13 samples
Spectral data
Geometrical shape of the RPV for number of PCs = 3, short training set ($\beta = 1$)
[Figure: PCR & SIC for the short training set, PCs = 3. Octane number vs. test samples A to M (left axis: samples A-I; right axis: samples J-M). Points are test values with error bars and PCR estimates, bars are SIC intervals, curves are the borders of PCR confidence intervals.]
Optimal Number of PCs
[Figure: Mean SIC characteristics for β = 1.0, short test set validation. MED, RMSEP, RMSEPI, Width, Include and Overlap vs. number of PCs, 1 to 7.]
Real-world example
Total number of samples (n) = 15
Number of variables (p) = 5
Calibration set = 11 samples
Test set = 4 samples
Prediction of antioxidant activity using DSC measurements
Samples   LTH      OIT (deg C) for heating rates (deg/min)
          (days)   2       5       10      15      20
Calibration set
C1        6        193     200     207.1   210.1   209.1
C2        1        173.6   179.2   181.7   190.9   193.2
C3        2        192.5   203.5   204.4   208.5   212.9
C4        18       194     197.7   209.7   212.8   202
C5        3        193.4   192.7   199.1   207.9   209.2
C6        15       194     197.7   209.7   212.8   205.3
C7        1.5      185.8   193.1   199     205.2   209.7
C8        2.5      185.8   193.1   199     205.2   207.1
C9        3        186     192.1   197     211.3   207
C10       3        186     192.1   197     211     208.2
C11       5        203     208.5   216.5   222.9   222
Test set
T1        0.5      185     191.7   197     197.2   211.2
T2        17       194     197.7   209.7   212.8   203.1
T3        8        186.8   191     208.2   205.1   205.1
T4        5        203.9   213.9   220.2   221.4   227.2
[Figure: Score Plot, PC2 vs. PC1, with calibration samples C1 to C11 and test samples T1 to T4.]
SIC Object Status Theory
$y = Xa + \varepsilon$
$y$ is $(n \times 1)$; $X$ is $(n \times p)$; $a$ is $(p \times 1)$
$y_i = x_i^t a + \varepsilon_i$, $i = 1, \dots, n$
$y_i$ is a point in the response space, $y_i \in R^1$
the vector $x_i$ is a point in the predictor space (or score space), $x_i \in R^p$
the vector $a$ is a point in the parameter space, $a \in R^p$
a pair $(x, y)$, a sample (object), is a point in $R^{p+1}$
Boundary Sample
[Figures: a) the RPV and its boundary samples; b) “prediction” of the calibration set, samples C1 to C11.]
A sample $(x_i, y_i)$ from the calibration set is called a boundary sample if there exists a parameter $a$ from the RPV such that $|y_i - x_i^t a| = \beta$.
Insiders, Outsiders, Outliers
A sample $(x, y)$ is called an “insider” if $A \subset S(x, y)$, i.e. $|y - x^t a| \le \beta$ for any $a$ from $A$.
A sample $(x, y)$ is called an “outlier” if $A \cap S(x, y) = \emptyset$. In other words, $|y - x^t a| > \beta$ for any parameter $a$ from $A$.
A sample $(x, y)$ is called an “outsider” if $A \not\subset S(x, y)$, i.e. $|y - x^t a| > \beta$ for some $a$ from $A$.
[Figure: insiders, boundary samples and prediction intervals, shown with the regression line, the regression 90% confidence interval and the ‘true’ model y = xa.]
SIC Object Status Map in $R^{p+1}$
[Figure: score plane, PC2 vs. PC1, showing calibration samples C1 to C11, boundary samples (from the calibration set), test samples T1 to T4, and the border of the region of absolute outsiders.]
A score (predictor) vector $x$ is called an absolute outsider if $A \not\subset S(x, y)$ for any $y$.
The Sample Status in the Response Space
[Figure: response vs. test samples 1 to 4, showing the prediction interval V(x) and the test interval U(y) of each sample.]
$V(x)$: prediction interval; $U(y)$: test interval
A sample $(x, y)$ is an insider if and only if $V(x) \subseteq U(y)$ (Sample 1)
If a test sample $(x, y)$ is an insider, then $d(x) \le \beta$ (Sample 1)
A sample $(x, y)$ is an outsider if and only if $V(x) \cap U(y) \ne V(x)$ (Samples 2-4)
A sample $(x, y)$ is an outlier if and only if $V(x) \cap U(y) = \emptyset$ (Sample 3)
A score vector $x$ is an absolute outsider if and only if there exists $y$ such that $U(y) \subset V(x)$ (Sample 4)
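These interval conditions translate directly into comparisons of end points. In the sketch below the function names and example numbers are hypothetical. Note that every outlier is also an outsider by definition, so the function returns the most specific label.

```python
def sample_status(v_lo, v_hi, y, beta):
    """Classify a sample from its prediction interval V and test interval U = [y-beta, y+beta]."""
    u_lo, u_hi = y - beta, y + beta
    if v_hi < u_lo or v_lo > u_hi:
        return "outlier"   # V and U do not intersect
    if u_lo <= v_lo and v_hi <= u_hi:
        return "insider"   # V lies entirely inside U
    return "outsider"      # V sticks out of U but still touches it


def is_absolute_outsider(v_lo, v_hi, beta):
    """True if V is wider than any test interval, so some U(y) fits strictly inside V."""
    return (v_hi - v_lo) / 2.0 > beta


# Hypothetical prediction interval V = [2, 3] with beta = 1.
status_a = sample_status(2.0, 3.0, 2.5, 1.0)  # U = [1.5, 3.5]
status_b = sample_status(2.0, 3.0, 3.8, 1.0)  # U = [2.8, 4.8]
status_c = sample_status(2.0, 3.0, 5.0, 1.0)  # U = [4.0, 6.0]
print(status_a, status_b, status_c)  # insider outsider outlier
print(is_absolute_outsider(0.0, 4.0, 1.0))  # True
```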
SIC– leverage / SIC–residual
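[Figure: the test interval of width $2\beta$ around $y$, with ends $u^-$ and $u^+$, and the prediction interval of width $2d$, with ends $v^-$ and $v^+$.]
SIC-leverage: $h(x) = d(x)/\beta$
SIC-residual: $r(x, y) = y - \frac{v^+(x) + v^-(x)}{2}$
MED-normalized SIC-residual: $r(x, y)/\beta$
Leverage: a measure of how far a data point lies from the majority of the data
Residual: a measure of the variation that is not taken into account by the model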
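Both quantities follow from the prediction interval alone, plus the reference value for the residual. A minimal sketch with hypothetical names and numbers; the sign convention of the residual (reference value minus interval midpoint) is an assumption here.

```python
def sic_leverage(v_lo, v_hi, beta):
    """SIC-leverage: half-width of the prediction interval in units of beta."""
    return (v_hi - v_lo) / (2.0 * beta)


def sic_residual(v_lo, v_hi, y):
    """SIC-residual: reference value minus the midpoint of the prediction interval."""
    return y - (v_hi + v_lo) / 2.0


def normalized_residual(v_lo, v_hi, y, beta):
    """MED-normalized SIC-residual: the residual in units of beta."""
    return sic_residual(v_lo, v_hi, y) / beta


# Hypothetical sample: V = [2, 3], reference value y = 3, beta = 1.
h = sic_leverage(2.0, 3.0, 1.0)
r = sic_residual(2.0, 3.0, 3.0)
r_norm = normalized_residual(2.0, 3.0, 3.0, 1.0)
print(h, r, r_norm)  # 0.5 0.5 0.5
```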
SIC Object Status Map
[Figure: influence plot, SIC-residual vs. SIC-leverage.]
A score vector $x$ is an absolute outsider if and only if $h(x) > 1$
A sample $(x, y)$ is an insider if and only if $|r(x, y)|/\beta \le 1 - h(x)$
A sample $(x_i, y_i)$ from the calibration set is a boundary sample if and only if $|r(x_i, y_i)|/\beta = 1 - h(x_i)$
A sample $(x, y)$ is an outlier if and only if $|r(x, y)|/\beta > 1 + h(x)$
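On the influence plot the status rules become regions bounded by straight lines in the leverage and residual coordinates. The sketch below classifies a point from its SIC-leverage and MED-normalized residual. The function names and the tolerance used for the boundary test are assumptions made for illustration.

```python
def status_from_influence(h, r_norm):
    """Classify a sample from SIC-leverage h and MED-normalized residual r_norm.

    Returns (status, absolute), where absolute flags an absolute outsider.
    """
    if abs(r_norm) > 1.0 + h:
        status = "outlier"
    elif abs(r_norm) <= 1.0 - h:
        status = "insider"
    else:
        status = "outsider"
    return status, h > 1.0


def is_boundary(h, r_norm, tol=1e-9):
    """Boundary test for a calibration sample: |r_norm| equals 1 - h within a tolerance."""
    return abs(abs(r_norm) - (1.0 - h)) <= tol


print(status_from_influence(0.5, 0.25))  # ('insider', False)
print(status_from_influence(0.5, 1.0))   # ('outsider', False)
print(status_from_influence(0.5, 2.0))   # ('outlier', False)
print(status_from_influence(1.5, 0.0))   # ('outsider', True)
print(is_boundary(0.5, -0.5))            # True
```

A point with h above 1 can never be an insider, because 1 - h is then negative, which is why the last example is an outsider whatever its reference value.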
[Figures: influence plots, SIC-residual vs. SIC-leverage, with numbered samples (1 to 11, then 1 to 15); panels for 2 PCs, 3 PCs and 4 PCs.]
The Main Features of the SIC-method
The SIC method
• gives the result of prediction directly in interval form
• calculates the prediction interval irrespective of the sample's position relative to the model
• summarizes and processes all errors involved in bi-linear modelling together and estimates the Maximum Error Deviation for the model
• provides wide possibilities for sample classification and outlier detection