Simple Interval Calculation bi-linear modelling method.

SIC-method

Rodionova Oxana [email protected]

Semenov Institute of Chemical Physics RAS & Russian Chemometric Society

Stages of Multivariate Data Analysis

Experimental design (DOE)

1. minimize the total number of experiments

2. obtain as much “information” as possible

Modelling

Maximally informative model

Validation

1. validation method

2. proper validation set

Prediction

accuracy of prediction?

Simple Interval Calculation (SIC)

Interval calculation: gives the result of the prediction directly in an interval form.

Simple:

1. a simple idea lies in the background

2. well-known mathematical methods are used for its implementation.

Exact (errorless) model and inexact (real) model

$y = Xa + \varepsilon$

X is an $n \times p$ matrix

n: samples; p: variables

Main Assumption of SIC-method

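There exists $\beta > 0$ such that $\mathrm{Prob}\{|\varepsilon| > \beta\} = 0$.

All errors are limited.

[Figure: normal distribution compared with finite distributions bounded by $-\beta$ and $+\beta$.]

The value $\beta$ is the Maximum Error Deviation (MED).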

The Region of Possible Values (RPV)

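$A = \{a \in R^p : y^- \le Xa \le y^+\}$, where $y^- = y - \beta$ and $y^+ = y + \beta$ (elementwise)

$A = \bigcap_{i=1}^{n} S(x_i, y_i)$

[Figure: the RPV in the parameter plane (a1, a2).]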

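Let $(x_i, y_i)$, $i = 1, \dots, n$, be a calibration sample (an object), and let $\tilde{y}_i$ denote its true (error-free) response. Then

$\tilde{y}_i - \beta \le y_i \le \tilde{y}_i + \beta$   (1)

$y_i - \beta \le x_i^t a \le y_i + \beta$   (2)

All vectors $a$ that agree with (2) form a strip $S(x_i, y_i) \subset R^p$.

$\beta$ is known.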
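As a minimal sketch of inequality (2), membership of a parameter vector in a single strip, and in the RPV taken as the set of vectors lying in every strip, can be checked directly. The helper names `in_strip` and `in_rpv` and the small data set are hypothetical and serve only as an illustration; the value of beta is assumed known.

```python
def in_strip(a, x, y, beta):
    """True if a satisfies y - beta <= x^t a <= y + beta (inequality (2))."""
    v = sum(xj * aj for xj, aj in zip(x, a))
    return y - beta <= v <= y + beta


def in_rpv(a, X, y, beta):
    """True if a lies in every strip S(x_i, y_i), i.e. in their intersection."""
    return all(in_strip(a, xi, yi, beta) for xi, yi in zip(X, y))


# Made-up calibration data with p = 2 parameters and n = 3 samples.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.2]
beta = 0.5

print(in_rpv([1.0, 2.0], X, y, beta))  # True: all three strips contain this vector
print(in_rpv([2.0, 2.0], X, y, beta))  # False: the first strip is violated
```

Each calibration sample adds one pair of inequalities, so adding samples can only shrink the region.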

[Figure: strips in the parameter plane (a1, a2), added one at a time: Strip1; Strip1 and Strip2; Strip1, Strip2 and Strip3.]

The RPV A Properties

[Figure: an example of RPV (heptagon) with vertices 1, 2, ..., 7.]

1. The region A is unbiased: $\mathrm{Prob}\{a \in A\} = 1$

2. The region A is consistent: $\mathrm{Prob}\{A \to a\} = 1$ as $n \to \infty$

3. The region A is limited if and only if rank X = p

$y = TP^t a = Tq$

SIC Prediction

[Figure: prediction intervals V and test intervals U for test samples 1-4.]

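Consider a response prediction for vector $x$. If parameter $a$ is changed within RPV $A$, the predicted value $v = x^t a$ belongs to the interval $V = [v^-, v^+]$, where

$v^- = \min_{a \in A} (x^t a)$, $v^+ = \max_{a \in A} (x^t a)$

If $\tilde{y}$ is the true response value, then $\mathrm{Prob}\{\tilde{y} \in V\} = 1$.

If the reference value $y$ for predictor $x$ is known, we can consider another interval $U = [u^-, u^+]$ with $\mathrm{Prob}\{\tilde{y} \in U\} = 1$, where $u^- = y - \beta$ and $u^+ = y + \beta$.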
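The prediction interval is the range of the linear function x^t a over the RPV. The sketch below illustrates this for the special case p = 2, assuming the RPV is bounded and non-empty, by enumerating the vertices of the polygon: a linear function on a bounded polygon attains its minimum and maximum at vertices. The function names and data are hypothetical, and for larger p a linear programming solver would be the usual tool rather than this enumeration.

```python
from itertools import combinations


def rpv_vertices(X, y, beta, tol=1e-9):
    """Vertices of the 2-D region {a : |y_i - x_i^t a| <= beta for all i}."""
    # Each strip contributes two boundary lines: x_i^t a = y_i - beta and y_i + beta.
    lines = [(xi, yi + s * beta) for xi, yi in zip(X, y) for s in (-1.0, 1.0)]
    vertices = []
    for (x1, c1), (x2, c2) in combinations(lines, 2):
        det = x1[0] * x2[1] - x1[1] * x2[0]
        if abs(det) < tol:  # parallel lines do not intersect
            continue
        # Cramer's rule for the 2 x 2 system x1^t a = c1, x2^t a = c2.
        a = ((c1 * x2[1] - c2 * x1[1]) / det,
             (x1[0] * c2 - x2[0] * c1) / det)
        # Keep the intersection only if it satisfies every strip.
        if all(abs(yi - (xi[0] * a[0] + xi[1] * a[1])) <= beta + tol
               for xi, yi in zip(X, y)):
            vertices.append(a)
    return vertices


def prediction_interval(x, X, y, beta):
    """Return (v_minus, v_plus), the range of x^t a over the vertices of the RPV."""
    values = [x[0] * a[0] + x[1] * a[1] for a in rpv_vertices(X, y, beta)]
    if not values:
        raise ValueError("RPV is empty for this beta")
    return min(values), max(values)


# Made-up calibration data (n = 3, p = 2) and an assumed known beta.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.2]
beta = 0.5

v_minus, v_plus = prediction_interval([1.0, -1.0], X, y, beta)
print(v_minus, v_plus)
```

For this toy data the RPV is a hexagon, and the interval for x = (1, -1) runs from about -2 to about 0.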


What Can Go Wrong?

[Figure: three panels, "Prediction with 1 PC", "Prediction with 2 PCs" and "Prediction with 3 PCs". Response versus Test Samples 1-4; legend: SIC, PCR, Test.]

“True” values lie outside of the prediction intervals

Prediction intervals are far less than test intervals

Very large prediction intervals

Quality of Prediction

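(Half) WIDTH of Prediction Interval:

$d = \frac{v^+ - v^-}{2}$

INCLUDE, whether a reference value lies in the Prediction Interval:

$t = 1$ if $y \in V$, $t = 0$ if $y \notin V$

SEPI, Standard Error of Interval Prediction:

$s^2 = 0.5\,[(v^+ - u^+)^2 + (v^- - u^-)^2]$

OVERLAP, a fraction of the Test interval within the Prediction interval:

$g = \frac{\mathrm{Length}(V \cap U)}{\mathrm{Length}(U)}$

[Figure: test interval $U = [u^-, u^+]$ of width $2\beta$ around $y$, and prediction interval $V = [v^-, v^+]$ of width $2d$.]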
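The four per-sample characteristics can be computed from the two intervals alone. The sketch below reads the slide's formulas as d = (v+ - v-)/2, t = 1 when y lies in V, s^2 = 0.5[(v+ - u+)^2 + (v- - u-)^2] and g = Length(V ∩ U)/Length(U), with U = [y - beta, y + beta]. The function name `interval_quality` and the numbers are hypothetical.

```python
def interval_quality(v_minus, v_plus, y, beta):
    """Width, Include, SEPI and Overlap for one test sample."""
    u_minus, u_plus = y - beta, y + beta              # test interval U
    d = (v_plus - v_minus) / 2.0                      # (half) WIDTH
    t = 1 if v_minus <= y <= v_plus else 0            # INCLUDE
    s = (0.5 * ((v_plus - u_plus) ** 2
                + (v_minus - u_minus) ** 2)) ** 0.5   # SEPI
    # Length of the intersection of V and U (zero if they do not meet).
    common = max(0.0, min(v_plus, u_plus) - max(v_minus, u_minus))
    g = common / (u_plus - u_minus)                   # OVERLAP
    return {"width": d, "include": t, "sepi": s, "overlap": g}


# Made-up example: V = [2.0, 3.0], reference value y = 2.8, beta = 0.5.
print(interval_quality(2.0, 3.0, 2.8, 0.5))
```

In this example U = [2.3, 3.3], so the half-width is 0.5, the reference value is included, SEPI is about 0.3 and the overlap is about 0.7.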

[Figure: Mean Values. Left axis: RMSEP, RMSEPI, Width, Error; right axis: Include, Overlap; horizontal axis: Number of PCs (1-5).]

For test set $x_1, \dots, x_k$:

Mean characteristics for WIDTH, INCLUDE, and OVERLAP are calculated as

$\bar{z} = \frac{1}{k} \sum_{i=1}^{k} z_i$

The mean characteristic for SEPI is calculated as

$\sqrt{\frac{1}{k} \sum_{i=1}^{k} s_i^2}$

and is called RMSEPI (Root Mean Square Error of Prediction Interval).
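A short sketch of the two averaging rules, assuming the per-sample values are already available: an arithmetic mean for Width, Include and Overlap, and a root mean square for SEPI. The function names and the per-sample values are made up for illustration.

```python
from math import sqrt


def mean_characteristic(z):
    """Arithmetic mean (1/k) * sum(z_i), used for Width, Include and Overlap."""
    return sum(z) / len(z)


def rmsepi(s):
    """Root mean square of the per-sample SEPI values s_i."""
    return sqrt(sum(si ** 2 for si in s) / len(s))


# Made-up per-sample values for a test set of k = 4 samples.
include = [1, 1, 0, 1]
sepi = [0.3, 0.1, 0.5, 0.1]

print(mean_characteristic(include))  # fraction of samples whose reference value lies in V
print(rmsepi(sepi))
```

The mean of Include is simply the fraction of test samples whose reference value falls inside the prediction interval.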



[Figure: distribution of the $b_{SIC}$ estimator for n = 11; experiment versus theory.]

Unknown $\beta$. How to Find It?

The Region of Possible Values (RPV) $A(\beta)$ is extended continuously with increasing $\beta$:

$A(0) = \emptyset$, $A(\infty) \ne \emptyset$

There exists a minimum $b$ such that $A(b) \ne \emptyset$. This minimum value may be taken as an estimator for parameter $\beta$:

$\hat{b} = \min\{b : A(b) \ne \emptyset\}$

Unbiased estimator:

$b_{SIC} = \hat{b}\,\frac{n+1}{n-1}$

$b_{SIC}$ distribution (uniform error):

$f(b) = n\,\frac{(n-1)^2}{n+1} \left(\frac{n-1}{n+1}\,b\right)^{n-2} \left(1 - \frac{n-1}{n+1}\,b\right)$
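The smallest b for which the RPV is non-empty is the smallest achievable value of the largest absolute residual. The sketch below computes it for the simplest assumed case, a one-predictor model y = a*x + error without intercept, by testing every slope at which two residual lines cross. It then applies the correction factor (n+1)/(n-1) as read from the slide. The function names and data are hypothetical, and for a general number of parameters the same minimisation is a linear programme.

```python
from itertools import combinations


def max_abs_residual(a, x, y):
    """Largest |y_i - a*x_i| over the calibration set."""
    return max(abs(yi - a * xi) for xi, yi in zip(x, y))


def b_min(x, y):
    """Smallest b such that some slope a satisfies |y_i - a*x_i| <= b for all i."""
    # The maximum of the absolute residuals is convex and piecewise linear in a,
    # so its minimum sits at a slope where two of the linear pieces cross.
    candidates = [yi / xi for xi, yi in zip(x, y) if xi != 0]
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xi != xj:
            candidates.append((yi - yj) / (xi - xj))  # residuals equal
        if xi != -xj:
            candidates.append((yi + yj) / (xi + xj))  # residuals opposite
    return min(max_abs_residual(a, x, y) for a in candidates)


def b_sic(x, y):
    """b_min scaled by (n + 1) / (n - 1), the correction given on the slide."""
    n = len(y)
    return b_min(x, y) * (n + 1) / (n - 1)


# Made-up calibration data with n = 3 samples.
x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
print(b_min(x, y), b_sic(x, y))
```

For this data the minimum is reached at slope 1.02, where the second and third residuals are equal in size and opposite in sign, giving a minimum of about 0.14 and a corrected value of about 0.28.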

Octane Rating Example

[Figure: Spectral data. Absorbance versus wavelength (1100-1550). Series: Short Training Set (1-24), Long Training Set (1-26), Short Test Set (A-I), Long Test Set (A-M); labelled samples: 25, 26, J, K, L, M.]

X-predictors are NIR measurements (absorbance spectra) over 226 wavelengths.

Y-response is reference measurements of octane number.

Training set = 26 samples

Test set = 13 samples

[Figure: geometrical shape of the RPV for number of PCs = 3, short training set ($\beta = 1$).]

Octane Rating Example

[Figure: PCR & SIC for short training set, PCs = 3. Octane number versus test samples A-M (left axis, samples A-I: 86-93; right axis, samples J-M: 60-120). Points are test values with error bars, points are PCR estimates, bars are SIC intervals, curves are borders of PCR confidence intervals.]

[Figure: Optimal Number of PCs. Mean SIC characteristics for β = 1.0, short test set validation. Series: MED, RMSEP, RMSEPI, Width, Include, Overlap; horizontal axis: Number of PCs (1-7).]

Real-world example

Total number of samples (n) = 15
Number of variables (p) = 5
Calibration set = 11 samples
Testing set = 4 samples

Prediction of antioxidant activity using DSC measurements

OIT (deg C) for heating rates of 2, 5, 10, 15 and 20 deg/min

Sample   LTH (days)   2       5       10      15      20
Calibration set
C1       6            193     200     207.1   210.1   209.1
C2       1            173.6   179.2   181.7   190.9   193.2
C3       2            192.5   203.5   204.4   208.5   212.9
C4       18           194     197.7   209.7   212.8   202
C5       3            193.4   192.7   199.1   207.9   209.2
C6       15           194     197.7   209.7   212.8   205.3
C7       1.5          185.8   193.1   199     205.2   209.7
C8       2.5          185.8   193.1   199     205.2   207.1
C9       3            186     192.1   197     211.3   207
C10      3            186     192.1   197     211     208.2
C11      5            203     208.5   216.5   222.9   222
Test set
T1       0.5          185     191.7   197     197.2   211.2
T2       17           194     197.7   209.7   212.8   203.1
T3       8            186.8   191     208.2   205.1   205.1
T4       5            203.9   213.9   220.2   221.4   227.2


[Figure: Score Plot, PC2 versus PC1, with calibration samples C1-C11 and test samples T1-T4.]

SIC Object Status Theory

$y = Xa + \varepsilon$

y is (n × 1); X is (n × p); a is (p × 1)

$y_i = x_i^t a + \varepsilon_i$, $i = 1, \dots, n$

$y_i$ is a point in the response space, $y_i \in R^1$

vector $x_i$ is a point in the predictor space (or score space), $x_i \in R^p$

vector $a$ is a point in the parameter space, $a \in R^p$

a pair $(x, y)$, a sample (object), is a point in $R^{p+1}$

Boundary Sample

[Figure, two panels: RPV and its boundary samples; “Prediction” of the calibration set (samples C1-C11).]

A sample $(x_i, y_i)$ from the calibration set is called a boundary sample if there exists a parameter $a$ from the RPV such that $|y_i - x_i^t a| = \beta$.

Insiders, Outsiders, Outliers

A sample $(x, y)$ is called an “insider” if $A \subset S(x, y)$, i.e. $|y - x^t a| \le \beta$ for any $a$ from $A$.

A sample $(x, y)$ is called an “outsider” if $A \not\subset S(x, y)$, i.e. $|y - x^t a| > \beta$ for some $a$ from $A$.

A sample $(x, y)$ is called an “outlier” if $A \cap S(x, y) = \emptyset$. In other words, $|y - x^t a| > \beta$ for any parameter $a$ from $A$.

[Figure legend: insiders, boundary samples, prediction intervals, regression 90% confidence interval, ‘true’ model y = xa, regression line.]

SIC Object Status Map in $R^{p+1}$

[Figure: PC2 versus PC1 with calibration samples (C1-C11), boundary samples (from calibration set), test samples (T1-T4), the border of absolute outsiders and the region of absolute outsiders.]

A score (predictor) vector $x$ is called an absolute outsider if $A \not\subset S(x, y)$ for any $y$.

The Sample Status in the Response Space

[Figure: Response versus Test Samples 1-4. V(x) is the prediction interval, U(y) is the test interval.]

A sample $(x, y)$ is an insider if and only if $V(x) \subseteq U(y)$ (Sample 1).

If a test sample $(x, y)$ is an insider, then $d(x) \le \beta$ (Sample 1).

A sample $(x, y)$ is an outsider if and only if $V(x) \cap U(y) \ne V(x)$ (Samples 2-4).

A sample $(x, y)$ is an outlier if and only if $V(x) \cap U(y) = \emptyset$ (Sample 3).

A score vector $x$ is an absolute outsider if and only if there exists $y$ such that $U(y) \subset V(x)$ (Sample 4).
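The interval rules can be turned into a small classifier. In this sketch an insider has V inside U, an outlier has V and U disjoint, and anything else is an outsider, with U = [y - beta, y + beta]. Outliers are also outsiders under the definitions, so the function returns the most specific label. The function names and numbers are hypothetical.

```python
def sample_status(v_minus, v_plus, y, beta):
    """Classify a sample from V = [v-, v+] and U = [y - beta, y + beta]."""
    u_minus, u_plus = y - beta, y + beta
    if u_minus <= v_minus and v_plus <= u_plus:
        return "insider"    # V lies inside U
    if v_plus < u_minus or u_plus < v_minus:
        return "outlier"    # V and U do not intersect
    return "outsider"       # V is not inside U, but the intervals overlap


def is_absolute_outsider(v_minus, v_plus, beta):
    """True if V is wider than any test interval, whatever the value of y."""
    return (v_plus - v_minus) > 2 * beta


# Made-up prediction interval V = [2.0, 2.6] and beta = 0.5.
for y_ref in (2.4, 3.0, 3.5):
    print(y_ref, sample_status(2.0, 2.6, y_ref, 0.5))
```

With these numbers the three reference values are classified as insider, outsider and outlier respectively.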

SIC– leverage / SIC–residual

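[Figure: test interval $U = [u^-, u^+]$ of width $2\beta$ around $y$, and prediction interval $V = [v^-, v^+]$ of width $2d$.]

SIC-leverage:

$h(x) = \frac{d(x)}{\beta}$

SIC-residual:

$r(x, y) = y - \frac{v^+(x) + v^-(x)}{2}$

MED-normalized SIC-residual:

$\rho(x, y) = \frac{r(x, y)}{\beta}$

Leverage: a measure of how far a data point is from the majority.

Residual: a measure of the variation that is not taken into account by the model.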

SIC Object Status Map

[Figure: Influence plot, SIC-Residual versus SIC-Leverage.]

Here $h(x)$ is the SIC-leverage and $\rho(x, y)$ is the MED-normalized SIC-residual.

A sample $(x, y)$ is an insider if and only if $|\rho(x, y)| \le 1 - h(x)$

A sample $(x_i, y_i)$ from the calibration set is a boundary sample if and only if $|\rho(x_i, y_i)| = 1 - h(x_i)$

A sample $(x, y)$ is an outlier if and only if $|\rho(x, y)| > 1 + h(x)$

A score vector $x$ is an absolute outsider if and only if $h(x) > 1$
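A sketch of the influence-plot coordinates and the status rules expressed in them. It assumes the SIC-leverage is the half-width of V divided by beta and the MED-normalized SIC-residual is the distance from y to the midpoint of V divided by beta, with the thresholds 1 - h and 1 + h. The function names and numbers are hypothetical.

```python
def sic_leverage(v_minus, v_plus, beta):
    """Half-width of the prediction interval in units of beta."""
    return (v_plus - v_minus) / (2.0 * beta)


def sic_residual(v_minus, v_plus, y, beta):
    """Distance from y to the midpoint of V in units of beta (signed)."""
    return (y - (v_plus + v_minus) / 2.0) / beta


def status_from_influence(h, rho):
    """Status of a sample from its position (h, rho) on the influence plot."""
    if abs(rho) <= 1 - h:
        return "insider"
    if abs(rho) > 1 + h:
        return "outlier"
    return "outsider"


# Made-up prediction interval V = [2.0, 2.6] and beta = 0.5.
h = sic_leverage(2.0, 2.6, 0.5)
for y_ref in (2.4, 3.0, 3.5):
    rho = sic_residual(2.0, 2.6, y_ref, 0.5)
    print(y_ref, status_from_influence(h, rho))
```

On the plot the insiders fill the triangle bounded by the lines |rho| = 1 - h, and a leverage above 1 marks an absolute outsider regardless of the residual.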

[Figure: series of influence plots (SIC-Residual versus SIC-Leverage) with numbered samples 1-11 and 1-15; panels for 2, 3 and 4 PCs.]

The Main Features of the SIC-method

SIC-METHOD

• gives the result of prediction directly in the interval form.

• calculates the prediction interval irrespective of sample position regarding the model.

• summarizes and processes all errors involved in bi-linear modelling all together and estimates the Maximum Error Deviation for the model

• provides wide possibilities for sample classification and outlier detection