NIR Shootout 2002 - Eigenvector Peak...

3/26/2015 1 Automated Peak and Peak-Ratio Selection for Regression and Classification Models of Raman and LIBS Data Jeremy M. Shaver Eigenvector Research, Inc. Brian Marquardt, Tom Dearing, Sergey Mozharov, MarqMetrix NIR Shootout 2002 2002 International Diffuse Reflectance Conference (IDRC) "Shootout" data NIR spectra 654 pharmaceutical tablets Calibration Set, Validation Set, Test Set Two spectrometers Goal: best model with calibration transfer Won by Karl Norris using "Norris Regression" – selected peaks and peak ratios including gap- segment derivative

Transcript of NIR Shootout 2002 - Eigenvector Peak...

  • 3/26/2015

    1

    Automated Peak and Peak-Ratio Selection for Regression and Classification Models of Raman and LIBS Data

    Jeremy M. Shaver Eigenvector Research, Inc.

    Brian Marquardt, Tom Dearing, Sergey Mozharov, MarqMetrix

    NIR Shootout 2002

    • 2002 International Diffuse Reflectance Conference (IDRC) "Shootout" data

    – NIR spectra

    – 654 pharmaceutical tablets

    – Calibration Set, Validation Set, Test Set

    – Two spectrometers

    – Goal: best model with calibration transfer

    • Won by Karl Norris using "Norris Regression": selected peaks and peak ratios, including gap-segment derivative

  • 3/26/2015

    2

    Norris' "Winning" Model

    Term 1 Term 2

    Wavelength Smooth Gap Wavelength Smooth Gap

    Numerator 1142 nm 10 nm 26 nm 1338 nm 0 nm 22 nm

    Denominator 920 nm 0 nm 30 nm

    [Plot: spectra vs. Wavelength, x-axis 600 to 1800]

    Interactive manual selection of regions and smooth/gap parameters
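The table above lists a wavelength, a smooth width, and a gap for each numerator and denominator. A minimal sketch of how one such ratio term could be computed is given below. It assumes a simplified definition (boxcar smoothing followed by a centered difference across the gap), a hypothetical 2 nm wavelength grid, and a synthetic spectrum; the names `moving_average`, `gap_derivative`, and `norris_term` are illustrative and not taken from the slides or from any particular software.

```python
import numpy as np

def moving_average(x, width):
    """Boxcar smooth over `width` points; width <= 1 means no smoothing."""
    x = np.asarray(x, dtype=float)
    if width <= 1:
        return x
    return np.convolve(x, np.ones(width) / width, mode="same")

def gap_derivative(x, gap, smooth=1):
    """Centered difference of the smoothed spectrum across about `gap` points.

    The effective gap is 2 * (gap // 2) points; edge values are NaN.
    """
    s = moving_average(x, smooth)
    half = gap // 2
    d = np.full(s.shape, np.nan)
    if half > 0:
        d[half:len(s) - half] = s[2 * half:] - s[:len(s) - 2 * half]
    return d

def norris_term(x, num, den):
    """Ratio of two gap derivatives; num and den are (index, smooth, gap) in points."""
    top = gap_derivative(x, gap=num[2], smooth=num[1])[num[0]]
    bottom = gap_derivative(x, gap=den[2], smooth=den[1])[den[0]]
    return top / bottom

# Synthetic spectrum on an assumed 2 nm grid (not the shootout data)
wl = np.arange(600.0, 1800.0, 2.0)
x = (np.exp(-((wl - 1142) / 40) ** 2)
     + 0.5 * np.exp(-((wl - 920) / 60) ** 2)
     + 0.001 * wl)
i_num = int(np.argmin(np.abs(wl - 1142)))
i_den = int(np.argmin(np.abs(wl - 920)))

# Term 1 settings converted from nm to points on the 2 nm grid
term = norris_term(x, num=(i_num, 5, 13), den=(i_den, 1, 15))
term_scaled = norris_term(0.7 * x, num=(i_num, 5, 13), den=(i_den, 1, 15))
```

Because both numerator and denominator are differences, an additive offset cancels, and because the term is a ratio, a multiplicative intensity change cancels as well (`term` and `term_scaled` agree).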

    Results using (Approx.) Norris Regression

    Preprocessing: 2nd Derivative (gap: 18 nm, segment: 6 nm) + Integrate + Autoscale

    [Plot: Y Predicted 3 assay vs. Y Measured 3 assay, both axes 150 to 240; legend: calibration data, validation sets (both instruments)]

    RMSEC: 2.9

    RMSECV: 3.0

    RMSEP: 2.8, 2.8 (two validation sets)

    Note: Test set covers same range as calibration data

  • 3/26/2015

    3

    [Plot: Samples/Scores Plot of c & Test; Y Predicted 3 assay vs. Y Measured 3 assay, both axes 150 to 250. Plot annotation: R^2 = 0.9683 Latent Variables; RMSEC = 2.585; RMSECV = 2.6755; RMSEP = 4.0273; Calibration Bias = 0; CV Bias = -0.018797; Prediction Bias = 1.345]

    EMSC + 1st Derivative + Mean Centering

    RMSEC: 2.6

    RMSECV: 2.7

    RMSEP: 3.3, 4.8 (validation sets 1 and 2)

    EMSC = Extended Multiplicative Scatter Correction

    Tabulated Results

                                   RMSEC  RMSECV  Val 1  Val 2  Test 1  Test 2
    Norris Regression              2.7    2.7     2.8    2.8    3.0     3.3
    Expert-Selected Preprocessing  2.6    2.7     3.3    4.8    2.8     4.2

    Good Model… but Bad Transfer

  • 3/26/2015

    4

    Norris Regression – Generically

    Non-linear Regression

    (Gap-Segment 1st Derivative)

    (Peak Normalization)

    (Peak Normalization with variable-gap 1st derivative)
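Only the three labels from this slide are preserved here, without their equations. As a hedged illustration, forms commonly associated with these labels are sketched below. The symbols are assumptions: $x_a, x_b, x_c, x_d$ denote (optionally smoothed) intensities at selected wavelengths, and $b_0, b_1$ are regression coefficients; the notation on the original slide may differ.

```latex
\begin{align}
y &= b_0 + b_1\,(x_a - x_b)
  && \text{(gap-segment 1st derivative)} \\
y &= b_0 + b_1\,\frac{x_a}{x_b}
  && \text{(peak normalization)} \\
y &= b_0 + b_1\,\frac{x_a - x_b}{x_c - x_d}
  && \text{(peak normalization with variable-gap 1st derivative)}
\end{align}
```

The ratio forms are linear in the coefficients but non-linear in the measured variables, which is consistent with the slide describing the approach as non-linear regression.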

    Binary Encoding of Norris Equations

    • Example for 5 variables: [ x1 x2 x3 x4 x5 ]

    [Diagram: binary selection matrix (six rows labeled xi, xi/x1, xi/x2, xi/x3, xi/x4, xi/x5; five columns), in which each 1 selects one term, shown next to the resulting single-variable and ratio terms]

    This much could be done by pre-computing… but at a big memory cost (525 MB for shootout data)
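A minimal sketch of decoding such a binary matrix into terms, computed on demand rather than pre-computed, is given below. The layout is an assumed reading of the slide: row 0 selects a raw variable, row j selects division by x_j, and each column is the numerator variable. The example mask and data are hypothetical, and `decode_terms` and `compute_terms` are illustrative names.

```python
import numpy as np

def decode_terms(mask):
    """Turn an (n + 1, n) binary mask into (numerator, denominator) index pairs.

    Row 0 means "no denominator"; row j (j >= 1) means "divide by variable j - 1".
    """
    terms = []
    for row, col in zip(*np.nonzero(mask)):
        den = None if row == 0 else int(row) - 1
        terms.append((int(col), den))
    return terms

def compute_terms(X, terms):
    """Build only the selected columns from the data matrix X (samples x variables)."""
    cols = [X[:, num] if den is None else X[:, num] / X[:, den]
            for num, den in terms]
    return np.column_stack(cols)

# Hypothetical mask for 5 variables: x2, x1/x3, x4/x5
mask = np.zeros((6, 5), dtype=int)
mask[0, 1] = 1
mask[3, 0] = 1
mask[5, 3] = 1

X = np.array([[1.0, 2.0, 4.0, 8.0, 16.0],
              [2.0, 3.0, 5.0, 7.0, 14.0]])
terms = decode_terms(mask)
features = compute_terms(X, terms)
```

The mask has n^2 + n entries (30 for n = 5), which is the number of columns a full pre-computation would need per sample.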

  • 3/26/2015

    5

    + Allow Subtraction…

    • Example for 5 variables: [ x1 x2 x3 x4 x5 ]

    • One additional group to identify "baseline"

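    [Diagram: the same binary selection matrix (rows xi, xi/x1, xi/x2, xi/x3, xi/x4, xi/x5) with one additional row, labeled -xi and set to 0 1 0 0 0, selecting x2 as the baseline; the resulting terms are built from baseline-subtracted variables (x1 - x2, x3 - x2, x4 - x2, x5 - x2)]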

    Pre-computation would now require 2 x 10^201 variables (for the shootout data)

    variables = 2^n (n^2 + n)
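As a quick check of the count quoted above, the expression can be evaluated exactly with Python integers. The sketch assumes n = 650 variables (the figure given on the binning slide) and reads the expression as 2^n multiplied by (n^2 + n); the interpretation of the 2^n factor as the number of possible baseline selections is an assumption.

```python
n = 650                              # number of variables, as quoted for the shootout data
ratio_terms = n * n + n              # raw variables plus all pairwise ratios
baseline_choices = 2 ** n            # assumed: every on/off pattern of the baseline row
total = baseline_choices * ratio_terms

magnitude = len(str(total)) - 1      # power of ten of the leading digit
leading = str(total)[:2]             # first two significant digits
print(f"about {leading[0]}.{leading[1]} x 10^{magnitude} variables")
```

The result is far beyond anything that could be stored, which is why the terms have to be generated on the fly.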

    + Binning to Reduce Dimensionality

    [Two plots: Mean vs. Variables, x-axis 1050 to 1300]

    5-fold variable bin

    Similar effect to use of smoothing in derivatives

    650 variables = 422,500 possible ratios (many quite boring)

    130 variables = 16,900 possible ratios
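The binning step and the resulting ratio counts can be sketched as follows. This is a minimal illustration that averages adjacent variables in groups of five; the random matrix stands in for real spectra, and `bin_variables` is an illustrative name.

```python
import numpy as np

def bin_variables(X, width):
    """Average each run of `width` adjacent variables (trailing leftovers are dropped)."""
    n_samples, n_vars = X.shape
    n_bins = n_vars // width
    trimmed = X[:, :n_bins * width]
    return trimmed.reshape(n_samples, n_bins, width).mean(axis=2)

X = np.random.default_rng(0).random((4, 650))   # placeholder data: 4 spectra, 650 variables
Xb = bin_variables(X, 5)

ratios_before = X.shape[1] ** 2    # 650 * 650
ratios_after = Xb.shape[1] ** 2    # 130 * 130
```

Averaging neighbors also reduces noise in each binned variable, which is the sense in which binning resembles the smoothing used in derivative preprocessing.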

  • 3/26/2015

    6

    + Genetic Algorithm to Select Terms

    • Try lots of combinations (calculate variable ratios and offsets on-the-fly)

    • Choose best cross-validated results

    • Breed (intermix terms) and repeat

    • Will refer to this as "GA-Norris"

    • Question: Can this approach approximate what the interactive Norris approach does?
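The loop described in the bullets above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used for the results in this talk: individuals are fixed-length lists of ratio terms, fitness is the cross-validated RMSE of ordinary least squares, the better half of the population survives each generation, and the data are synthetic. All function names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(X, terms):
    """Compute ratio terms x_a / x_b on the fly, plus an intercept column."""
    cols = [X[:, a] / X[:, b] for a, b in terms]
    return np.column_stack([np.ones(len(X))] + cols)

def rmsecv(X, y, terms, folds=5):
    """Cross-validated RMSE of least squares on the selected ratio terms."""
    F = features(X, terms)
    idx = np.arange(len(y))
    sq_err = 0.0
    for f in range(folds):
        test = idx % folds == f
        coef, *_ = np.linalg.lstsq(F[~test], y[~test], rcond=None)
        sq_err += np.sum((F[test] @ coef - y[test]) ** 2)
    return float(np.sqrt(sq_err / len(y)))

def random_term(n_vars):
    a, b = rng.choice(n_vars, size=2, replace=False)
    return int(a), int(b)

def ga_select(X, y, n_terms=2, pop_size=30, generations=20, mutation=0.2):
    n_vars = X.shape[1]
    pop = [[random_term(n_vars) for _ in range(n_terms)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: rmsecv(X, y, t))      # best cross-validated first
        parents = pop[: pop_size // 2]               # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            child = [parents[i][k] if rng.random() < 0.5 else parents[j][k]
                     for k in range(n_terms)]        # breed: intermix terms
            child = [random_term(n_vars) if rng.random() < mutation else t
                     for t in child]                 # occasional random replacement
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda t: rmsecv(X, y, t))

# Synthetic data in which y depends on one ratio, x2 / x7
X = rng.uniform(1.0, 2.0, size=(60, 12))
y = 3.0 * X[:, 2] / X[:, 7] + 0.01 * rng.normal(size=60)
best = ga_select(X, y)
best_error = rmsecv(X, y, best)
```

Because fitness is evaluated on the same cross-validation splits for thousands of candidates, a low `best_error` can still be optimistic, which motivates the validation and permutation checks discussed later in the talk.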

    Tabulated Results

                                   RMSEC  RMSECV  Val 1  Val 2  Test 1  Test 2
    Norris Regression              2.7    2.7     2.8    2.8    3.0     3.3
    Expert-Selected Preprocessing  2.6    2.7     3.3    4.8    2.8     4.2
    GA Norris (Cal 1 only)         2.4    2.5     3.9    5.0    2.8     3.7
    GA Norris (Cal 1 & 2)          2.8    2.9     3.0    3.0    3.0     3.3
    Simple GA (Cal 1 & 2)          2.6    2.7     3.7    3.8    3.3     3.5

    Selecting variables based on both instruments (building model from ONE) yields GA Norris preprocessing which closely approximates what Karl Norris did.

  • 3/26/2015

    7

    Outlier Detection Achieved…

    [Two Samples/Scores plots of c & Test, labeled "GA-Norris Model" and "Expert Preprocessing Model". Recovered axis labels: Hotelling T^2 (97.97%), Hotelling T^2 (84.10%), Q Residuals (15.90%)]
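For readers unfamiliar with the two statistics on these plots, a minimal sketch of Hotelling T^2 and Q residuals for a PCA model follows. It uses the textbook definitions with plain NumPy; the scaling and confidence limits used by the software behind the slides may differ, and the data are synthetic.

```python
import numpy as np

def pca_t2_q(X_cal, X_new, n_components):
    """Hotelling T^2 and Q residuals of X_new against a PCA model of X_cal."""
    mean = X_cal.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    P = Vt[:n_components].T                                 # loadings
    score_var = s[:n_components] ** 2 / (len(X_cal) - 1)    # variance of each score
    centered = X_new - mean
    T = centered @ P                                        # scores
    t2 = np.sum(T ** 2 / score_var, axis=1)                 # distance inside the model
    resid = centered - T @ P.T
    q = np.sum(resid ** 2, axis=1)                          # distance from the model plane
    return t2, q

# Synthetic calibration data close to a 2-dimensional subspace of 20 variables
rng = np.random.default_rng(1)
basis = rng.normal(size=(2, 20))
X_cal = rng.normal(size=(50, 2)) @ basis + 0.01 * rng.normal(size=(50, 20))

# A sample with a new, unmodeled contribution
X_odd = X_cal[:1] + rng.normal(size=(1, 20))

t2_cal, q_cal = pca_t2_q(X_cal, X_cal, n_components=2)
t2_odd, q_odd = pca_t2_q(X_cal, X_odd, n_components=2)
```

A sample that the model has not seen before shows up with a Q value far above the calibration samples, which is the behavior the slide title refers to.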

    Where Else Would Ratios Help?

    • Raman – correcting for throughput differences and offsets

    • LIBS – correcting for throughput differences and for emphasizing the importance of "relative abundance"

  • 3/26/2015

    8

    Raman of Octene in Toluene

    [Plot: Raman spectra, x-axis Raman Shift (cm-1), 1000 to 1700]

    • Raman spectra measured on 36 solutions of Octene in Toluene (3 replicates of 12 concentrations)

    • Calibration set for on-line monitoring of polymerization process feed line (octene is comonomer).

    • Little interference or other artifacts in calibration data

    • EXPECT: throughput errors and spectral shifts

    • Calibrate with 24 samples, Validate with 9

    [Plot axis label: Y Predicted Octene (w/w fraction)]

  • 3/26/2015

    9

    Prediction Error Vs. Interferences

    [Bar chart: RMSEP (axis 0 to 0.2) for Autoscale, Normalize, Normalize (1000 cm-1), 1st Derivative, Whittaker Baseline, Baseline + Normalize, and GA Norris; bar annotations (0.89) and (0.50)]

    Autoscale               Scale variables to unit standard deviation
    Normalize               Divide by total intensity
    Normalize (1000 cm-1)   Divide by intensity at 1000 cm-1 peak
    1st Derivative          Savitzky-Golay 1st Derivative (15 point)
    Whittaker Baseline      Automatic baseline subtraction
    GA Norris               Binning + GA Norris Variable Selection + Ratios

    (All methods also include mean centering)
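Three of the simpler methods in the table above can be sketched directly in NumPy. These are minimal illustrations under assumptions: "total intensity" is taken as the sum over each spectrum, the 1000 cm-1 peak is taken as the single channel nearest 1000 cm-1, and the data are random placeholders. The Savitzky-Golay derivative, Whittaker baseline, and GA Norris steps are not shown.

```python
import numpy as np

def mean_center(X):
    """Subtract the column means."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean center, then scale each variable to unit standard deviation."""
    return mean_center(X) / X.std(axis=0, ddof=1)

def normalize_total(X):
    """Divide each spectrum by its total intensity (assumed here to be the row sum)."""
    return X / X.sum(axis=1, keepdims=True)

def normalize_peak(X, axis_values, position=1000.0):
    """Divide each spectrum by its intensity at the channel nearest `position`."""
    idx = int(np.argmin(np.abs(axis_values - position)))
    return X / X[:, [idx]]

# Placeholder data: 6 spectra on a coarse Raman shift axis
shift = np.linspace(1000.0, 1700.0, 8)
X = np.random.default_rng(2).uniform(1.0, 5.0, size=(6, 8))

X_auto = autoscale(X)
X_norm = normalize_total(X)
X_peak = normalize_peak(X, shift)
```

Both normalizations divide out a per-spectrum throughput factor, so `normalize_total(0.5 * X)` equals `normalize_total(X)`; autoscaling does not have that property for sample-to-sample intensity changes.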

    Prediction Error Vs. Interferences

    [Bar chart: RMSEP (axis 0 to 0.2) for Normalize (1000 cm-1); bar annotations (0.89) and (0.50). Legend: As Measured; w/ 0.1 cm-1 Axis Shift; * Throughput Errors; + Baseline Errors]

  • 3/26/2015

    10

    Prediction Error Vs. Interferences

    [Bar chart: RMSEP (axis 0 to 0.2) for Autoscale, Normalize, Normalize (1000 cm-1), 1st Derivative, Whittaker Baseline, Baseline + Normalize, and GA Norris; bar annotations (0.89) and (0.50). Legend: As Measured; w/ 0.1 cm-1 Axis Shift; * Throughput Errors; + Baseline Errors]

    Outlier Detection: Baseline+Norm

    [Plot: axis label Q Residuals (0.00%); annotation "???"]

    Good predictions but Bad Outlier Status

  • 3/26/2015

    11

    Outlier Detection: GA Norris

    [Plot: axis label Q Residuals (0.73%)]

    LIBS / Raman Classification

    • Mystery classes (natural product, difficult to separate classes)

    • Raman data – not much information

    • LIBS data – too much information

    • Anticipate Peak Ratios should help greatly in LIBS!

    • Try GA Norris on LIBS

  • 3/26/2015

    12

    Example GA Norris Results

    [Plot: Error Rate for 1000 selection results, with an axis annotation "Fewer Selected Peaks" and regions marked "Better than using all variables" and "Worse than using all variables"]

    [Four Error Rate panels, one label reading "All Classes", with annotations: All = 17%, Best = 8%; All = 9%, Best = 2%; All = 19%, Best = 0%; All = 8%, Best = 3%]

  • 3/26/2015

    13

    With Randomized Classes (Red)

    Non-linear model

    + variable selection

    + large domain

    = large chance of over-fit

    = use caution & permutation tests
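A permutation test of the kind recommended above can be sketched as follows: the same modeling procedure is rerun many times on shuffled class labels, and the real error rate is compared with the resulting distribution. The classifier (nearest centroid with a simple even/odd holdout), the data, and all names are assumptions chosen to keep the example short; in practice the full selection procedure, including the GA, would be rerun for each permutation.

```python
import numpy as np

def holdout_error(X, y):
    """Error rate of a nearest-centroid classifier: train on even rows, test on odd rows."""
    train = np.arange(len(y)) % 2 == 0
    classes = np.unique(y[train])
    centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in classes])
    dist = ((X[~train][:, None, :] - centroids[None]) ** 2).sum(axis=2)
    predicted = classes[dist.argmin(axis=1)]
    return float(np.mean(predicted != y[~train]))

def permutation_test(score_fn, X, y, n_permutations=200, seed=0):
    """Compare the observed error with errors obtained after shuffling the labels."""
    rng = np.random.default_rng(seed)
    observed = score_fn(X, y)
    permuted = np.array([score_fn(X, rng.permutation(y))
                         for _ in range(n_permutations)])
    # Fraction of shuffled runs that do at least as well as the real labels
    p_value = (1 + np.sum(permuted <= observed)) / (1 + n_permutations)
    return observed, permuted, p_value

# Synthetic two-class data with a real class difference
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5))
X[y == 1] += 2.0

observed, permuted, p_value = permutation_test(holdout_error, X, y)
```

If the error obtained with the real labels is not clearly better than the errors obtained with randomized labels, the apparent performance is likely a product of over-fitting.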

  • 3/26/2015

    14

    Conclusions

    • GA Norris can reproduce Norris Regression results

    • Can be used to achieve similar results to standard preprocessing (but with less sound decisions!)

    • Large chance of over-fit = use caution & permutation tests, or standard methods!!