Data Preprocessing for Quantitative and Qualitative Models
Based on NIR Spectroscopy
[Figure: Tablet NIR Spectra Raw Data, colored by assay value (160 to 230). x-axis: Wavelength (nm), 1100 to 1650; y-axis: Raw Signal, 2.5 to 5.5]
[Figure: Tablet NIR Spectra Preprocessed with MSC, First Derivative and Mean Centering, colored by assay value (160 to 230). x-axis: Wavelength (nm), 1100 to 1650; y-axis: Preprocessed Signal, -0.015 to 0.02]
Barry M. Wise, Ph.D., President
Eigenvector Research, Inc., Manson, WA USA

Outline
• Preprocessing Objective
• Definition of Clutter
• Linearization
• Mean Centering and Autoscaling
• Baseline Removal
• Normalization, Multiplicative Scatter Correction (MSC)
• Smoothing, Filtering and Derivatives
• Orthogonalization Filters: EPO, GLS
• Conclusions

Goal of Preprocessing
• Data preprocessing is what you do to the data before it hits the modeling algorithm (PCA, PLS, MCR, SIMCA, etc.)
• The goal of preprocessing is to remove variation you don’t care about, i.e., clutter, so that the analysis can focus on the variation you do care about
• Examples
  • Systems with scattering: physical vs. chemical effects
  • Classification: intra-class vs. inter-class variation

Measured Signal
• Clutter is present in all measurements (X & Y)
  • clutter = interferences + noise not of interest

$$\mathbf{X} = \mathbf{c}\mathbf{s}^{T} + \mathbf{X}_c + \mathbf{E}$$

• $\mathbf{X}$: measured signal
• $\mathbf{c}\mathbf{s}^{T}$: target signal
• $\mathbf{X}_c$: interference signal
• $\mathbf{E}$: noise
• $\mathbf{X}_c + \mathbf{E}$: clutter signal

Sources of Clutter
• Systematic background variability
  • Variation in chemical interferents
  • Physical effects such as scattering due to particles
• Other changes in the system being observed
  • T, P changes, variable sample matrix, “dark current”
• Variance due to physics of instrument
  • e.g., drift, instrument changes, variable baseline or gain
  • Non-linearity, saturation
• Non-systematic random noise
  • homoscedastic, heteroscedastic

Reasons to Preprocess
• Reduces variance from extraneous sources
• Makes relevant variance more obvious
• Makes statistics work better
• Aids interpretation
• Avoids numerical problems

Transformation to Linear Form
• Within X-block (predictor variables, e.g. spectra)
  • PCA works best with linear relationships
• Between X- & Y-block (predicted variable)
  • PLS regression assumes a linear relation
• If possible, non-linear data should be converted to a linear form (e.g., use known physics of the system)
• Example:
  • Typically work with absorbance rather than transmittance
  • $A = -\log_{10}(I/I_0)$
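The linearization example can be sketched in a few lines. This is a minimal illustration on synthetic data, not the author's code: the function name `to_absorbance`, the `floor` guard, and the constant `k` are assumptions made for the example.

```python
import numpy as np

def to_absorbance(I, I0, floor=1e-12):
    """Convert sample intensity I and reference intensity I0 to absorbance."""
    T = np.asarray(I, dtype=float) / np.asarray(I0, dtype=float)  # transmittance
    # `floor` is an assumed guard against log10(0) on dark or saturated channels.
    return -np.log10(np.clip(T, floor, None))

# Synthetic Beer's-law data: k is a hypothetical absorptivity * pathlength.
conc = np.array([0.0, 1.0, 2.0, 3.0])
k = 0.5
I0 = 100.0
I = I0 * 10.0 ** (-k * conc)      # intensity falls off exponentially with concentration
A = to_absorbance(I, I0)          # absorbance is linear in concentration
```

Here `A` equals `k * conc`, which is the linear relation that PCA and PLS can exploit; the transmitted intensity `I` itself is not linear in concentration.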

Mean Centering
• Often we are most interested in how the data varies around the mean
• Mean centering is done by subtracting the column mean from each column, thus forming a matrix where each column has a mean of zero
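A minimal sketch of mean centering, assuming samples are rows and variables are columns. The helper name `mean_center` and the small arrays are hypothetical; the one design point worth noting is that new samples are centered with the calibration mean, not their own.

```python
import numpy as np

def mean_center(X, mean=None):
    """Subtract column means. Pass `mean` to reuse calibration means on new data."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)          # one mean per column (variable)
    return X - mean, mean

X_cal = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 17.0]])
X_cal_c, mu = mean_center(X_cal)                          # learn means on calibration data
X_new_c, _ = mean_center(np.array([[2.0, 13.0]]), mean=mu)  # apply the same means to a new sample
```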

Centering is an Axis Translation
• Geometry for 2 variables
[Figure: two scatter plots of Variable 2 vs. Variable 1, before and after centering, with the mean vector marked]

Mean Centering on Spectra
[Figure: FTIR of Edible Oils (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance, 0 to 1]
[Figure: FTIR of Edible Oils Mean Centered (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance, -0.05 to 0.1]

Variable Scaling
• Scaling is done to change the variance of variables, and thus the weight given to them in modeling
• Most common is autoscaling, which makes variables unit variance and mean zero
  • Mean center variables
  • Divide by standard deviation
• Autoscaling removes all scale information
  • What’s left is only how the variables correlate with each other
  • The covariance matrix of autoscaled data is the “correlation matrix”
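The link between autoscaling and the correlation matrix can be checked numerically. The sketch below uses random synthetic data; the function name `autoscale` and the choice of the sample standard deviation (`ddof=1`) are assumptions for the example.

```python
import numpy as np

def autoscale(X, ddof=1):
    """Mean center each column, then divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=ddof)      # a constant column would give sd = 0
    return (X - mu) / sd, mu, sd

rng = np.random.default_rng(0)
# Four synthetic variables on very different scales.
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 0.1]) + 5.0

Xa, mu, sd = autoscale(X)
R = Xa.T @ Xa / (X.shape[0] - 1)       # covariance of the autoscaled data
```

`R` matches `np.corrcoef(X, rowvar=False)`, so a PCA on autoscaled data is a PCA of the correlation matrix. Constant columns need special handling because their standard deviation is zero.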

Autoscaling on Spectra
[Figure: FTIR of Edible Oils Mean Centered (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance, -0.05 to 0.1]
[Figure: FTIR of Edible Oils Autoscaled (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance (Autoscaled), -3 to 3]

Sample-to-Sample Baseline
• Baselines can exhibit simple offsets, slopes, polynomials or more complicated functions.
• In the example, the offset is larger than the absorbance features of interest.
• Baseline variation adds variance that can inhibit predictive capability and make extraction of chemical information (e.g., via multivariate curve resolution) difficult.
[Figure: Raw FTIR Spectra. x-axis: Wavenumbers, 4000 to 500; y-axis: -1 to 4]
[Figure: Baseline Close Up. x-axis: Wavenumbers, 4000 to 500; y-axis: -0.05 to 0.15]

Background/Baseline Subtraction
Removal of broad (low-frequency) interferences while retaining higher-frequency features. Only low-order polynomials are used to model the background.
• Detrend: fit polynomial to entire spectrum
• Selected-Points baselining: fit polynomial to selected points in spectrum
• Weighted Least-squares (a.k.a. asymmetric) baselining: fit to automatically selected points on the bottom of the spectrum
• Windowed: Whittaker, Rolling Ball, Median, Minimum, etc.
• Etc.

Selected-Points Baseline
• Detrend based on points in spectrum known to be only baseline. Subtract the result from all channels.
  • good when zero points are known a priori
[Figure: mean spectrum with fitted baseline, Baseline Order: 3. x-axis: Variables, 200 to 1000; y-axis: Mean, 200 to 2200]
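Detrending and selected-points baselining differ only in which channels the polynomial is fitted to, so one function can sketch both. The name `polynomial_baseline`, its arguments, and the synthetic spectrum are assumptions for illustration, not a particular package's API.

```python
import numpy as np

def polynomial_baseline(X, axis_values, order=1, baseline_idx=None):
    """Fit a polynomial to the chosen channels and subtract it from all channels.

    baseline_idx=None fits to every channel (detrend); otherwise only the
    listed channel indices, assumed to contain baseline only, are used.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    x = np.asarray(axis_values, dtype=float)
    # Rescale the axis to [-1, 1] so the polynomial fit stays well conditioned.
    xs = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    if baseline_idx is None:
        baseline_idx = np.arange(x.size)
    V = np.vander(xs, order + 1)                 # polynomial basis, one row per channel
    coef, *_ = np.linalg.lstsq(V[baseline_idx], X[:, baseline_idx].T, rcond=None)
    return X - (V @ coef).T                      # subtract fitted baseline everywhere

# Synthetic spectrum: one peak on top of a sloping baseline.
x = np.linspace(200.0, 1000.0, 161)
peak = np.exp(-0.5 * ((x - 600.0) / 20.0) ** 2)
spectrum = peak + 3.0 + 0.002 * x

# Channels far from the peak are assumed to be baseline only.
baseline_idx = np.flatnonzero(np.abs(x - 600.0) > 150.0)
corrected = polynomial_baseline(spectrum, x, order=1, baseline_idx=baseline_idx)[0]
detrended_line = polynomial_baseline(3.0 + 0.002 * x, x, order=1)[0]
```

With the baseline points chosen away from the peak, `corrected` recovers the peak. Fitting to all channels instead would let the peak pull the fitted line upward, which is why known zero points help.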

Weighted Least-Squares Baselining
• Automatic selection of baseline points by fitting polynomial to the “bottom” (or “top”) of the spectrum → asymmetric fit.
  • Starts with a fit to all points (detrend) then de-weights points above the baseline (those with large positive residuals).
  • Iterates until only points within a defined tolerance on the residuals are kept. (Need to define tolerance on the residuals.)
• Easy approach for simple baselines (e.g., polynomials).
• Can also include known baseline functions.
[Figure: spectrum with fitted baseline. x-axis: Raman Shift (cm-1), 600 to 1800; y-axis: 0 to 5]
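The iteration described above can be sketched as follows. This is a simplified variant that assumes hard 0/1 weights (a point is either kept or dropped); actual implementations may de-weight more gradually. The function name, the default tolerance, and the synthetic spectrum are all assumptions.

```python
import numpy as np

def wls_baseline(y, x, order=2, tol=0.01, max_iter=50):
    """Asymmetric polynomial baseline for one spectrum (hard-threshold sketch)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    xs = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # well-conditioned axis
    V = np.vander(xs, order + 1)
    w = np.ones_like(y)                                    # start with a fit to all points
    baseline = np.zeros_like(y)
    for _ in range(max_iter):
        sw = np.sqrt(w)
        # Weighted least squares: rows are scaled by the square root of the weights.
        coef, *_ = np.linalg.lstsq(V * sw[:, None], y * sw, rcond=None)
        baseline = V @ coef
        resid = y - baseline
        new_w = (resid <= tol).astype(float)               # drop points above the baseline
        if np.array_equal(new_w, w):                       # kept set no longer changes
            break
        w = new_w
    return baseline

# Synthetic spectrum: two peaks on a sloping baseline.
x = np.linspace(600.0, 1800.0, 241)
true_baseline = 1.0 + 0.001 * (x - 600.0)
peaks = (np.exp(-0.5 * ((x - 900.0) / 20.0) ** 2)
         + np.exp(-0.5 * ((x - 1500.0) / 20.0) ** 2))
y = true_baseline + peaks
estimated = wls_baseline(y, x, order=1, tol=0.01)
```

The tolerance has to be chosen relative to the noise level: too tight and real baseline points are discarded, too loose and peak tails bias the fit upward.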

Sample Normalization Methods
• Previous examples removed an offset. How is variance due to changing magnitude removed?
  • variable source or lighting magnitude
  • scattering effects
• Row Normalization: removes magnitude
• Standard Normal Variate (SNV): subtracts the row mean from each row and scales to unit variance
  • Autoscaling of the rows
• Multiplicative Scatter Correction: Determines scale factor that best fits new spectrum to reference
• Be aware that these can “blow up” low signal noisy samples to have more variance
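SNV as described above is autoscaling applied along rows. A minimal sketch, with a hypothetical function name `snv` and made-up spectra:

```python
import numpy as np

def snv(X, ddof=1):
    """Standard Normal Variate: center and scale each row (spectrum) on its own."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, ddof=ddof, keepdims=True)
    return (X - mu) / sd

base = np.array([0.2, 0.5, 1.0, 0.4, 0.1])
# Second spectrum: same shape as the first, but with a different gain and offset.
X = np.vstack([base, 1.3 * base + 0.7])
Z = snv(X)
```

Both rows of `Z` are identical, so the gain and offset differences are gone. The caution on the slide applies here: a nearly flat, noisy spectrum has a small row standard deviation, so dividing by it amplifies the noise.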

Normalization
• Normalize each row / spectrum
• Order of normalization (p-norm)
  • 1-norm: normalize to unit AREA (area = 1)
  • 2-norm: normalize to unit LENGTH (vector length = 1)
  • inf-norm: normalize to unit MAXIMUM (max value = 1)

p-norm: $\|\mathbf{x}\|_p = \left( \sum_{j=1}^{N} |x_j|^p \right)^{1/p}$

[Figure: FTIR of Edible Oils Mean Centered (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance, -0.05 to 0.1]
[Figure: FTIR of Edible Oils Normalized and Centered (Corn, Olive, Saffl, CMarg). x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance (×10^-3), -1.5 to 2.5]
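Row normalization by a p-norm is a one-liner with NumPy. The sketch below uses a hypothetical helper `normalize_rows` and toy data; note that "unit area" for the 1-norm means the absolute values sum to one, which equals the area only up to the channel spacing.

```python
import numpy as np

def normalize_rows(X, p=2):
    """Divide each row by its p-norm (p = 1, 2, or np.inf)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    return X / norms

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])
X1 = normalize_rows(X, p=1)         # rows sum (in absolute value) to 1
X2 = normalize_rows(X, p=2)         # rows have unit vector length
Xinf = normalize_rows(X, p=np.inf)  # largest absolute value in each row is 1
```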

Scatter / Signal Correction
• Multiplicative Scatter Correction (MSC)
  • Attempts to remove offset and row magnitude variability
  • Result is less signal related to scattering artifacts and more signal related to analyte(s) of interest

MSC Example
[Figure: PCA Scores of Edible Oils Centered (CMarg, Corn, Olive, Saffl). x-axis: Scores on PC 1 (75.52%); y-axis: Scores on PC 2 (14.36%)]
[Figure: Zoom in on Olive Oils, samples labeled 10 to 24. x-axis: Scores on PC 1 (75.52%); y-axis: Scores on PC 2 (14.36%)]

Spectra (Selected Wavelengths), Samples 19 & 22
• Sample 22 looks uniformly larger than Sample 19
[Figure: spectra of Sample 19 and Sample 22 at selected wavelengths]

Plot Sample 22 vs. Sample 19
[Figure: Sample 22 plotted against Sample 19 with the identity line; fitted line has slope = 1.0584, intercept = 0]

Multiplicative Effect: Spectra are Identical except one is a Multiple of the Other
• Changing sample pathlength, e.g. changing light scattering with particle size.
• Changing sample density, e.g. changing temperature of sample.
• Changing gain of the instrument.

MSC Multiplicative Signal (Scatter) Correction
• Divide each absorbance of Sample 22 by slope = 1.0584
[Figure: Sample 22 vs. Sample 19 with the identity line]
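The slope-based correction above generalizes to a least-squares fit of each spectrum against a reference. The sketch below is one common way to write MSC, not necessarily the exact variant used for the slides; the function name `msc`, the default of using the mean spectrum as reference, and the absorbance values are assumptions. Only the slope 1.0584 comes from the example.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each row x is modeled as x ≈ a + b * reference and corrected to (x - a) / b.
    With reference=None the mean of the rows of X is used (an assumed default).
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    A = np.column_stack([np.ones_like(ref), ref])       # columns: intercept, reference
    coef, *_ = np.linalg.lstsq(A, X.T, rcond=None)      # one (a, b) pair per spectrum
    a, b = coef
    return (X - a[:, None]) / b[:, None], ref

# Made-up absorbances; sample 22 is sample 19 times the slope from the example.
s19 = np.array([0.10, 0.35, 0.80, 0.55, 0.20])
s22 = 1.0584 * s19
corrected, _ = msc(s22, reference=s19)
```

After correction, `corrected[0]` coincides with `s19`. For new samples, the reference learned on the calibration set should be reused rather than recomputed.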

With MSC
[Figure: PCA Scores of Edible Oils with MSC and Centered (CMarg, Corn, Olive, Saffl). x-axis: Scores on PC 1 (75.67%); y-axis: Scores on PC 2 (22.33%)]

Savitzky-Golay Smoothing and Derivatives
• Derivatives wrt λ can be used to remove offsets/slopes
• Savitzky-Golay smoothing and derivatives
  • piece-wise fit of polynomials to each spectrum
  • use fit directly for smoothing
  • use derivative in each window for estimate of derivative wrt λ
  • smooth + derivative can be boiled down to a set of coefficients
[Figure: original spectrum, spectrum with offset, and spectrum with offset and slope. x-axis: Wavelength, λ, 800 to 1600; y-axis: Absorbance, 0 to 2.5]
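The "set of coefficients" can be derived directly from the local polynomial fit. The sketch below computes them with a pseudo-inverse and applies them by convolution; function names, window size, polynomial order, and the synthetic spectrum are assumptions, and edge channels are simply dropped.

```python
import numpy as np
from math import factorial

def savgol_coefficients(window, order, deriv=0, delta=1.0):
    """Filter weights for the deriv-th derivative at the window centre."""
    if window % 2 != 1 or window <= order or deriv > order:
        raise ValueError("need odd window > order >= deriv")
    half = window // 2
    z = np.arange(-half, half + 1, dtype=float)       # channel offsets in the window
    V = np.vander(z, order + 1, increasing=True)      # columns z^0, z^1, ..., z^order
    # Row `deriv` of pinv(V) gives the least-squares polynomial coefficient a_deriv;
    # the derivative at z = 0 is deriv! * a_deriv, rescaled by the channel spacing.
    return factorial(deriv) * np.linalg.pinv(V)[deriv] / delta ** deriv

def savgol(y, window, order, deriv=0, delta=1.0):
    """Apply the filter; output is shorter by window - 1 (edges dropped)."""
    c = savgol_coefficients(window, order, deriv, delta)
    # np.convolve flips its kernel, so reverse the weights to get a sliding dot product.
    return np.convolve(y, c[::-1], mode="valid")

lam = np.arange(800.0, 1601.0, 2.0)                   # wavelength axis, 2 nm spacing
spectrum = (np.exp(-0.5 * ((lam - 1100.0) / 40.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((lam - 1350.0) / 60.0) ** 2))
with_offset = spectrum + 0.5
with_offset_and_slope = spectrum + 0.5 + 0.001 * lam

d1_orig = savgol(spectrum, 11, 2, deriv=1, delta=2.0)
d1_offset = savgol(with_offset, 11, 2, deriv=1, delta=2.0)
d1_slope = savgol(with_offset_and_slope, 11, 2, deriv=1, delta=2.0)
d2_orig = savgol(spectrum, 11, 2, deriv=2, delta=2.0)
d2_slope = savgol(with_offset_and_slope, 11, 2, deriv=2, delta=2.0)
```

The first derivative is unchanged by the offset, the slope shows up in it as a constant 0.001, and the second derivative is unchanged by both.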

Savitzky-Golay First Derivative
• multicomponent Beer’s Law: $\mathbf{x} = \mathbf{c}\mathbf{S}^{T}$
• with offset: $\mathbf{x} = \mathbf{c}\mathbf{S}^{T} + \alpha\mathbf{1}^{T}$
• first derivative removes the offset: $\dfrac{d\mathbf{x}}{d\lambda} = \mathbf{c}\dfrac{d\mathbf{S}^{T}}{d\lambda}$
[Figure: first derivatives of the original spectrum, spectrum with offset, and spectrum with offset and slope. x-axis: Wavelength, λ, 800 to 1600; y-axis: dA/dλ, -0.04 to 0.1]

Savitzky-Golay Second Derivative
• multicomponent Beer’s Law: $\mathbf{x} = \mathbf{c}\mathbf{S}^{T}$
• with offset and slope: $\mathbf{x} = \mathbf{c}\mathbf{S}^{T} + \alpha\mathbf{1}^{T} + \beta\boldsymbol{\lambda}^{T}$
• first derivative: $\dfrac{d\mathbf{x}}{d\lambda} = \mathbf{c}\dfrac{d\mathbf{S}^{T}}{d\lambda} + \beta\mathbf{1}^{T}$
• second derivative removes the offset and slope: $\dfrac{d^{2}\mathbf{x}}{d\lambda^{2}} = \mathbf{c}\dfrac{d^{2}\mathbf{S}^{T}}{d\lambda^{2}}$
[Figure: second derivatives of the original spectrum, spectrum with offset, and spectrum with offset and slope. x-axis: Wavelength, λ, 800 to 1600; y-axis: d²A/dλ², -0.015 to 0.015]

EPO and GLS Filters
• EPO = External Parameter Orthogonalization
• GLS = Generalized Least Squares filter
• Both use samples that characterize the clutter
  • Variation not related to the problem of interest
  • Classification problems: intra-class (within-class) variance
  • Regression problems: differences between samples with the same property value
• EPO makes a PCA model of the clutter and orthogonalizes the data against the first few PCs (hard filter)
• GLS calculates a weighted inverse of the clutter covariance and applies it to all data (soft filter)
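The two filters can be sketched side by side. This is an illustration under stated assumptions, not the author's implementation: clutter is taken as within-class deviations, the GLS weighting uses the commonly described form G = V D⁻¹ Vᵀ with D² = S²/α + I, and the value of `alpha`, the function names, and the toy data are all hypothetical.

```python
import numpy as np

def clutter_from_classes(X, labels):
    """Within-class deviations: each sample minus its own class mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = np.empty_like(X)
    for g in np.unique(labels):
        m = labels == g
        Xc[m] = X[m] - X[m].mean(axis=0)
    return Xc

def epo_filter(Xc, n_components):
    """Hard filter: projector that removes the first few clutter PCs entirely."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    return np.eye(Xc.shape[1]) - V @ V.T

def gls_filter(Xc, alpha=0.02):
    """Soft filter: shrink each clutter direction according to its variance."""
    C = Xc.T @ Xc / (Xc.shape[0] - 1)          # clutter covariance
    s2, V = np.linalg.eigh(C)
    s2 = np.clip(s2, 0.0, None)                # guard against tiny negative eigenvalues
    d = np.sqrt(s2 / alpha + 1.0)
    return (V / d) @ V.T                       # V diag(1/d) V^T

# Toy data: classes differ along direction a, clutter lies along direction b.
a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
t = np.array([-1.0, 0.0, 1.0, 2.0])
X = np.vstack([np.outer(t, b), a + np.outer(t, b)])
labels = np.array([0] * 4 + [1] * 4)

Xc = clutter_from_classes(X, labels)
X_epo = X @ epo_filter(Xc, n_components=1)     # clutter direction removed
X_gls = X @ gls_filter(Xc, alpha=0.02)         # clutter direction shrunk, not removed
```

In this toy case both filters leave the class difference along `a` untouched; EPO zeroes the `b` direction while GLS only down-weights it, and a smaller `alpha` makes the GLS filter behave more like the hard filter.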

With MSC and GLS
[Figure: PCA Scores of Edible Oils with MSC and Centered (CMarg, Corn, Olive, Saffl). x-axis: Scores on PC 1 (75.67%); y-axis: Scores on PC 2 (22.33%)]
[Figure: PCA Scores of Edible Oils with MSC, GLS & Centered. x-axis: Scores on PC 1 (73.38%); y-axis: Scores on PC 2 (25.05%)]
[Figure: PCA Scores of Test Samples with MSC, GLS & Centered. x-axis: Scores on PC 1 (73.38%); y-axis: Scores on PC 2 (25.05%)]

NIR Shootout 2002
• Estimate tablet assay value from NIR transmittance
• Calibration (155 samples), Test (460 samples)
[Figure: Raw Tablet Data Colored by Assay Value (160 to 230). x-axis: Wavelength (nm), 1100 to 1600; y-axis: Signal, 2.5 to 5.5]
[Figure: MSC Preprocessed Tablet Data Colored by Assay Value (160 to 230). x-axis: Variables, 1100 to 1600; y-axis: Data, 2.5 to 4.5]
[Figure: MSC & 1st Derivative Preprocessed Tablet Data, colored by assay value (160 to 230). x-axis: Variables, 1100 to 1600; y-axis: Data, -0.1 to 0.08]
[Figure: MSC, 1st Derivative and Mean Centered Tablet Data, colored by assay value (160 to 230). x-axis: Variables, 1100 to 1600; y-axis: Data, -0.015 to 0.01]

Prediction Error on Validation Set
[Figure: bar chart of RMSEP for Validation Set (axis 0 to 10) by Preprocessing Method: Mean Center; MSC; 1-Norm; 2-Norm; SavGol-1,SNV; SavGol-2,SNV; SNV; EMSC; SavGol-2,EMSC; GLSW; SavGol-2,SNV,GLSW]
[Figure: Assay, SavGol-2,SNV,GLSW: Predicted vs. Measured (150 to 250) for Calibration and Validation samples; RMSEP = 2.7]
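RMSEP, the figure of merit used to compare the preprocessing methods, is the root mean squared difference between predicted and measured values on the validation set. A minimal sketch follows; the function name `rmsep` is an assumption and the numbers are invented for illustration, not taken from the Shootout data.

```python
import numpy as np

def rmsep(y_measured, y_predicted):
    """Root mean squared error of prediction."""
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((y_predicted - y_measured) ** 2)))

# Invented assay values, for illustration only.
y_val = np.array([200.0, 210.0, 190.0])
y_hat = np.array([203.0, 206.0, 190.0])
error = rmsep(y_val, y_hat)       # errors are 3, -4, 0
```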

Perspectives on Preprocessing
• Order matters. The general approach is:
  1. Background and offset removal
  2. Normalization
  3. Centering
  4. Scaling
• Always keep in mind: “what is each preprocessing step supposed to be doing?”
• Plot data after preprocessing and color code!
• Always compare the effect of the preprocessing (classification or regression error rates) with the results from a model based on the raw data
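A small numerical check of why order matters, using only steps 2 and 3 of the list. The helper names and the toy matrix are assumptions; rows 0 and 1 are constructed to have the same shape and differ only in magnitude.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit vector length (2-norm)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def center_columns(X):
    """Subtract the column means."""
    return X - X.mean(axis=0)

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],     # same shape as row 0, twice the magnitude
              [1.0, 3.0, 2.0]])

recommended = center_columns(normalize_rows(X))   # normalize, then center
swapped = normalize_rows(center_columns(X))       # center, then normalize
```

In `recommended`, rows 0 and 1 end up identical because the magnitude difference is removed before centering. In `swapped`, centering first mixes the magnitude difference into the column means, and the later normalization can no longer undo it.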

Preprocessing will offer…
• Models with better predictive or classification performance and/or
• Simpler models that are more robust and/or easier to interpret
• But there is a risk that you can remove useful information from data
• Preprocessing must be validated as part of the model development process