Data Normalization Approaches for Large-Scale Metabolomic Studies
Dmitry Grapov, PhD
Analytical Variance
Variation in sample measurements stemming from sample handling, data acquisition, processing, etc.
• Can modify or mask true biological variability
• Calculated based on variance in replicated measurements
• Can be accounted for using data normalization approaches
Goal: minimize analytical variance using data normalization
Drift in >400 replicated measurements across >100 batches
Need for Normalization
To remove non-biological (e.g. analytical) drift/variance/artifacts in measurements
Figure labels: acquisition order; processing/acquisition batches; samples; quality controls (QCs)
Quantifying Data Quality (precision)
Calculate median inter- and intra-batch %RSD (for replicated measurements)
Analyte specific performance across whole study
Within batch performance
Visualizing Performance
Intra-batch (within) precision for normalization methods
Inter-batch (across) precision for normalization methods
RSD = relative standard deviation = standard deviation/mean
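The %RSD definition above can be sketched in a few lines. This is a minimal illustration, not the deck's R implementation: the function names, the toy QC values and the choice to skip batches with fewer than two replicates are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)


def median_intra_batch_rsd(qc_values, batches):
    """Median of the per-batch %RSD for one analyte's QC measurements."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, batches):
        by_batch[batch].append(value)
    # stdev needs at least two replicates, so smaller batches are skipped
    rsds = [percent_rsd(v) for v in by_batch.values() if len(v) >= 2]
    return median(rsds)


def inter_batch_rsd(qc_values):
    """%RSD across all QC measurements of one analyte, ignoring batch."""
    return percent_rsd(qc_values)


# Toy QC data (hypothetical): two batches with a large shift between them
qc = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
intra = median_intra_batch_rsd(qc, batch)
inter = inter_batch_rsd(qc)
print(f"intra-batch %RSD: {intra:.1f}, inter-batch %RSD: {inter:.1f}")
```

In this toy case the within-batch precision looks good while the across-batch %RSD is inflated by the batch shift, which is the pattern normalization is meant to remove.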
Visualizing Metabolite Performance
Figure panels: univariate (by acquisition time and batch); multivariate (PCA)
Common Normalization Approaches
Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.
Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
Quality control (QC) or reference sample
• Batch ratio (mean, median)
• LOESS (doi:10.1038/nprot.2011.335; locally estimated scatterplot smoothing)
• Hierarchical mixed effects (Jauhiainen et al., 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)
Variance based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalization (Huber et al., 2002)
Evaluation of Normalizations
Use QCs to define:
• Median within batch %RSD
• Median analyte study wide %RSD
• All normalization specific parameters
• Split QCs into training and test set
• Optimize tuning parameters using leave-one-out cross-validation
• Assess performance on test set
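The train/test and leave-one-out steps above can be sketched as follows. To keep the example short, a Gaussian kernel smoother stands in for the normalization model and its bandwidth plays the role of the tuning parameter; the smoother, the simulated QC drift and all names are assumptions for illustration, not the deck's method.

```python
import math


def kernel_smooth(x_train, y_train, x0, bandwidth):
    """Gaussian-kernel weighted mean of y_train at x0 (stand-in model)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: fall back to the nearest point
        nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
        return y_train[nearest]
    return sum(w * y for w, y in zip(weights, y_train)) / total


def loo_error(x, y, bandwidth):
    """Mean squared leave-one-out prediction error on the training QCs."""
    errors = []
    for i in range(len(x)):
        pred = kernel_smooth(x[:i] + x[i + 1:], y[:i] + y[i + 1:], x[i], bandwidth)
        errors.append((y[i] - pred) ** 2)
    return sum(errors) / len(errors)


def holdout_error(x_train, y_train, x_test, y_test, bandwidth):
    """Mean squared prediction error on the held-out test QCs."""
    errors = [(y - kernel_smooth(x_train, y_train, x, bandwidth)) ** 2
              for x, y in zip(x_test, y_test)]
    return sum(errors) / len(errors)


# Simulated QC signal (hypothetical): linear drift plus a deterministic wiggle
order = [float(i) for i in range(20)]
signal = [100.0 + 2.0 * i + 3.0 * math.sin(1.7 * i) for i in range(20)]

# Every fourth QC is held out as the test set
test_idx = {i for i in range(20) if i % 4 == 3}
x_train = [order[i] for i in range(20) if i not in test_idx]
y_train = [signal[i] for i in range(20) if i not in test_idx]
x_test = [order[i] for i in sorted(test_idx)]
y_test = [signal[i] for i in sorted(test_idx)]

# Tune on training QCs only, then assess once on the test QCs
candidates = [0.5, 2.0, 8.0, 50.0]
best_bandwidth = min(candidates, key=lambda b: loo_error(x_train, y_train, b))
test_error = holdout_error(x_train, y_train, x_test, y_test, best_bandwidth)
baseline_error = holdout_error(x_train, y_train, x_test, y_test, 50.0)
print(best_bandwidth, round(test_error, 2), round(baseline_error, 2))
```

Keeping the test QCs out of the tuning loop is what makes the final error estimate an honest check for overfitting.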
Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r
Scalar Normalization
Calculate a sample-specific scalar so that each sample’s summary signal (sum, mean, median, etc.) is equivalent
• Using sum signal normalization (sum norm) assumes equivalent total metabolite signal per sample
• Can correct for batch effects when valid
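A minimal sketch of sum normalization, assuming each sample is a row of metabolite signals and that the common target is the mean total signal across samples (the target choice and the function name are assumptions for the example):

```python
def sum_normalize(samples):
    """Scale each sample (row) so its total signal equals the mean total."""
    totals = [sum(row) for row in samples]
    target = sum(totals) / len(totals)
    return [[value * target / total for value in row]
            for row, total in zip(samples, totals)]


# Two hypothetical samples with identical composition but a 2x intensity shift
raw = [[10.0, 20.0, 30.0],
       [20.0, 40.0, 60.0]]
normalized = sum_normalize(raw)
print(normalized)
```

After scaling, both rows carry the same total signal, so only relative (compositional) differences between samples remain.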
BMC Bioinformatics 2007, 8:93 doi:10.1186/1471-2105-8-93
These normalizations may hide true biological trends or create false ones
After sum norm, phospholipids seem lower in ob/ob when in reality they are the same as in wt samples
Batch Ratio (BR) Normalization
Use QCs to:
1. Calculate batch/analyte-specific correction factor = (batch median / global median)
2. Apply ratio to samples
• simple
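The two steps above can be sketched for a single analyte as follows. The sketch assumes the global median is taken over all QC measurements of that analyte and that every batch contains at least one QC; both are assumptions, as are the names used.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_normalize(values, batches, is_qc):
    """Divide each value by its batch factor (batch QC median / global QC median)."""
    qc_by_batch = defaultdict(list)
    for value, batch, qc in zip(values, batches, is_qc):
        if qc:
            qc_by_batch[batch].append(value)
    # Assumption: the global median is computed over all QCs of this analyte
    global_median = median(v for v, qc in zip(values, is_qc) if qc)
    factors = {b: median(v) / global_median for b, v in qc_by_batch.items()}
    # A batch without QCs raises KeyError here, since no factor can be defined
    return [value / factors[batch] for value, batch in zip(values, batches)]


# Hypothetical analyte: batch b2 reads about twice as high as batch b1
values = [100.0, 110.0, 105.0, 200.0, 220.0, 210.0]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]
is_qc = [True, True, False, True, True, False]
normalized = batch_ratio_normalize(values, batches, is_qc)
print(normalized)
```

Because one factor is applied to a whole batch, this correction cannot follow drift within a batch.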
LOESS Normalization (local smoothing)
For each analyte use QCs to:
• Tune LOESS model (span or degree of smoothing)
• Apply LOESS model to remove analytical variance from samples
Figure panels: raw; LOESS normalized
LOESS Normalization
LOESS span has a large effect on model fit
span (α) defines the degree of smoothing and is critical for controlling overfitting
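To make the role of the span concrete, here is a simplified LOESS-style sketch: a local linear fit with tricube weights over the nearest span × n QCs, trained on QCs and applied to samples. It is an illustration under stated assumptions, not the deck's R code: production LOESS implementations add options such as local quadratic fits and robustness iterations, and rescaling to the QC median is an assumed convention.

```python
import math
from statistics import median


def loess_predict(x, y, x0, span):
    """Local linear fit at x0 using the nearest ceil(span * n) points."""
    n = len(x)
    k = max(2, min(n, math.ceil(span * n)))
    nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
    # Widen the bandwidth slightly so the farthest point keeps a small weight
    h = max(abs(x[i] - x0) for i in nearest) * 1.001
    if h == 0.0:
        return sum(y[i] for i in nearest) / k
    w = [(1.0 - (abs(x[i] - x0) / h) ** 3) ** 3 for i in nearest]  # tricube
    sw = sum(w)
    mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
    if sxx == 0.0:  # no spread in x: fall back to the weighted mean
        return my
    sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
    return my + (sxy / sxx) * (x0 - mx)


def loess_normalize(qc_order, qc_signal, sample_order, sample_signal, span):
    """Divide each sample by the QC trend at its acquisition order, then
    rescale to the QC median (assumed convention)."""
    target = median(qc_signal)
    return [s / loess_predict(qc_order, qc_signal, o, span) * target
            for o, s in zip(sample_order, sample_signal)]


# Hypothetical analyte with linear drift over acquisition order
qc_order = [float(o) for o in range(0, 51, 5)]
qc_signal = [100.0 + o for o in qc_order]
sample_order = [2.0, 27.0, 48.0]
sample_signal = [80.0 * (100.0 + o) / 100.0 for o in sample_order]  # same true level
normalized = loess_normalize(qc_order, qc_signal, sample_order, sample_signal, span=0.5)
print([round(v, 3) for v in normalized])
```

A smaller span uses fewer QCs per local fit and follows them more closely, which is why the span has to be tuned rather than fixed.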
LOESS Normalization
raw samples (red) normalized based on QCs (black)
model is trained on QCs and applied to samples
span: too high; just right?
Cannot assume convergence of training and test performance because test data contain analytical + biological variance
LOESS Normalization
Avoiding overfitting is critical when using LOESS normalization
Example LOESS Normalization
Figure panels: raw; span = 0.75; span = 0.005
Metabolomic Data Case Study I
GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No internal standards (ISTDs)
Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization
Batch Performance (GC-TOF Raw)
Within batch
• Median: 26
• Min: 19
• Max: 69

MedianRSD   count   cumulative %
10-20           3              2
20-30          98             76
30-40          26             96
40-50           3             98
50-60           1             99
60-70           1            100

MedianRSD   count   cumulative %
0-10           10              3
10-20          83             30
20-30         100             62
30-40          69             84
40-50          32             94
50-60           6             96
60-70           3             97
70-80           5             98
80-90           1             99
90-100          1            100
Analyte Performance (GC-TOF Raw)
Within Batch
• Median: 24
• Min: 7
• Max: 79
PCA (GC-TOF Raw)
Within batches
• Median: 23
• Min: 17
• Max: 69

MedianRSD   count   cumulative %
10-20          25             23
20-30          67             85
30-40          15             99
40-50           1            100
60-70           1            101
Batch Performance (GC-TOF BR)
MedianRSD   count   cumulative %
0-10           17              6
10-20         103             39
20-30         112             75
30-40          57             93
40-50          12             97
50-60           5             99
60-70           3            100
70-80           1            100

Across batches
• Median: 24
• Min: 7
• Max: 79
Batch Performance (GC-TOF BR)
PCA (GC-TOF BR)
BR Normalization Limitations
• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match
Within batches
• Median: 19
• Min: 11
• Max: 58

MedianRSD   count   cumulative %
10-20          75             57
20-30          51             96
30-40           4             99
40-50           1             99
50-60           1            100
Batch Performance (GC-TOF LOESS)
MedianRSD   count   cumulative %
0-10           17              6
10-20         103             39
20-30         112             75
30-40          57             93
40-50          12             97
50-60           5             99
60-70           3            100
70-80           1            100

Across batches
• Median: 19
• Min: 2.9
• Max: 66
Batch Performance (GC-TOF LOESS)
PCA (GC-TOF LOESS)
LOESS Normalization Limitations
Figure panels: raw; normalized
LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set
Sum mTIC Normalization (GC-TOF)
Improved performance over raw and BR, but alters data from magnitudinal to compositional
Sum mTIC Normalization (GC-TOF)
Poor removal of trends due to acquisition time, but limits the magnitude of outlier samples compared to other approaches
Figure panels (signal vs. time): Raw; mTIC Normalized
Metabolomic Data Case Study II
LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)
• NOMIS (IS = ISTD)
• qcISTD
Internal Standards Normalization
Figure axes: Analyte; Retention time
Internal standards (ISTD)
• qcISTD (QC optimized metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
NOMIS
ISTD Based Normalizations (LC/Q-TOF)
• NOMIS (linear combination of optimal ISTDs; Sysi-Aho et al., 2007)
• qcISTD (QC optimized ISTD strategy)
PC 38:6
Poor performance with NOMIS
qcISTD Normalization
Use QC samples to:
1. Evaluate analyte %RSD before and after corrections using all ISTDs
2. Select analyte/ISTD combinations with %RSD improvement over raw data at some threshold (e.g. 10%)
3. Correct sample analytes with QC defined ISTD if ISTD recovery is above some minimal threshold (e.g. > 20% of median)
• Subject to overfitting
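The three qcISTD steps can be sketched for one analyte as below. This is a reading of the slide, not the deck's R code: the 10% threshold is treated as a relative %RSD improvement, the recovery reference is the median ISTD signal in QCs, and corrected values are rescaled by that reference to stay on the raw scale. All names and data are hypothetical.

```python
from statistics import mean, median, stdev


def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean."""
    return 100.0 * stdev(values) / mean(values)


def select_istd(analyte_qc, istd_qc, min_improvement=0.10):
    """Steps 1-2: return the ISTD whose ratio correction lowers QC %RSD the
    most, or None if no ISTD improves on raw by at least min_improvement."""
    best_name = None
    best_rsd = percent_rsd(analyte_qc) * (1.0 - min_improvement)
    for name, istd in istd_qc.items():
        rsd = percent_rsd([a / s for a, s in zip(analyte_qc, istd)])
        if rsd <= best_rsd:
            best_name, best_rsd = name, rsd
    return best_name


def correct_samples(analyte, istd, istd_reference, min_recovery=0.20):
    """Step 3: ratio-correct each sample, leaving it raw when the ISTD
    recovery is at or below min_recovery * istd_reference."""
    corrected = []
    for a, s in zip(analyte, istd):
        if s > min_recovery * istd_reference:
            corrected.append(a / s * istd_reference)  # rescale to raw units
        else:
            corrected.append(a)
    return corrected


# Hypothetical QC data: ISTD "A" tracks the analyte, ISTD "B" does not
analyte_qc = [100.0, 120.0, 80.0, 110.0]
istd_qc = {"A": [1.0, 1.2, 0.8, 1.1], "B": [1.0, 1.0, 1.0, 1.0]}
chosen = select_istd(analyte_qc, istd_qc)

# Hypothetical samples: the third has very low ISTD recovery
sample_analyte = [150.0, 90.0, 5.0]
sample_istd = [1.5, 0.9, 0.1]
reference = median(istd_qc[chosen])
corrected = correct_samples(sample_analyte, sample_istd, reference)
print(chosen, corrected)
```

Because the ISTD is chosen on the same QCs used to judge it, the selection should be checked on held-out QCs, in line with the overfitting caveat above.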
191 of 326 (60%) are ISTD corrected
qcISTD Normalization
Figure panels: ISTD used by retention time (Rt); total number of analytes corrected by ISTD
Optimal Lipidomic ISTDs
Normalizations (LC-Q-TOF)
LOESS performs very poorly for two metabolites
• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision
PCA (LC/Q-TOF)
Raw (%RSD = 13)
qcISTD (%RSD = 9)
LOESS (%RSD = 12)
qcISTD + LOESS (%RSD = 8)
Only normalizations that include LOESS effectively remove analytical batch effects
Conclusion
• Comparison of common data normalization approaches suggests that in addition to ISTD corrections, LOESS (analyte-specific, non-linear adjustment based on QC performance at various data acquisition times) is superior to batch based corrections.
• Further validations need to be completed to confirm the effects of normalizations on samples’ variance
• These findings suggest that inclusion of “batch” as a covariate in statistical models will not fully account for analytical variance
R code for all normalization functions can be found at: https://github.com/dgrapov/devium/blob/master/R/Devium%20Normalization.r
metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154