
Data Normalization Approaches for Large-Scale Metabolomic Studies

Dmitry Grapov, PhD

Analytical Variance

Variation in sample measurements stemming from sample handling, data acquisition, processing, etc.
• Can modify or mask true biological variability
• Calculated based on variance in replicated measurements
• Can be accounted for using data normalization approaches

Goal: minimize analytical variance using data normalization

Drift in >400 replicated measurements across >100 batches

Need for Normalization

To remove non-biological (e.g. analytical) drift/variance/artifacts in measurements

Figure panels: acquisition order; processing/acquisition batches. Legend: samples, quality controls (QCs).

Quantifying Data Quality (precision)

Calculate median inter- and intra-batch %RSD (for replicated measurements)

Analyte specific performance across whole study

Within batch performance

Visualizing Performance

Intra-batch (within) precision for normalization methods

Inter-batch (across) precision for normalization methods

RSD = relative standard deviation = standard deviation/mean
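The precision metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names and the toy QC values are hypothetical, intra-batch precision is assumed to be the median of the per-batch %RSDs, and inter-batch precision is assumed to be the %RSD of all QC replicates pooled across batches.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def pct_rsd(values):
    """Percent relative standard deviation: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def intra_batch_rsd(qc_values, batches):
    """Median of the per-batch %RSDs for one analyte's QC measurements."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, batches):
        by_batch[batch].append(value)
    # A batch needs at least two replicates to have a standard deviation.
    rsds = [pct_rsd(v) for v in by_batch.values() if len(v) >= 2]
    return median(rsds)


def inter_batch_rsd(qc_values):
    """%RSD of one analyte's QC measurements pooled across all batches."""
    return pct_rsd(qc_values)


# Hypothetical QC replicates: precise within each batch, shifted between batches.
qc = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]

intra = intra_batch_rsd(qc, batch)  # each batch has a %RSD of 10
inter = inter_batch_rsd(qc)         # pooled %RSD is much larger (about 38)
```

In this toy case the within-batch precision looks good while the pooled precision is poor, which is the batch-effect pattern the normalizations aim to remove.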

Visualizing Metabolite Performance

Figure panels: univariate (by acquisition time and by batch); multivariate (PCA)

Common Normalization Approaches

Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.

Internal standard (ISTD)
• Ratio response (metabolite/ISTD)

• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)

• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

Quality control (QC) or reference sample
• Batch ratio (mean, median)

• LOESS (doi:10.1038/nprot.2011.335; locally estimated scatterplot smoothing)

• Hierarchical mixed effects (Jauhiainen et al. 2014)

• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)

Variance Based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)

• Variance stabilizing normalization (Huber et al. 2002)

Evaluation of Normalizations

Use QC to define:
• Median within-batch %RSD
• Median analyte study-wide %RSD
• All normalization-specific parameters

• Split QCs into training and test set
• Optimize tuning parameters using leave-one-out cross-validation
• Assess performance on test set

Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r

Scalar Normalization

Calculate a sample-specific scalar so that each sample’s summary signal (sum, mean, median, etc.) is equivalent

• Using sum signal normalization (sum norm) assumes equivalent total metabolite signal per sample

• Can correct for batch effects when valid

BMC Bioinformatics 2007, 8:93 doi:10.1186/1471-2105-8-93

These normalizations may hide true biological trends or create false ones

After sum norm, phospholipids seem lower in ob/ob when in reality they are the same as in wt samples
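A minimal Python sketch of sum normalization shows how this artifact arises. The two-analyte samples and the "other_lipids" analyte are hypothetical values chosen for illustration, not data from the study.

```python
def sum_normalize(sample):
    """Scale one sample (analyte -> signal) so that its signals sum to 1."""
    total = sum(sample.values())
    return {analyte: signal / total for analyte, signal in sample.items()}


# Hypothetical samples: the phospholipid signal is identical in both,
# but the ob/ob sample carries three times more signal from other lipids.
wt = {"phospholipid": 50.0, "other_lipids": 50.0}
ob = {"phospholipid": 50.0, "other_lipids": 150.0}

wt_norm = sum_normalize(wt)
ob_norm = sum_normalize(ob)

# Before normalization the phospholipid signals are equal (50 vs 50).
# After normalization ob/ob appears to have half the wt level (0.25 vs 0.5),
# purely because its total signal is larger.
```

The assumption of equivalent total signal per sample is violated here, so the scalar correction creates a difference that is not present in the raw measurements.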

Batch Ratio (BR) Normalization

Use QCs to:
1. Calculate a batch/analyte-specific correction factor = (batch median / global median)
2. Apply ratio to samples

• simple
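The two steps can be sketched for a single analyte as follows. This is an illustration under stated assumptions: the function names and values are hypothetical, the global median is taken over all QC replicates, and "apply ratio" is read as dividing each sample by its batch's correction factor.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_factors(qc_values, qc_batches):
    """Per-batch correction factor: QC batch median / QC global median."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, qc_batches):
        by_batch[batch].append(value)
    global_median = median(qc_values)
    return {b: median(v) / global_median for b, v in by_batch.items()}


def batch_ratio_normalize(values, batches, factors):
    """Divide each sample by the correction factor of its batch."""
    return [v / factors[b] for v, b in zip(values, batches)]


# Hypothetical QCs for one analyte: batch b2 reads twice as high as b1.
qc_values = [90.0, 100.0, 110.0, 180.0, 200.0, 220.0]
qc_batches = ["b1", "b1", "b1", "b2", "b2", "b2"]
factors = batch_ratio_factors(qc_values, qc_batches)

# Two samples with the same true level, measured in different batches.
samples = [50.0, 100.0]
sample_batches = ["b1", "b2"]
corrected = batch_ratio_normalize(samples, sample_batches, factors)
```

After correction both samples land on the same value, because each batch is rescaled to the shared QC median. The correction is a single step per batch, so it cannot follow drift inside a batch.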

LOESS Normalization (local smoothing)

For each analyte use QCs to:
• Tune LOESS model (span or degree of smoothing)
• Fit a LOESS model to remove analytical variance from samples

Figure panels: raw; LOESS normalized

LOESS Normalization

LOESS span has a large effect on model fit

span (α) defines the degree of smoothing and is critical for controlling overfitting

LOESS Normalization

raw samples (red) normalized based on QCs (black)

model is trained on QCs and applied to samples
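The train-on-QCs, apply-to-samples idea can be sketched in plain Python for one analyte. This is a simplified stand-in rather than the author's R code: `loess_predict` is a basic tricube-weighted local linear fit, the span is chosen by leave-one-out cross-validation on the QCs from a hypothetical candidate list, and the correction is assumed to be multiplicative (sample / fitted QC trend × QC median).

```python
import math
from statistics import median


def loess_predict(x_train, y_train, x_new, span):
    """Predict y at x_new with a tricube-weighted local linear fit."""
    n = len(x_train)
    k = min(n, max(3, math.ceil(span * n)))  # neighbours in the local window
    dists = [abs(x - x_new) for x in x_train]
    h = sorted(dists)[k - 1]                 # window half-width
    w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists] if h > 0 else [0.0] * n
    if sum(w) == 0.0:
        # Degenerate window: fall back to a plain average of the neighbours.
        w = [1.0 if d <= h else 0.0 for d in dists]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, x_train)) / sw
    my = sum(wi * y for wi, y in zip(w, y_train)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, x_train))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, x_train, y_train))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x_new - mx)


def loo_cv_error(x, y, span):
    """Mean squared leave-one-out prediction error over the QCs."""
    errors = []
    for i in range(len(x)):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        errors.append((loess_predict(x_rest, y_rest, x[i], span) - y[i]) ** 2)
    return sum(errors) / len(errors)


def loess_normalize(qc_x, qc_y, sample_x, sample_y, spans=(0.3, 0.5, 0.75)):
    """Tune the span on QCs, then divide samples by the fitted QC trend."""
    span = min(spans, key=lambda s: loo_cv_error(qc_x, qc_y, s))
    target = median(qc_y)  # rescale so corrected values stay on the QC scale
    fitted = [loess_predict(qc_x, qc_y, x, span) for x in sample_x]
    return [y / f * target for y, f in zip(sample_y, fitted)], span


# Hypothetical data: signal drifts upward with acquisition order.
qc_x = [float(i) for i in range(0, 100, 10)]
qc_y = [100.0 + x for x in qc_x]
sample_x = [float(i) for i in range(5, 95, 10)]
sample_y = [50.0 * (1 + 0.01 * x) for x in sample_x]  # constant true level

normalized, chosen_span = loess_normalize(qc_x, qc_y, sample_x, sample_y)
```

In this toy case the drift is smooth and identical in QCs and samples, so the corrected samples collapse to a single value. Real samples also carry biological variance, so performance on QCs is only a proxy for performance on samples.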

Figure panels: span too high; span just right?

Cannot assume convergence of training and test performance because test data have analytical + biological variance

LOESS Normalization

Avoiding overfitting is critical when using LOESS normalization

Example LOESS Normalization

Figure panels: raw; span = 0.75; span = 0.005

Metabolomic Data Case Study I

GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No internal standards (ISTDs)

Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization

Batch Performance (GC-TOF Raw)

Within batch
• Median: 26
• Min: 19
• Max: 69

MedianRSD   count   cumulative %
10-20       3       2
20-30       98      76
30-40       26      96
40-50       3       98
50-60       1       99
60-70       1       100

MedianRSD   count   cumulative %
0-10        10      3
10-20       83      30
20-30       100     62
30-40       69      84
40-50       32      94
50-60       6       96
60-70       3       97
70-80       5       98
80-90       1       99
90-100      1       100

Analyte Performance (GC-TOF Raw)

Within batch
• Median: 24
• Min: 7
• Max: 79

PCA (GC-TOF Raw)

Within batches
• Median: 23
• Min: 17
• Max: 69

MedianRSD   count   cumulative %
10-20       25      23
20-30       67      85
30-40       15      99
40-50       1       100
60-70       1       101

Batch Performance (GC-TOF BR)

MedianRSD   count   cumulative %
0-10        17      6
10-20       103     39
20-30       112     75
30-40       57      93
40-50       12      97
50-60       5       99
60-70       3       100
70-80       1       100

Across batches
• Median: 24
• Min: 7
• Max: 79

Batch Performance (GC-TOF BR)

PCA (GC-TOF BR)

BR Normalization Limitations

• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match

Within batches
• Median: 19
• Min: 11
• Max: 58

MedianRSD   count   cumulative %
10-20       75      57
20-30       51      96
30-40       4       99
40-50       1       99
50-60       1       100

Batch Performance (GC-TOF LOESS)

MedianRSD   count   cumulative %
0-10        17      6
10-20       103     39
20-30       112     75
30-40       57      93
40-50       12      97
50-60       5       99
60-70       3       100
70-80       1       100

Across batches
• Median: 19
• Min: 2.9
• Max: 66

Batch Performance (GC-TOF LOESS)

PCA (GC-TOF LOESS)

LOESS Normalization Limitations

Figure panels: raw; normalized

LOESS normalization can inflate variance when:
• overtrained
• training examples do not match test set

Sum mTIC Normalization (GC-TOF)

Improved performance over raw and BR, but converts the data from absolute magnitudes to compositional (relative) values

Sum mTIC Normalization (GC-TOF)

Poor removal of trends due to acquisition time, but limits the magnitude of outlier samples compared to other approaches

Figure panels: Raw; mTIC Normalized (x-axis: time)

Metabolomic Data Case Study II

LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)

• NOMIS (IS = ISTD)
• qcISTD

Internal Standards Normalization

Figure axes: analyte; retention time

Internal standards (ISTD)
• qcISTD (QC optimized metabolite/ISTD)

• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination ISTDs)

• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)

NOMIS

ISTD Based Normalizations (LC/Q-TOF)

• NOMIS (linear combination of optimal ISTDs; Sysi-Aho et al., 2007)

• qcISTD (QC optimized ISTD strategy)

PC 38:6

Poor performance with NOMIS

qcISTD Normalization

Use QC samples to:
1. Evaluate analyte %RSD before and after corrections using all ISTDs
2. Select analyte/ISTD combinations with %RSD improvement over raw data at some threshold (e.g. 10%)
3. Correct sample analytes with the QC-defined ISTD if ISTD recovery is above some minimal threshold (e.g. > 20% of median)

• Subject to overfitting
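The qcISTD steps can be sketched for a single analyte as follows. This is a hypothetical illustration, not the author's code: the function names and values are invented, the improvement threshold is assumed to be a relative %RSD reduction, recovery is assumed to be judged against the ISTD's QC median, and the corrected ratio is rescaled by that median to keep values on the original scale.

```python
from statistics import mean, median, stdev


def pct_rsd(values):
    """Percent relative standard deviation: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def select_istd(qc_analyte, qc_istds, min_improvement=10.0):
    """Pick the ISTD whose ratio lowers QC %RSD the most, or None.

    min_improvement is the required relative %RSD reduction versus raw (in %).
    """
    raw_rsd = pct_rsd(qc_analyte)
    best_name, best_rsd = None, raw_rsd
    for name, istd in qc_istds.items():
        rsd = pct_rsd([a / i for a, i in zip(qc_analyte, istd)])
        if rsd < best_rsd:
            best_name, best_rsd = name, rsd
    improvement = 100.0 * (raw_rsd - best_rsd) / raw_rsd
    return best_name if improvement >= min_improvement else None


def correct_sample(analyte, istd, istd_qc_median, min_recovery=0.2):
    """Ratio-correct one sample value unless ISTD recovery is too low."""
    if istd <= min_recovery * istd_qc_median:
        return analyte  # leave uncorrected when the ISTD signal is unreliable
    return analyte / istd * istd_qc_median


# Hypothetical QC data: istd_a tracks the analyte's variation, istd_b does not.
qc_analyte = [100.0, 120.0, 80.0, 110.0]
qc_istds = {
    "istd_a": [10.0, 12.0, 8.0, 11.0],
    "istd_b": [10.0, 10.1, 9.9, 10.0],
}

chosen = select_istd(qc_analyte, qc_istds)    # selects "istd_a"
istd_median = median(qc_istds["istd_a"])      # 10.5

good = correct_sample(60.0, 6.0, istd_median)      # corrected to 105.0
low_recovery = correct_sample(60.0, 1.0, istd_median)  # left at 60.0
```

Because the ISTD is chosen by its performance on the same QCs used to judge it, the selection can overfit; holding out a test set of QCs is one way to check this.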

191 of 326 (60%) are ISTD corrected

qcISTD Normalization

Figure panels: ISTD used by retention time (Rt); total number of analytes corrected by ISTD

Optimal Lipidomic ISTDs

Normalizations (LC-Q-TOF)

LOESS performs very poorly for two metabolites

• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision

PCA (LC/Q-TOF)

Figure panels (%RSD): Raw (13); qcISTD (9); LOESS (12); qcISTD + LOESS (8)

Only normalizations that include LOESS effectively remove analytical batch effects

Conclusion

• Comparison of common data normalization approaches suggests that in addition to ISTD corrections, LOESS (analyte-specific, non-linear adjustment based on QC performance at various data acquisition times) is superior to batch based corrections.

• Further validations need to be completed to confirm the effects of normalizations on samples’ variance

• These findings suggest that inclusion of “batch” as a covariate in statistical models will not fully account for analytical variance

R code for all normalization functions can be found at: https://github.com/dgrapov/devium/blob/master/R/Devium%20Normalization.r

metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154