
1

Statistics in Metabolomics

David Banks

ISDS

Duke University

2

1. Background

Metabolomics is the next step after genomics and proteomics.

There are about 25,000 genes, most of which have unknown functions.

There are about 1,000,000 proteins, most of which are unstudied.

3

[Figure: the omics hierarchy from DNA to RNA to protein to biochemicals (metabolites): genomics – 25,000 genes; transcriptomics – 100,000 transcripts; proteomics – 1,000,000 proteins; metabolomics – 2,400 compounds. Example chemical structures shown.]

4

In contrast to the other *omics areas, there are only about 900 main metabolites, and we know their chemical structures. Also, we know (pretty well) the biochemical pathways that determine their production rates.

Metabolites are low-molecular-weight compounds produced in the course of processing raw materials.

5

Some common metabolites include: cholesterol; glucose, sucrose, and fructose; amino acids; lactic acid and uric acid; ATP and ADP; and drug metabolites, both legal and illegal.

These are produced in metabolic pathways, such as the Krebs (citrate) cycle for oxidation of glucose.

6

7

These pathways contain important information about the amount of each metabolite:

Stoichiometric equations show how much material is produced in a given reaction; i.e., mass balance.

Rate equations govern the speed at which reactions take place and the location of the Gibbs equilibrium.

This gives metabolomics an edge.

8

[Figure: mapping a biochemical profile to metabolic pathways.]

9

The purposes of metabolomics are:

Early detection of disease, such as necrosis, ALS, Alzheimer’s, and infection or inflammation.

Assessment of toxicity (especially liver toxicity) in new drugs.

Diet strategies and drug testing.

Elucidating biochemical pathways.

There is less raw information than for other *omics, but more context.

10

2. Measurement Issues

To obtain data, a tissue sample is taken from a patient. Then:

The sample is prepped and put into wells on a silicon plate.

Each well’s aliquot is subjected to gas and/or liquid chromatography.

After separation, the sample goes to a mass spectrometer.

11

The sample prep involves stabilizing the sample, adding spiked-in calibrants, and creating multiple aliquots (some are frozen) for QC purposes. This step is done robotically.

Sources of error in this step include: within-subject variation, within-tissue variation, contamination by cleaning solvents, calibrant uncertainty, and evaporation of volatiles.

12

Gas chromatography creates an ionized aerosol, and each droplet evaporates to a single ion. This is separated by mass in the column, then ejected to the spectrometer.

Sources of error in this step include: imperfect evaporation, adhesion in the column, and ion fragmentation or adduct formation.

13

The Fourier transform mass spectrometer determines the mass-to-charge ratio of an ion from the field strength required to keep the ion spinning in a circle. This avoids the entry-time uncertainty in TOF machines, so the main source of error is uncertainty about the field strength.

Some laboratories use MALDI-TOF equipment, and the error sources are slightly different.

14

15

The result of this is a set of m/z ratios and timestamps for each ion, which can be viewed as a 2-D histogram in the m/z × time plane.

One now estimates the amount of each metabolite. This entails normalization, which also introduces error.

The caveats pointed out in Baggerly et al. (Proteomics, 2003) apply.
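The talk does not specify the normalization used, so the sketch below only illustrates one common scheme: scale each sample by the summed intensity of its spiked-in calibrant features. The function name and data are hypothetical.

```python
import numpy as np

def normalize_to_calibrants(intensity, calibrant_cols):
    """Illustrative normalization of a samples-by-features intensity matrix.

    Each sample (row) is divided by its summed intensity over the spiked-in
    calibrant features, correcting for sample-to-sample loading differences.
    """
    scale = intensity[:, calibrant_cols].sum(axis=1, keepdims=True)
    return intensity / scale

# Toy usage: 5 samples x 100 features, with features 0-2 as the calibrants.
rng = np.random.default_rng(0)
spectra = rng.gamma(shape=2.0, size=(5, 100))
normalized = normalize_to_calibrants(spectra, calibrant_cols=[0, 1, 2])
```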

16

17

3. Statistical Problems

Understanding the uncertainty budget in metabolomic data, which entails both quality control and cross-platform comparisons.

Identifying the peaks in the m/z × t plane, and estimating the quantities of specific metabolites.

Finding markers for disease or toxicity, or measuring change.

18

3.1 Uncertainty

The classical NIST approach to this is to:

build a model for the error terms, do a designed experiment with replicated measurements, and fit a measurement equation to the data.

See Cameron, “Error Analysis,” ESS, Vol. 9, 1982.

19

Let z be the vector of raw data, and let x be the estimates. Then the measurement equation is:

G(z) = x = µ + ε

where µ is the vector of unknown true values and ε is decomposable into separate components.

For metabolite i, the estimate Xi is:

gi(z) = ln Σj wij ∫∫ [sm(z) – c(m,t)] dm dt.

20

The law of propagation of error (this is essentially the delta method) says that the variance in X is about

Σi=1,…,n (∂g/∂zi)² Var[zi] + 2 Σi<k (∂g/∂zi)(∂g/∂zk) Cov[zi, zk].

The weights depend upon the values of the spiked-in calibrants, so this gets complicated.
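As a concrete illustration of the propagation-of-error formula (not code from the talk), the gradient can be taken numerically and combined with an assumed covariance matrix for z; the measurement function g and the covariance below are hypothetical placeholders.

```python
import numpy as np

def delta_method_variance(g, z, cov_z, eps=1e-6):
    """Approximate Var[g(z)] via the law of propagation of error (delta method).

    g     : scalar-valued measurement function of the raw-data vector z
    z     : observed raw-data vector (the point at which the gradient is taken)
    cov_z : covariance matrix of z (variances on the diagonal, covariances off it)
    """
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(len(z)):                      # finite-difference gradient of g
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (g(z + step) - g(z - step)) / (2 * eps)
    # grad' Cov grad = sum_i gi^2 Var[zi] + 2 sum_{i<k} gi gk Cov[zi, zk]
    return float(grad @ cov_z @ grad)

# Toy usage with a made-up measurement function and covariance:
g = lambda z: np.log(z.sum())
print(delta_method_variance(g, z=[3.0, 2.0, 5.0], cov_z=0.01 * np.eye(3)))
```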

21

Cross-platform experiments are also crucial for medical use. This leads to key comparison designs. Here the same sample (or aliquots of a standard solution or sample) is sent to multiple labs. Each lab produces its own spectrogram.

It is impossible to decide which lab is best, but one can estimate how to adjust for interlab differences.

22

The Mandel bundle-of-lines model is what we suggest for interlaboratory comparisons. This assumes:

Xik = αi + βi θk + εik

where Xik is the estimate at lab i for metabolite k, θk is the unknown true quantity of metabolite k, and

εik ~ N(0, σik²).

23

To solve the equations given values from the labs, one must impose constraints. A Bayesian can put priors on the laboratory coefficients and the error variance.

Metabolomics needs a multivariate version, with models for the rates at which compounds volatilize.

We plan to use this model to compare the Metabolon lab in RTP to Chris Newgard’s lab at Duke.
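A minimal fitting sketch for the bundle-of-lines model, under assumptions not stated in the talk: alternating least squares with lab 0 taken as the reference (α = 0, β = 1) to impose identifiability, and simulated data standing in for real lab measurements. A Bayesian fit with priors on the laboratory coefficients, as mentioned above, is the alternative.

```python
import numpy as np

def fit_bundle_of_lines(X, n_iter=200):
    """Fit X[i, k] ~ alpha[i] + beta[i] * theta[k] by alternating least squares.

    Identifiability constraint: lab 0 is the reference, with alpha[0] = 0 and
    beta[0] = 1; the other labs' lines are estimated relative to it.
    """
    n_labs, _ = X.shape
    alpha, beta = np.zeros(n_labs), np.ones(n_labs)
    theta = X[0].copy()                          # initialize at the reference lab
    for _ in range(n_iter):
        for i in range(1, n_labs):               # straight line for each other lab
            beta[i], alpha[i] = np.polyfit(theta, X[i], deg=1)
        # least-squares update of the unknown true quantities theta_k
        theta = ((X - alpha[:, None]) * beta[:, None]).sum(axis=0) / (beta ** 2).sum()
    return alpha, beta, theta

# Toy usage with three simulated labs (offsets and slopes are made up):
rng = np.random.default_rng(0)
truth = rng.uniform(1, 10, size=50)
X = np.array([a + b * truth + rng.normal(0, 0.1, 50)
              for a, b in [(0.0, 1.0), (0.5, 1.2), (-0.3, 0.9)]])
alpha, beta, theta = fit_bundle_of_lines(X)
```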

24

3.2 Peak Identification

A classic problem in proteomics is to locate peaks and estimate their area or volume.

Unlike in proteomics, metabolite peak locations are mostly known, so Bayesian methods seem good (cf. Clyde and House). Metabolon uses proprietary software.
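Neither Metabolon’s software nor the Bayesian approach of Clyde and House is shown here, so the sketch below only illustrates the simpler idea that, with peak locations roughly known, quantification reduces to integrating the signal near each expected retention time; the function and its inputs are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.integrate import trapezoid

def quantify_known_peaks(t, intensity, expected_times, window=0.5):
    """Crude, non-Bayesian quantification when peak locations are roughly known.

    For each expected retention time, snap to the nearest detected local maximum
    and estimate the amount as the area under the curve within +/- window.
    """
    idx, _ = find_peaks(intensity, prominence=0.05 * intensity.max())
    peak_times = t[idx]
    areas = []
    for t0 in expected_times:
        apex = peak_times[np.argmin(np.abs(peak_times - t0))]   # nearest real peak
        near = np.abs(t - apex) <= window
        areas.append(trapezoid(intensity[near], t[near]))
    return np.array(areas)

# Toy usage: two Gaussian peaks near t = 3 and t = 7 with known approximate locations.
t = np.linspace(0, 10, 2000)
signal = np.exp(-(t - 3) ** 2 / 0.02) + 0.5 * np.exp(-(t - 7) ** 2 / 0.02)
print(quantify_known_peaks(t, signal, expected_times=[3.0, 7.0]))
```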

25

[Figure: GC data.]

26

[Figure: tissue differences.]

27

[Figure: metabolite profiles by cancer type: CNS cancer, leukemia, ovarian cancer, breast cancer, melanoma, prostate cancer, colon cancer, non-small cell lung cancer, and renal cancer.]

28

3.3 Data Mining

Different tools are appropriate for different kinds of metabolomic studies. The work we have done focuses on:

Random Forests, Support Vector Machines, and Robust Singular Value Decomposition.

29

We had abundance data on 317 metabolites from 63 subjects. Of these, 32 were healthy, 22 had ALS but were not on medication, and 9 had ALS and were taking medication.

The goal was to classify the two ALS groups and the healthy group.

Here p > n. Also, some abundances were below the detection limit.

30

Using the Breiman-Cutler code for Random Forests, the out-of-bag error rate was 7.94%; 29 of the ALS patients and 29 of the healthy patients were correctly classified.

20 of the 317 metabolites were important in the classification, and three were dominant.

RF can detect outliers via proximity scores. There were four such outliers.
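The study used the Breiman-Cutler Random Forests code; the rough scikit-learn sketch below shows the same workflow (out-of-bag error, variable importance, and proximity-based outlier checks), with toy data standing in for the real 63 × 317 abundance matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the abundance data: 63 subjects x 317 metabolites, 3 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 317))
y = np.repeat(["healthy", "ALS", "ALS+med"], [32, 22, 9])

rf = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1 - rf.oob_score_                             # out-of-bag error rate
top20 = np.argsort(rf.feature_importances_)[::-1][:20]    # most important metabolites

# Proximity of two subjects = fraction of trees in which they share a leaf.
leaves = rf.apply(X)                                       # (n_subjects, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Simple outlier flag: subjects with unusually low average proximity to their class.
same_class = y[:, None] == y[None, :]
avg_prox = np.where(same_class & ~np.eye(len(y), dtype=bool), prox, np.nan)
outlier_score = 1.0 / np.nanmean(avg_prox, axis=1)
```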

31

Several support vector machine approaches were tried on this data:

Linear SVM, polynomial SVM, Gaussian SVM, L1 SVM (Bradley and Mangasarian, 1998), and SCAD SVM (Fan and Li, 2000).

The SCAD SVM had the best leave-one-out error rate, 14.3%.

32

The L1 SVM attempts to mimic the automatic variable selection in the LASSO (Tibshirani, 1996) by solving the programming problem:

minb,w Σi [1 – yi(b + wTxi)]+ + λ Σk |wk|

where the first sum is over the n observations and the second is over the p variables.

SCAD replaces the L1 penalty with a nonconvex penalty.
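As an illustration only (not the implementation used in the study), scikit-learn’s LinearSVC with an L1 penalty gives a similar sparsity-inducing classifier; note that it minimizes a squared hinge loss rather than the hinge loss written above, and the data here are a toy stand-in.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy stand-in for a binary (ALS vs. healthy) problem with p > n.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 317))
y = rng.integers(0, 2, size=63)

# L1-penalized linear SVM: smaller C means a heavier penalty and fewer nonzero
# coefficients, i.e., automatic variable selection as with the LASSO.
clf = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1, max_iter=20000),
)
clf.fit(X, y)

w = clf.named_steps["linearsvc"].coef_.ravel()
selected = np.flatnonzero(w != 0)                  # metabolites kept by the penalty
print(len(selected), "metabolites selected")
```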

33

The SCAD SVM selected 18 of the metabolites as being important; the L1 selected 32. This suggests that the automatic variable selection in L1 SVM is not very effective.

A further multiple-tree analysis with FIRMPlus™ software from the Golden Helix Co. did not achieve good classification.

So Random Forests wins, and the selected metabolites make sense.

34

Robust SVD (Liu et al., 2003) is used to simultaneously cluster patients (rows) and metabolites (columns). Given the patient-by-metabolite matrix X, one writes

Xik = ri ck + εik

where ri and ck are row and column effects. Then one can sort the array by the effect magnitudes.

35

To do an rSVD, use alternating L1 regression, without an intercept, to estimate the row and column effects. First fit the row effect as a function of the column effect, and then reverse. Robustness stems from not using OLS.

Doing similar work on the residuals gives the second singular value solution.
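A minimal sketch of the alternating-L1 idea, with the assumption that each no-intercept L1 regression is solved exactly as a weighted median; this is a rank-one illustration, not the Liu et al. code.

```python
import numpy as np

def weighted_median(values, weights):
    """Minimizer of sum_i weights[i] * |values[i] - m| over m."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    return values[np.searchsorted(cum, 0.5 * cum[-1])]

def l1_fit_through_origin(y, x):
    """argmin_r sum_k |y[k] - r x[k]| = weighted median of y/x with weights |x|."""
    keep = x != 0
    return weighted_median(y[keep] / x[keep], np.abs(x[keep]))

def rank_one_rsvd(X, n_iter=50):
    """Alternating L1 regressions for the robust rank-one fit X[i, k] ~ r[i] c[k]."""
    n, p = X.shape
    c = np.median(X, axis=0)                     # initial column effects
    r = np.ones(n)
    for _ in range(n_iter):
        r = np.array([l1_fit_through_origin(X[i, :], c) for i in range(n)])
        c = np.array([l1_fit_through_origin(X[:, k], r) for k in range(p)])
    return r, c

# Toy usage; the second singular component would come from repeating the fit
# on the residual matrix X - np.outer(r, c).
rng = np.random.default_rng(0)
X = np.outer(rng.uniform(1, 2, 20), rng.uniform(1, 2, 30)) + rng.normal(0, 0.05, (20, 30))
r, c = rank_one_rsvd(X)
```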

36

37

3.3.1 Preterm Labor

The NIH wanted to decide whether amniotic fluid samples from women in preterm labor could support classification:

term delivery, preterm delivery with inflammation, and preterm delivery without inflammation.

38

The analysis had samples from 113 women in preterm labor. We tried all of the usual classification methods.

As before, Random Forests gave the best results. The various SVMs were about 5-10% less predictive.

The main information was contained in amino acids and carbohydrates.

39

Confusion matrix (rows are the true classes, columns the predicted classes):

True \ Predicted   Term   Inflamm.   No Inf.
Term                 39       1          0
Inflamm.              7      32          1
No Inf.               2       2         29

RF accuracy was 100/113 ≈ 88.5%.

40

For those with term delivery, amino acids were low, carbohydrates were high.

For those who had preterm delivery without inflammation, both amino acids and carbohydrates were low.

For those who had inflammation, the carbohydrates were very low and the amino acids were high.

41

My collaborators in this research are:

Chris Beecher, Metabolon, Inc.
Adele Cutler, USU
Leanna House, Duke University
Jackie Hughes-Oliver, NCSU
Xiaodong Lin, U. of Cincinnati
Susan Simmons, UNC-Wilmington
Young Truong, UNC-Chapel Hill
Stan Young, NISS