Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and...

34
Statistical Issues in Cancer Proteomics Alexander Ploner Medical Epidemiology & Biostatistics Karolinska Institutet http://www.meb.ki.se/aleplo November 2, 2004

Transcript of Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and...

Page 1: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

Statistical Issues in Cancer Proteomics

Alexander Ploner

Medical Epidemiology & BiostatisticsKarolinska Institutet

http://www.meb.ki.se/∼aleplo

November 2, 2004

Page 2: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

1

Overview

• Proteomics & Cancer Proteomics

– Why?– Why is it hard?

• Technology & Data

– How to measure proteins– Issues: biological, technical, statistical

• TOF-MS

• Experiences with the analysis of microarray data and how they maytranslate into working with proteomics data

Page 3: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

2

The Central Dogma

• DNA: What the cell can do

• mRNA: What the cell tries to do

• Protein: What the cell does. . . potentially

Source: http://nobelprize.org/medicine/educational/dna/

Page 4: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

3

Proteins

Functions Enzymes, hormones,growth factors, transcriptionfactors, membrane receptors,transporters, antibodies, . . .

Modifications Phosphorylationof proteins can regulate theirbiological activity

Splice variants Often expressedsub-sequences of a gene(exons) can be selectivelyarranged to yield differentproteins

Structure

Page 5: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

4

The Proteome

• The collection of proteins found in a specific cell type in a particularenvironment OR any useful sub- or superset thereof (human,mitochondrial p.)

• Dynamic range from 1 copy/cell to 106 copies/cell

• The human genome of ca. 30000 genes codes for many more proteins:

– alternative splicing (2 to 5 fold increase)– post-translational modifications (2 to 5 fold increase)– alternative initiation, RNA editing, ribosomal frameshifting, protein

splicing, protein aging, . . .

Page 6: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

5

’Genes were easy ’ – Cancer Proteomics

Of general biological interest:

• Protein inventory of reference

• Cell signaling

• Characterization of active proteins

• Functional protein interactions

• . . .

Clinical perspectives:

• Biomarker discovery for

– early diagnostic– treatment response

• Identification of drug targets

Page 7: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

6

Technologies

• 2d gel electrophoresis

• SELDI/MALDI mass spectrometry

• Protein Chips

– the next big thing– analogous to mRNA chips

• Pre-fractionation

– separate proteins by physical/chemical properties– high-throughput possible

Page 8: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

7

2d Gel Electrophoresis

• Proteins in a sample are separated by isoelectric point and molecularmass within a polyacrylamide gel & stained ⇒ protein expression map

• Technically known since the 70s, high-throughput in combination with

– scanning of expression maps,– digital quantification of spot sizes,– protein identification for interesting spots (through MS)

• Resolution of several 1000 proteins per map

• Active development [8, 6]:

– improved digital processing– combinations with pre-fractionation, mass spectrometry– two-color systems

Page 9: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

8

Page 10: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

9

Issues with 2D Gels

• Biological:

– biased towards high-abundance proteins– no suitable for high mass, high hydrophobia, low copy number

• Technical:

– poor linear response– spot recognition is hard– spot matching between gels is hard

• Practical: quantification requires expertise & time

• Statistical: little systematic treatment

Page 11: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

10

2D Gel Data

• Observations may be heavily censored and/or missing

• Often useful structure:

– percentage of unobserved depends on expression level of observed– observations log nicely– mean/variance relationship

• However, mostly ad-hoc methods, e.g.:

– Breast cancer samples with < 10% unobserved: impute by iterativelyregressing on principal components

– Colorectal cancer samples with 50% unobserved: use missingness toseperate sporadic and familial cases

Page 12: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

11

Mass Spectrometry

• MALDI-TOF MS: Matrix-assisted laser desorption/ionisation-time offlight mass spectrometry

– sample proteins are embedded in energy-absorbent matrix molecules– laser stimulates the matrix to transfer energy to proteins⇒ proteins get ionized

– an electric field hurls the ions down the flight tube– a detector counts impacts in regular time intervals:∗ count ≈ number of proteins∗ time of flight ≈ mass/charge of proteins

• SELDI-TOF MS: surface-enhaced laser desorption/ionisation-time offlight mass spectrometry

– uses specially treated protein chips that bind a specific class ofprotein

Page 13: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

12A data-analytic strategy for protein biomarker discovery 451

Laser

Flight tube

1234

678

5

Detector

Mass/Charge

(+)

(+)

(+)

Data Analysis Workstation

(+) (+) (+)

Laser1234

678

5

Detector

Mass/Charge

Desorption/Ionization

(+)

(+)

(+) (+) (+) (+)

Fig. 1. The outline of the Surface-Enhanced Laser Desorption/Ionization (SELDI) ProteinChip technology.

subsequent stages of the biomarker discovery/validation process. For this purpose, we retrieved serumsamples of 386 subjects, approximately equally distributed across four groups: late-stage prostate cancer(N = 98), early-stage prostate cancer (N = 99), BPH (N = 93), and normal controls (N = 96), storedin the serum repository of the EVMS Virginia Prostate Center. Late-stage prostate cancer patients had abiopsy-proven cancer staged C or D, and Prostate Specific Antigen (PSA) concentrations greater than 4ng/ml. Early-stage prostate cancer patients had a biopsy-proven cancer staged A or B, and PSA greaterthan 4 ng/ml. BPH patients had PSA values between 4 ng/ml and 10 ng/ml, low PSA velocities, and atleast two negative biopsies. Normal controls were aged 50 or older, corresponding to the age range ofcancer and BPH patients, with a PSA level less than 4 ng/ml and normal digital rectal exam.

Each of the retrieved serum samples was assayed at the EVMS for protein expression by theSELDI ProteinChip Array technology of Ciphergen Biosystems Inc. (Wright et al., 1999; Rubin andMerchant, 2000; Adam et al., 2001; Srinivas et al., 2001). The SELDI technology is a time-of-flight massspectrometry, shown in Figure 1, with a special ProteinChip Array whose surface captures proteins usingchemically or biologically defined protein docking sites. Proteins are captured on the chip surface, purifiedby washing the surface, and crystallized with small molecules called ‘matrix’ or ‘energy-absorbingmolecules’ (EAMs) whose function is to absorb laser energy and transfer it to proteins. Energized proteinmolecules fly away from the surface into a time-of-flight tube where the time for the molecules to flythrough the tube is a function of the molecular weight and charge of the protein. A detector at the end ofthe tube measures the ‘intensity’ of proteins at each discrete time of flight and outputs about 48 000 datapoints of (time of flight, intensity) pairs. Each discrete time of flight corresponds uniquely to a ratio ofthe molecular weight of a protein to the number of charges introduced by the ionization. SELDI output,therefore, produces about 48 000 data points of (mass/charge, intensity) pairs. Our analyses used 11 175data points per sample covering the mass/charge range of 1500–20 000. The lower limit of the range wasplaced at 1500 because intensity measures of proteins below this limit are distorted by those of EAMs.Twenty thousand was considered to be a reasonable upper limit of the range where the SELDI assays aremost sensitive. Note that PSA, a 28 741-Da glycoprotein that is currently used in the screening of prostatecancer, is out of the mass/charge range of 1500–20 000 unless the molecules were doubly charged (anuncommon event).

Source: [12]

4000 6000 8000 10000

05

1525

Mass/Charge

Inte

nsity

Page 14: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

13

Workflow

Sample collection ⇒ Run samples (1/many chips)

Normalization/Calibration

Peak selection ⇐ Peak detection

Protein identification

Page 15: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

14

This is really about:

Statistical Challenges in Biomarker Discovery forCancer Using Time-of-Flight Mass Spectrometry

Page 16: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

15

Biomarkers

• Proteins that are somehow associated with incidence, treatmentresponse, or outcome of the disease

• Current biomarkers:

– e.g. PSA for prostate cancer– usually based on disease mechanism and/or antibody screening– weak (low sensitivity/specificity)

• In future hopefully:

– systematic exploration of possible markers– reliable panels of markers

Page 17: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

16

Biomarkers for Early Cancer Detection

• This is HARD when aimed at population screening:

– detection before clinical onset– very high specificity required– fairly non-invasive method of sample collection

• Somewhat easier for screening high-risk groups

• Suitable biological material:

– urine (bladder c)– nipple aspirate fluid (breast c)– serum – hopefully general⇒ Human Plasma Proteome Project (at HUPO)

• [4] suggest a five-step framework for systematic discovery

Page 18: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

17

Proteomics, Ovarian Cancer, and A New Paradigm?

Petricoin et al., 2002 [7]:

• Used a genetic algorithm (combined peak selection/detection) toestablish a predictive pattern of peaks for 50 cases and 50 controls

• Validation on 116 additional samples:

– 100% sensitivity– 95% specificity

• ’. . . a new diagnostic paradigm’

Page 19: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

18

Not yet . . .

• Ad Petricoin et al.: [9, 1]

– greatest difference between cases and controls outside ofmeasurement range (< 500kD)

– sample processing causes blatant changes– separating features not reproducible across experiments– suspect mass calibration

• A general review of SELDI profiling papers finds [3]

– discrepancies between studies,– failure to reproduce known biomarkers,– re-discovery of old biomarkers discarded for lack of sensitivity and

specificity

Page 20: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

19

Workflow for Biomarkers/SELDI

(Hypothesis) ⇒ Experimental Design ⇒ Sample Collection

Peak Detection ⇐ Normalization ⇐ Quality Control I

Peak Selection ⇒ (Cross-) Validation ⇒ Protein Identification

Quality Control II

Page 21: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

20

Experimental Design

• Identification of potential confounders in sample collection andprocessing

• Protein chip blocking structure

• Technical replication

• Pre-fractionation

Page 22: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

21

Normalization

• Remove techhnical variation unrelated to biological sample properties

• Something has to remain constant between samples

• Mostly ad-hoc:

– total ion current,– area under curve,– baseline

Page 23: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

22

Peak Detection

• aka feature selection, aka variable selection

• Mostly ad hoc:

– spectrum-by-spectrum,– based on local extremes & smoothing

• Suggestion K. Coombes: smooth mean spectra via waveletdecomposition & do peak detection there

• Project at MEB:

– split spectra into small windows (ca. 5 consecutive observations)– compute a smoothed F-statistic with spectrum as grouping factor

for each window– identify discriminatory peaks via false discovery rate

Page 24: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

23

Peak selection

• aka discrimination, aka biomarker (pattern) idenfication

• Often a univariate pre-selection step

• Conventional statistical methods have been employed:

– linear discrimination,– quadratic disrimination,– logistic regression with variable selection

• Numerous machine learning methods have been employed:

– classification trees,– neural networks,– support vector machines,

Page 25: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

24

– k nearest neighbors,– bagging & boosting of all of the above

• Interestingly enough:

– conventional statistical tools usually do not do much worse than themachine learning approaches,

– modern statistical tools adapted for microarray data have not beenused (reduced rank discrimination, empirical Bayes methods)

Page 26: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

25

Cross-Validation

• Classification on the original data is biased towards the original data*specifically* for high-dimensional data

• Validation: apply the prediction to a new data set, optimally.

– split into test & trainings set,– fit the model to the trainings set,– predict the test set for an unbiased error estimate

• Less wasteful: average error rates over repeated random partitions(training/test) of the data (in extremis: leave-one-out)

• It still needs stressing [10]: any informative pre-selection of peaks mustalso be cross-validated.

Page 27: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

26

• A useful baseline: crossvalidate for a randomly shuffled classification.

• All this comes much more natural to machine learning than to statistics

Page 28: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

27

Prediction & Quality Control II

• After establishment of a panel of biomarkers, status of new sampleswill be predicted

• New samples need to be assessed whether they can be reasonably beassumed to come from the original population

• [2] suggest:

– establish a reference sample and its properties– spot the reference sample along with the new samples– use principal components & Mahalanobis distance to identify

aberrant chips

Page 29: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

28

Microarrays and Proteomics

Page 30: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

29

On our way to a full Omic Analysis?

(Genome, Proteome, Transcriptome, Metabolome . . . )

Depends:

• Only if we connect all available biological data can we hope to model &understand biological activity on the molecular level in a comprehensiveway.⇒ Doh

• All omics data is exchangeable & can be fed into the same black box.⇒ I don’t think so . . .

Page 31: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

30

Some useful reading

• Useful review issues:

– Proteomics: Nature Insight Overview, Vol. 422, 2003– Processing: Proteomics 3, 2003 (First Annual Proteomics Data

Mining Conference)

• The ovarian cancer study with follow-up: [7, 9, 1, 3]

• Applications & comparisons: [12, 11, 10]

• Machine learning for statisticians: [5]

Page 32: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

31

References

[1] K.A. Baggerly, J.S. Morris, and K.R. Coombes. Reproducibility of seldi-tof protein patterns in serum:comapring datasets from different experiments. Bioinformatics, 20(5):777–785, 2004.

[2] K.R. Coombes, H.A. Fritsche Jr., C. Clarke, J.-N. Chen, K.A. Baggerly, J.S. Morris, L.-C. Xiao, M.C.Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from nippleaspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem., 49(10):1615–1623,2003.

[3] E.P. Diamandis. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool. Molecular& Cellular Proteomics, 3:367–378, 2004.

[4] R. Etzioni, N. Urban, S. Ramsey, M. McIntosh, S. Schwartz, B. Reid, J. Radich, G. Anderson, andL. Hartwell. The case for early detection. Nat. Rev. Cancer, 3, 2003.

[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

Page 33: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

32

[6] T. Kondo, M. Seike, Y. Mori, K. Fuji, T. Yamada, and S. Hirohashi. Application of sensitiveflourescent dyes in linkage of laser microdissection and two-dimensional gel electrophoresis as acancer proteomic study tool. Proteomics, 3:1758–1766, 2003.

[7] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills,C. Simone, D.A. Fishman, E.C. Kohn, and L.A. Liotta. Use of proteomic patterns in serum to identifyovarian cancer. Lancet, 359:572–577, 2002.

[8] T. Rabilloud. Two-dimensional gel electrophoresis in proteomics: Old, old fashioned, but it stillclimbs up the mountains. Proteomics, 2:3–10, 2002.

[9] J.M. Sorace and M. Zhan. A data review and re-assessment of ovarian cancer serum proteomicprofiling. BMC Bionformatics, 4:24, 2003.

[10] M. Wagner, D.N. Naik, A. Pothen, S. Kasukurti, R.R. Devineni, B.-L. Adam, O.J. Semmes, andG.L. Wright Jr. Computational protein biomarker prediction: a case study for prostate cancer. BMCBionformatics, 5(26), 2004.

[11] B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, and H. Zhao.Comparisons of statistical methods for classification of ovarian cancer using mass spectrometry data.Bioinformatics, 19(13):1636–1643, 2003.

Page 34: Statistical Issues in Cancer Proteomicsfafner.meb.ki.se/.../talks/StatinCancerProt.pdf · Hung, and H.M. Kuerer. Quality control and peak finding for proteomics data collected from

33

[12] Y. Yasui, M. Pepe, M.L. Thompson, B.-L. Adam, G.L. Wright Jr., Y. Qu, J.D. Potter, M. Winget,M. Thornquist, and Z. Feng. A data-analytic strategy for protein biomarker discovery: profilinghigh-dimensional proteomic data for cancer detection. Biostatistics, 4(3):449–463, 2003.