Post on 26-Dec-2015
Statistical Consideration for Identification and Quantification
in Top-Down Proteomics
Richard LeDuc National Center for Genome Analysis Support
Discovery Omics withTop Down Proteomics
Acknowledgements
• Leonid Zamdborg• Shannee Babai• Bryon Early• Ian Spauling• Kevin Glowacz• Eric Bluhm• Vinayak Viswanathan• Yong-Bin Kim• Ryan Fellers• Tom Januszyk• Brian Cis• Chris Strouse• Seyoung Sohn• Greg Taylor• Joe Sola• Lee Bynum• Andrew Birck
All the other numerous members of the KRG who have contributed insights over the years.
Drs. Neil Kelleher, Paul Thomas, and Andy Forbes, and ProSight Development Team (past and present)
Yury Bukhman, James McCurdy, Adam Halstead, Irene Ong (Area 3), Mary Lipton (PNNL), Kathryn Richmond (Enabling Technologies) and others
Proteomics CoreReid TownsendPetra GilmoreCheryl LichtiJames MaloneAlan Davis
Michael Gross (NCRR Mass Spec)Henry Rohrs (NCRR Mass Spec)Ron Bose (Oncology)Mike Boyne (FDA)Jeffry Hiken (Genetics)
Le-Shin Wu, Carrie Ganote,Tom Doak,Bill Barnett, and a cast of thousands
Limbrick LaboratoryDavid LimbrickDiego Morales
Holtzman LaboratoryDavid HoltzmanRick PerrinJacqueline PaytonChengjie Xiong (Biostatistics)
National Center for Genome Analysis Support
Washington Univ. School of Medicine
The Kelleher Research Group
Differential Omics Studies1. RNA-seq, Bottom-up
proteomics, metabolomics
2. Looking for a list of discovered entities that have different expression levels between treatments
3. Very popular for target discovery
4. Frequently done on organisms before a genome is completed
‘P score’ = Pf,n =(xf)n x e-xf
n!
f is the # of input fragment ions,
n is the # of matches,
Ma is the Mass Accuracy
2211.111
1 aMx
1
0
1n
ifn
i
nifncrude ppp
F. Meng, B. Cargile, L. Miller, J. Johnson, and N. Kelleher, Nat. Biotechnol., 2001, 19, 952-957.
“Kelleher P-Score” Example
Top Down Proteomics!• Three pillars of proteomics:
• Identification• Characterization• Quantification.
• Top down proteomic studies are underway.
• These are large and complex studies(At several institutions, a typical production bottom-up study would have 200+ LC runs)
Top Down Proteomics BiometricsSources of
1. Intensity calculation
2. LC alignment
3. Mass Spec Physics
4. SeparationDifferent fractions etc.
5. Protein IsolationChIP, RBC ghosts etc
6. Tissue variation
7. Individual variation
8. Population variation
9. Random and systemic errors
Experimental Design
• Ronald A. Fisher (1926) : "The Arrangement of Field Experiments“
• All measurements have errors• All biological systems have
individual variation
• The goal of experimental design is to design the experiment so that the variation can be partitioned
• Typically testing variation between groups against the variation within
Healthy Group
Sub 1 Sub 2
R1 R2 R3 R4 R5 R7R6 R8
Diseased Group
Sub 1 Sub 2
R1 R2 R3 R4 R5 R7R6 R8
Control Samples PNH Samples
1
0
-1
-2
-3
3
2
1 642 8 10 1211 133 5 7 9
RAP1A
Coomassie
Catalase
Peroxiredoxin
250150100755037
2025
2025
10075
50
CON_
1CO
N_6
PNH_
9PN
H_10
Typical Results: Human RBC Ghosts
Control Samples
PNH Samples
RAP1A
Populations of Experiments• Instead of doing 1
experiment, you are doing an unknown number of experiments
• Number of experiments determined by how many unique entities are observed consistently over the entire set of observations
Control Samples
PNH Samples
Sources of Variation: The Model
ijklijkijiijkl erdaI )()(Wherei=1 or 2 and represents the two preparations,j = 1 to 3 for each digestion within a given
preparation,k = 1 to 3 for each injection (or run) within each
digestionl = 1 to the number of peptides for the given protein.
Under this model, let
residuals theis e
npreperatio i thefrom digestion random j thefrom run random k for theeffect theis r
npreperatio i thefrom digestion random j for theeffect theis d
npreparatio random i for theeffect theis a
ijkl
thththk(ij)
ththj(i)
thi
Power Calculations
Hum
an
Sub
ject
s
Power Curves for High Subject Low Residual Variation
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.5 1 1.5 2 2.5Effect Size (in STD)
Pow
er
n=20
n=5
Inbreed Mice
Systems Analysis• What to do with the laundry lists
of significant genes?• Gene Ontology Analysis• Gene Set Enrichment Analysis
• Often paired with RNA or metabolomic data.
• Creates a third level of analysis
To Review
• Everything is in place for top-down proteomic studies.
• In any discovery omic study, extreme care must be taken – lots of pilot work to understand the behavior of your analytic system
• Technology and mathematical formalism does not trump biology. (Bad experimental design results in bad experiments)
• Funded by National Science Foundation1. Large memory clusters for assembly
2. Bioinformatics consulting for biologists
3. Optimized software for better efficiency
• Partner Institutions:• Extreme Science and Engineering Discovery Environment (XSEDE)• Texas Advanced Computing Center (TACC) at the University of Texas at
Austin• San Diego Supercomputer Center (SDSC) at the University of California, San
Diego.• Pittsburgh Supercomputing Center (PSC)
• Open for business at: http://ncgas.org
Sh
amel
ess
NC
GA
S P
lug
Questions?