Improving the Sensitivity of Peptide Identification for Genome Annotation


Nathan Edwards

Department of Biochemistry and Molecular & Cellular Biology

Georgetown University Medical Center

2

Why Tandem Mass Spectrometry?

MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.

Key concepts:
• Spectrum acquisition is unbiased
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations

3

Mass Spectrometer

Sample → Ionizer → Mass Analyzer → Detector

Ionizer:
• MALDI
• Electrospray Ionization (ESI)

Mass Analyzer:
• Time-of-Flight (TOF)
• Quadrupole
• Ion Trap

Detector:
• Electron Multiplier (EM)

4

Mass Spectrum

5

Mass is fundamental

6

Sample Preparation for MS/MS

Enzymatic Digest and Fractionation
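The digest step can be imitated in software. The sketch below is a minimal in-silico digest that assumes trypsin (the slides do not name the enzyme) and the common rule of cleaving after K or R unless the next residue is P. The function name and its parameters are illustrative, not taken from any tool mentioned in this talk.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_length=6):
    """Return peptides from an assumed trypsin rule: cut after K/R, not before P."""
    # Zero-width split after K or R when the next residue is not P.
    # Requires Python 3.7+, where re.split accepts empty matches.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` extra adjacent pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("AAKPGGRCCKDD", missed_cleavages=1, min_length=1))
```

Allowing missed cleavages enlarges the peptide list, which is one reason a thorough search takes longer.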

7

Single Stage MS

MS

8

Tandem Mass Spectrometry (MS/MS)

Precursor selection

9

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

MS/MS
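To make the link between an MS/MS spectrum and an amino-acid sequence concrete, here is a rough sketch that computes singly charged b- and y-ion m/z values for an unmodified peptide. It assumes the standard monoisotopic residue masses and the usual b/y fragment model for CID; the function and constant names are illustrative, and no search engine named in this talk is claimed to work exactly this way.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for charge 1+, one entry per cleavage site."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: N-terminal residues plus one proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: C-terminal residues plus water plus one proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    print([round(m, 3) for m in b])
    print([round(m, 3) for m in y])
```

Because L and I share a residue mass, fragment masses alone cannot tell them apart, while most other single-residue changes shift the ladder.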

10

Unannotated Splice Isoform

• Human Jurkat leukemia cell line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller et al., MCP 2003
• LIME1 gene: LCK-interacting transmembrane adaptor 1
• LCK gene: leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias
• Multiple significant peptide identifications

11


12


13

Splice Isoform Anomaly

• Human erythroleukemia K562 cell line
  • Depth of coverage study
  • Resing et al., Anal. Chem. 2004
  • Peptide Atlas A8_IP
• SULT1A2 gene: sulfotransferase family, cytosolic, 1A
  • 2 ESTs, 1 mRNA
  • mRNA from lung, small-cell carcinoma sample
• Single (significant) peptide identification
  • Five agreeing search engines
  • PepArML FDR < 1%
• All source engines have non-significant E-values

14


15


16

Translation start-site correction

• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo et al., MCP 2003
• GdhA1 gene: glutamate dehydrogenase A1
• Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)
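One simple way to flag a questionable start site is to look for identified peptides that map upstream of the annotated start. The sketch below assumes an ORF translated from an upstream stop codon (here called the extended ORF) and exact string matching; the function name and inputs are hypothetical, and this covers only one kind of inconsistency.

```python
def upstream_peptides(extended_orf, annotated_start, peptides):
    """Return (peptide, offset) pairs that begin before the annotated start.

    extended_orf:    amino-acid translation starting upstream of the annotation
    annotated_start: 0-based offset of the annotated start within extended_orf
    peptides:        identified peptide sequences
    """
    hits = []
    for peptide in peptides:
        # Only the first exact occurrence is considered in this sketch.
        offset = extended_orf.find(peptide)
        if offset != -1 and offset < annotated_start:
            hits.append((peptide, offset))
    return hits


if __name__ == "__main__":
    orf = "MSLLPEPTIDEKMAAGHK"
    start = orf.index("MAAGHK")
    print(upstream_peptides(orf, start, ["PEPTIDEK", "AAGHK"]))
```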

17

Halobacterium sp. NRC-1 ORF: GdhA1

• K-score E-value vs. PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: axis from 0 to 440]

18

Translation start-site correction

19

Phyloproteomics

Tandem mass-spectra of proteins (top-down)

High-accuracy instrument (Orbitrap, UMD Core)

Proteins from unsequenced bacteria matching identical proteins in related organisms

Demonstration using Y. rohdei.

20

Phyloproteomics

21

Phyloproteomics

phylogeny.fr – "One-Click"

[Trees: Protein Sequence; 16S-rRNA Sequence]

22

Shared "Biomarker" Proteins

23

Phyloproteomics

• Recent extension to highly homologous proteins in related organisms
  • Merely require N- and/or C-terminus in common
  • Broadens applicability considerably
• Phyloproteomic trees for E. herbicola and Enterobacter cloacae, neither sequenced.

New paradigm for phylogenetic analysis?

24

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

25

Searching under the street-light…

Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

26

Peptide Sequence Databases

• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold size, search time reduction
• Formatted as a FASTA sequence database
• One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490
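The "no redundancy" idea can be illustrated by collecting each distinct k-mer only once across many overlapping sequences. This is a minimal sketch under that assumption: it only enumerates distinct 30-mers and does not reproduce how the actual databases are compressed or formatted. The function name is illustrative.

```python
def distinct_kmers(sequences, k=30):
    """Return the set of distinct length-k substrings across all sequences."""
    seen = set()
    for seq in sequences:
        # Sequences shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


if __name__ == "__main__":
    # Overlapping toy sequences, with k=3 for readability.
    seqs = ["ACDEF", "CDEFG", "ACDEF"]
    total = sum(max(len(s) - 3 + 1, 0) for s in seqs)
    print(total, "k-mer occurrences,", len(distinct_kmers(seqs, k=3)), "distinct")
```

Highly overlapping sources such as ESTs repeat the same k-mers many times, which is where the size reduction comes from.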

27

We can observe evidence for…

• Known coding SNPs
• Unannotated coding mutations
• Alternate splicing isoforms
• Alternate/Incorrect translation start-sites
• Microexons
• Alternate/Incorrect translation frames

…though it must be treated thoughtfully.

28

PeptideMapper Web Service

I’m Feeling Lucky

29



31


• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • “Compression” makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC “known-gene” transcripts, and
  • Predetermined, embedded genomic coordinates
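Peptide to gene/cluster mapping can be sketched with a much simpler structure than a suffix tree. The example below swaps in a sorted list of suffixes with binary search, which gives exact-match lookup only and uses memory quadratic in sequence length, so it is a toy stand-in rather than the PeptideMapper index. All names are illustrative.

```python
from bisect import bisect_left


def build_index(clusters):
    """Build a sorted list of (suffix, cluster_id) pairs.

    clusters: dict mapping cluster id -> amino-acid sequence
    """
    return sorted(
        (seq[i:], cid)
        for cid, seq in clusters.items()
        for i in range(len(seq))
    )


def lookup(index, peptide):
    """Return the set of cluster ids whose sequence contains the peptide exactly."""
    # All suffixes that start with the peptide are contiguous in sorted order.
    i = bisect_left(index, (peptide, ""))
    found = set()
    while i < len(index) and index[i][0].startswith(peptide):
        found.add(index[i][1])
        i += 1
    return found


if __name__ == "__main__":
    idx = build_index({"g1": "MKTAYIAK", "g2": "GGTAYI"})
    print(sorted(lookup(idx, "TAYI")))
```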

32

Comparison of search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

Searle et al. JPR 7(1), 2008

[Venn diagram: X! Tandem, SEQUEST, Mascot; region percentages 38%, 14%, 28%, 14%, 3%, 2%, 1%]

33

Combining search engine results – harder than it looks!

• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply unsupervised machine-learning....
  • Lots of related work unified in a single framework.
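The trade-offs above show up even in a naive voting scheme. The sketch below is not the PepArML method; it is a hypothetical threshold-and-vote combiner, with made-up parameter names (`max_evalue`, `min_votes`), meant only to show how consensus, disagreement, and abstention interact.

```python
from collections import Counter


def vote(results, max_evalue=0.05, min_votes=2):
    """Combine per-engine results by simple voting.

    results: dict engine -> dict spectrum_id -> (peptide, evalue)
    Returns dict spectrum_id -> peptide, or None when no peptide gets enough votes.
    """
    spectra = set()
    for hits in results.values():
        spectra.update(hits)
    calls = {}
    for sid in sorted(spectra):
        # An engine abstains if it has no hit or its hit is above the threshold.
        votes = Counter(
            hits[sid][0]
            for hits in results.values()
            if sid in hits and hits[sid][1] <= max_evalue
        )
        if not votes:
            calls[sid] = None
            continue
        peptide, count = votes.most_common(1)[0]
        calls[sid] = peptide if count >= min_votes else None
    return calls


if __name__ == "__main__":
    demo = {
        "engineA": {"s1": ("PEPTIDEK", 0.001), "s2": ("AAGHK", 0.2)},
        "engineB": {"s1": ("PEPTIDEK", 0.01), "s2": ("AAGHK", 0.01)},
    }
    print(vote(demo))
```

Raising `min_votes` gains specificity and loses sensitivity; weak hits discarded by the threshold never get a chance to reinforce each other, which is part of the motivation for a learned combiner.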

34

Supervised Learning

35

Unsupervised Learning

36

Peptide Atlas A8_IP LTQ Dataset

37

Running many search engines

• Search engine configuration can be difficult:
  • Correct spectral format
  • Search parameter files and command-line
  • Pre-processed sequence databases
• Tracking spectrum identifiers
• Extracting peptide identifications, especially modifications and protein identifiers

38

Peptide Identification Meta-Search

• Simple unified search interface for:
  • Mascot, X!Tandem, K-Score, OMSSA, MyriMatch
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
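Two of the automated steps above are easy to sketch. The example assumes spectra in MGF text format (BEGIN IONS / END IONS blocks) and decoys made by reversing each protein sequence; neither the file format nor the decoy strategy is stated on the slide, and the function names and the `DECOY_` prefix are hypothetical.

```python
def reversed_decoys(fasta_records, prefix="DECOY_"):
    """Return decoy (name, sequence) pairs by reversing each target sequence."""
    return [(prefix + name, seq[::-1]) for name, seq in fasta_records]


def chunk_spectra(mgf_text, chunk_size):
    """Split MGF text into lists of at most chunk_size spectra each."""
    spectra, current = [], None
    for line in mgf_text.splitlines():
        if line.strip() == "BEGIN IONS":
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "END IONS":
                spectra.append("\n".join(current))
                current = None
    # Each chunk can be dispatched as an independent search job.
    return [spectra[i:i + chunk_size] for i in range(0, len(spectra), chunk_size)]


if __name__ == "__main__":
    mgf = "BEGIN IONS\nTITLE=a\n100.0 1\nEND IONS\nBEGIN IONS\nTITLE=b\nEND IONS\n"
    print(len(chunk_spectra(mgf, 1)), "chunks")
    print(reversed_decoys([("P1", "MKT")]))
```

Keeping whole spectra together in each chunk is what lets chunks run on different processors without any coordination beyond tracking spectrum identifiers.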

39

PepArML Meta-Search Engine

• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: scheduler & 48+ CPUs
• Search engine labels in the diagram: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core)
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches


43

Peptide Identification Grid-Enabled Meta-Search

• Access to high-performance computing resources for the proteomics community
  • NSF TeraGrid Community Portal
  • University/Institute HPC clusters
  • Individual lab compute resources
  • Contribute cycles to the community and get access to others’ cycles in return.
• Centralized scheduler
  • Compute capacity can still be exclusive, or prioritized.
  • Compute client plays well with HPC grid schedulers.

44


[Diagram, repeated across build slides: UMIACS (250+ CPUs); UMD Proteomics Core (scheduler & 2 CPUs); search engine labels X!Tandem, KScore, OMSSA, MyriMatch]

48

Conclusions

Improve the scope and sensitivity of peptide identification for genome annotation, using:

• Exhaustive peptide sequence databases
• Machine-learning for combining search engine results
• Meta-search tools to maximize consensus
• Grid-computing for thorough search

http://edwardslab.bmcb.georgetown.edu

49

Acknowledgements

• Dr. Catherine Fenselau & students, University of Maryland Biochemistry
• Dr. Yan Wang, University of Maryland Proteomics Core
• Dr. Art Delcher, University of Maryland CBCB
• Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science
• Funding: NIH/NCI
