De Novo Sequencing and Homology Searching with De Novo Sequence Tags


Page 1: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

De Novo Sequencing and Homology Searching with De Novo Sequence Tags

Page 2: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

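Possible Ways to Interpret MS/MS Data

[Diagram: three numbered routes from MS/MS spectra to peptide identifications]

• database search against a protein DB, yielding peptides
• de novo sequencing, yielding peptides
• de novo sequencing followed by homology search against an inexact protein DB, yielding homologous peptides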

Page 3: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Why Bother?

• De novo sequencing derives the sequence without looking into a database.

• De novo sequencing is useful for
– unsequenced genomes (no protein database)
– novel peptides (unmatched spectra after database search)
– single amino acid polymorphisms
– unexpected PTMs
– database errors
– validating a database match

Page 4: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Outline

• Basics
• Manual De Novo Sequencing
• De Novo Sequencing Algorithm (PEAKS)
• Homology Search with De Novo Tags

Page 5: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Sequence-specific fragment ions

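[Figure: fragmentation scheme for the doubly protonated precursor, (M+2H)2+. Cleavage at a protonated peptide bond, [N-term]-NH-CHR-CO-NH2+-[C-term], gives a b-ion, [N-term]-NH-CHR-CO+, and a y-ion, NH2-[C-term] carrying H+. The b-ion can lose CO to give an a-ion, [N-term]-NH=CHR+.]

• a-, b-, and y-ions may each also appear with a loss of NH3 or H2O.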

Page 6: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Non-sequence-specific fragmentations

Page 7: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Why does everyone analyze positively-charged tryptic peptides?

• Usually better sensitivity from positively-charged peptide ions.

• “Mobile protons” protonate peptide bonds and promote b/y fragmentation

• Arg sequesters protons in the gas phase
• Tryptic peptides typically have 0 or 1 Arg
• Tryptic peptide ions typically have two protons
• Therefore, tryptic peptides usually have b/y ions

• Placing Arg at the C-terminus makes it more likely that a complete series of y-ions will be observed.

Page 8: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

MS/MS spectrum of doubly-charged tryptic peptide (one Arg and two protons)

[Figure: TOF MS/MS spectrum (ES+, precursor m/z 464.26) of the peptide YLYEIAR from a database search test mix; x-axis m/z (50 to 850), y-axis relative abundance (%). Labeled peaks: y1 to y6, a2, b2, and peaks marked L and Y.]

Page 9: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

MS/MS spectrum of a doubly-charged non-tryptic peptide (two Arg residues and two protons)

[Figure: TOF MS/MS spectrum (ES+, precursor m/z 472.76) of the peptide YSRRHPE; sample label "500 fmol BSA, Asp-N, 2%"; x-axis m/z (50 to 900), y-axis relative abundance (%). Labeled peaks: y1, y2, y3, y4, y4-17, y6-17, b4, b5-17, a5-17, (b5+18)2+, (b6+18)2+, (M+2H)2+, and peaks marked P, H, and Y.]

Page 10: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

CID in traps vs quadrupoles

[Figure: two CID MS/MS spectra of the doubly-charged peptide IPIGFAGAQGGFDTR, one acquired on an ion trap and one on a Q-TOF; x-axis m/z, y-axis relative abundance. Labeled peaks include y1 to y13, y14 (2+), b2 to b13, and the (M + 2H)2+ precursor.]

Page 11: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Annoying things to remember when sequencing peptides by MS/MS

• Leucine and isoleucine have the same mass

• Glutamine and lysine differ by 0.036 u

• Phenylalanine and oxidized methionine differ by 0.033 u

• Cleavages do not occur at every bond (more often than not, there is no cleavage between the first and second residues)

• Certain amino acids have the same mass as pairs of other amino acids: G + G = N, A + G = Q, G + V ~ R, A + D ~ W, S + V ~ W

• However: mass accuracy resolves many of these ambiguities
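The mass coincidences listed above can be checked with a few lines of arithmetic. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) and a hypothetical helper `mass`; the +15.99491 Da shift for oxidized methionine is the mass of one oxygen atom.

```python
# Monoisotopic residue masses in Da (rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
OXYGEN = 15.99491  # mass added by oxidizing methionine


def mass(residues: str) -> float:
    """Sum of residue masses for a string of one-letter codes."""
    return sum(MONO[r] for r in residues)


# Single residues that are (nearly) isobaric.
print("L vs I     :", round(abs(mass("L") - mass("I")), 3))
print("K vs Q     :", round(abs(mass("K") - mass("Q")), 3))
print("F vs M(ox) :", round(abs(mass("F") - (mass("M") + OXYGEN)), 3))

# Residue pairs that mimic a single residue.
for pair, single in [("GG", "N"), ("AG", "Q"), ("GV", "R"), ("AD", "W"), ("SV", "W")]:
    diff = abs(mass(pair) - mass(single))
    print(f"{pair[0]}+{pair[1]} vs {single}: {diff:.3f} Da")
```

The exact coincidences (G+G = N, A+G = Q) cannot be separated by any mass accuracy, whereas the approximate ones differ by 0.01 to 0.04 Da, which is why accurate fragment masses resolve many but not all of these ambiguities.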

Page 12: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Outline

• Basics
• Manual De Novo Sequencing
• De Novo Sequencing Algorithm (PEAKS)
• Homology Search with De Novo Tags

Page 13: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Two approaches to manually sequencing peptides from MS/MS spectra

1. Finding a series of ions in the middle of the peptide, and working out towards the termini (illustrated using ion trap data)

2. Finding the C-terminus and working towards the N-terminus (illustrated using qtof data)

Page 14: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Sequencing from the middle: look for ion series in the region above the precursor ion (m/z 615)

Page 15: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

An obvious series is the one that involves the more abundant fragment ions (m/z 575, 688, 775, 888, and 987)

L S L V

Page 16: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Another ion series contains pairs separated by 18 Da (water losses)

[Figure annotations: five peak pairs, each separated by 18 Da; the series reads L V E S.]

Page 17: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Two ion series have been identified in the region above the precursor ion

Problem: Two ion series defining partial sequences LSLV and LVES have been identified, but it is not known if these are y- or b-ions (i.e., the sequence direction is unknown).

Solution: Since ion trap data often exhibits high mass b-ions, check to see if the highest mass ion in either series corresponds to a loss of either Arg or Lys (the usual tryptic C-terminus). If not, check to see if the mass difference corresponds to a dipeptide containing Lys or Arg (it is possible that the b-ion defining the C-terminus is absent).

Calculation: Peptide MW – 17 – fragment ion = C-terminal residue mass
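The calculation above is easy to script. This is a minimal sketch using nominal (integer) residue masses; the function name `c_terminal_candidates` is hypothetical, and the two test values (987 and 1083, with peptide MW 1228) are the highest-mass ions of the two series in this example.

```python
# Nominal residue masses (Da). I is omitted because it equals L.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def c_terminal_candidates(peptide_mw: int, highest_ion: int):
    """Test whether the highest-mass ion could be a b-ion of a tryptic peptide.

    Returns (leftover, singles, dipeptides):
      leftover   = peptide MW - 17 - fragment ion
      singles    = K and/or R if the leftover equals that residue mass
      dipeptides = (K or R, other residue) pairs whose masses sum to the leftover
    """
    leftover = peptide_mw - 17 - highest_ion
    singles = [r for r in "KR" if NOMINAL[r] == leftover]
    dipeptides = [
        (r, x) for r in "KR" for x in NOMINAL
        if NOMINAL[r] + NOMINAL[x] == leftover
    ]
    return leftover, singles, dipeptides


print(c_terminal_candidates(1228, 987))   # first series
print(c_terminal_candidates(1228, 1083))  # second series
```

An empty result for both lists means the series is unlikely to be a b-ion series ending at a tryptic C-terminus.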

Page 18: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

For the first ion series: 1228 – 17 – 987 = 224 Da

L S L V

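224 – 128 = 96 and 224 – 156 = 68. Neither remainder is an amino acid residue mass, so 224 is neither Lys nor Arg, nor a dipeptide containing one of them. Therefore this does not look like a b-ion series.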

Page 19: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

For the second ion series: 1228 – 17 – 1083 = 128 (the residue mass of Lys); this looks like a b-ion series, and maybe the other one is a y-ion series.

[Figure annotations: peak pairs separated by 18 Da; with the C-terminal Lys the series reads L V E S K.]

Page 20: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

The high mass b-series predicts the presence of some low mass y-ions; are they there?

b-series: …LVESK

y1: 147 No
y2: 234 Yes
y3: 363 Yes
y4: 462 Yes
y5: 575 Yes!!

[Figure legend: b-ions, y-ions]

y-ion m/z = sum of the residue masses plus 19 Da (18 for water plus 1 for the proton)

Page 21: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

The high mass y-series predicts the presence of some low mass b-ions; are they there?

[Figure legend: b-ions, y-ions]

y-series: [242]VLSL…

b2: 243 Yes
b3: 342 Yes
b4: 455 Yes
b5: 542 Yes
b6: 655 Yes!!
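Both ladder checks are cumulative sums of residue masses: singly charged y-ions add 19 Da to the running sum from the C-terminus, and singly charged b-ions add 1 Da to the running sum from the N-terminus. A minimal sketch with nominal masses follows; the helper names `y_ions` and `b_ions` are hypothetical.

```python
from itertools import accumulate

# Nominal residue masses (Da) for the residues used in this example.
NOMINAL = {"L": 113, "V": 99, "E": 129, "S": 87, "K": 128}


def y_ions(c_terminal_sequence: str) -> list[int]:
    """Singly charged y-ion m/z values (y1 first) for a C-terminal sequence."""
    masses = [NOMINAL[r] for r in reversed(c_terminal_sequence)]
    return [total + 19 for total in accumulate(masses)]


def b_ions(prefix_masses: list[int]) -> list[int]:
    """Singly charged b-ion m/z values for a list of N-terminal segment masses."""
    return [total + 1 for total in accumulate(prefix_masses)]


# y1..y5 predicted from the b-series sequence ...LVESK
print(y_ions("LVESK"))

# b2..b6 predicted from the y-series [242]VLSL...; the unresolved
# two-residue segment enters as a single mass of 242.
print(b_ions([242, NOMINAL["V"], NOMINAL["L"], NOMINAL["S"], NOMINAL["L"]]))
```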

Page 22: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Can I account for most of the remaining ions as neutral losses or internal fragments?

[Figure legend: b-ions, y-ions, neutral loss]

[242]VLSLLVESK

242 = N+Q, N+K, L+E
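The candidate compositions of the unresolved 242 Da segment can be enumerated by brute force. This is a small sketch with nominal masses; `pairs_with_mass` is a hypothetical helper, and isoleucine is left out because it cannot be told apart from leucine.

```python
from itertools import combinations_with_replacement

# Nominal residue masses (Da). I is omitted because it equals L.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def pairs_with_mass(target: int) -> list[tuple[str, str]]:
    """Unordered residue pairs whose nominal masses sum to the target."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(sorted(NOMINAL), 2)
        if NOMINAL[a] + NOMINAL[b] == target
    ]


print(pairs_with_mass(242))
```

Each pair stands for both residue orders, so the segment has six possible readings at nominal mass.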

Page 23: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Two approaches to manually sequencing peptides from MS/MS spectra

1. Finding a series of ions in the middle of the peptide, and working out towards one of the termini (illustrated using ion trap data)

2. Finding the C-terminus and working towards the N-terminus (illustrated using qtof data)

Pages 24 to 32: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

[These slides contain figures only; no text was extracted.]

Outline

• Basics
• Manual De Novo Sequencing
• De Novo Sequencing Algorithm (PEAKS)
• Homology Search with De Novo Tags

Page 33: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Algorithm Design

• The first step in algorithm design is to define the property that the solution must have.

• For the de novo sequencing problem, one wants to compute a peptide that “best matches” the given spectrum.

• This “best match” is practically defined by a scoring function.

Page 34: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Peptide-Spectrum Match Score

[Diagram: a peptide split at a cleavage site into a prefix and a suffix]

• A fragment score can be computed for the cleavage site between every two adjacent amino acids. This score depends on the presence of the corresponding b and y ions.

• The peptide score is the sum of the fragment scores.

Page 35: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

The Fragment Score for a Mass

[Diagram: a peptide split into a prefix of mass $m$ and a suffix of mass $M - m$]

• The fragment score calculation only requires the prefix mass, not the sequence.
• Note: suffix mass = total residue mass – prefix mass, i.e. $M - m$.

• Thus it is possible to define a score $f(m)$ for each prefix mass value $m$.

Page 36: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

How to Define the Fragment Score $f(m)$ Statistically

• Learn two probabilities from large training data:
– $p$: Prob(a peak is observed at a y-ion m/z).
– $q$: Prob(a peak is observed at a random m/z).
– Usually $p > q$.

• If an expected y-ion is observed, $\log\frac{p}{q}$ is added to $f(m)$.
– $\log\frac{p}{q}$ is called the log-likelihood ratio.

• If an expected y-ion is missing, $\log\frac{1-p}{1-q}$ is added to $f(m)$.
• Thus, a matching ion is rewarded and a missing ion is penalized.
• Other fragment ion types can be considered similarly.
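The reward and penalty terms can be written as one small function. In this sketch the function name `ion_evidence` and the probabilities 0.8 and 0.1 are made-up illustrations, not values learned from training data.

```python
import math


def ion_evidence(observed: bool, p: float, q: float) -> float:
    """Contribution of one expected ion to the fragment score.

    p: probability of seeing a peak at a true ion m/z
    q: probability of seeing a peak at a random m/z
    """
    if observed:
        return math.log(p / q)            # log-likelihood ratio, positive when p > q
    return math.log((1 - p) / (1 - q))    # negative when p > q


# Hypothetical probabilities, for illustration only.
P_Y_ION, Q_RANDOM = 0.8, 0.1

reward = ion_evidence(True, P_Y_ION, Q_RANDOM)
penalty = ion_evidence(False, P_Y_ION, Q_RANDOM)
print(f"observed y-ion: {reward:+.3f}")
print(f"missing y-ion : {penalty:+.3f}")
```

With $p > q$ the first value is positive and the second is negative, which is the reward and penalty behaviour described above.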

Page 37: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

De Novo Sequencing

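• For a sequence $P = a_1 a_2 \cdots a_n$ with prefix masses $m_1, m_2, \ldots, m_n$, the peptide score is defined as

$$\mathrm{Score}(P) = \sum_{i=1}^{n} f(m_i)$$

where $f$ is the fragment score.

• De Novo Sequencing: Given a scoring function $f$ and a mass $M$, compute a sequence $P$ with total residue mass $M$ that maximizes $\mathrm{Score}(P)$.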

Page 38: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

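Algorithm Idea

• $\mathrm{BestScore}(m)$: the maximum score that can be achieved by a prefix with mass $m$.

[Diagram: the best sequence $P$ for mass $m$ consists of a prefix $P'$ of mass $m - m(a)$ followed by a last residue $a$ of mass $m(a)$.]

• If $P = P'a$ is the best sequence for $m$, then $P'$ must be the best sequence for $m - m(a)$.
• Thus, $\mathrm{BestScore}(m) = \max_a \mathrm{BestScore}(m - m(a)) + f(m)$, where $f(m)$ is the fragment score at prefix mass $m$.
• To compute $\mathrm{BestScore}(m)$, try all 20 residues and use the one that maximizes the above formula.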

Page 39: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Dynamic Programming

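• The algorithm initializes $\mathrm{BestScore}(0) = 0$ and all other cells to $-\infty$.
• Then it computes $\mathrm{BestScore}(m)$ for $m$ from 1 to $M$ by $\mathrm{BestScore}(m) = \max_a \mathrm{BestScore}(m - m(a)) + f(m)$.
• The best sequence can be retrieved by backtracking, repeatedly computing the last amino acid $a$.

[Diagram: the BestScore table with cells indexed 0, 1, 2, 3, ..., $M$; the cell at $m$ is filled from the cell at $m - m(a)$.]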
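The table-filling and backtracking steps can be sketched in a few lines. This is a toy version under stated assumptions, not the PEAKS implementation: masses are nominal integers, I and Q are left out because they collide with L and K at nominal mass, and the fragment score `toy_f` simply rewards prefix masses that were "observed".

```python
import math

# Nominal residue masses (Da). I and Q are omitted (same nominal mass as L and K).
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def de_novo(total_mass: int, f):
    """Return (sequence, score) maximizing the sum of f over prefix masses."""
    best = [-math.inf] * (total_mass + 1)   # BestScore table
    last = [None] * (total_mass + 1)        # last residue of the best prefix
    best[0] = 0.0
    for m in range(1, total_mass + 1):
        for residue, mass in RESIDUE_MASS.items():
            if mass <= m and best[m - mass] > -math.inf:
                candidate = best[m - mass] + f(m)
                if candidate > best[m]:
                    best[m], last[m] = candidate, residue
    if best[total_mass] == -math.inf:
        return None                          # no residue combination has this mass
    # Backtrack from M to 0, collecting the last residue at each step.
    sequence, m = [], total_mass
    while m > 0:
        sequence.append(last[m])
        m -= RESIDUE_MASS[last[m]]
    return "".join(reversed(sequence)), best[total_mass]


# Toy fragment score: +1 if the prefix mass was "observed", -1 otherwise.
observed_prefix_masses = {99, 228, 315, 443}   # prefix masses of V, VE, VES, VESK


def toy_f(m: int) -> float:
    return 1.0 if m in observed_prefix_masses else -1.0


print(de_novo(443, toy_f))
```

The running time is proportional to $M$ times the alphabet size, which is why adding a few more residue types to the alphabet only costs a constant factor.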

Page 40: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

A Note on PTM

• Variable PTMs do not cause a major slowdown for de novo sequencing algorithms.
– Instead of trying only the 20 regular amino acids in the maximization, the algorithm simply tries all modified amino acids too.
– The time complexity is increased by a constant factor. (Compare this to the exponential growth in the database search approach.)

• However, since the solution space is larger when many variable PTMs are allowed, the accuracy of the algorithm is reduced.
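In code, allowing a variable PTM amounts to adding entries to the residue table that the maximization loops over. The sketch below is an illustration with nominal masses; oxidized methionine (nominal +16 Da) is chosen as the example modification, and the label `M(ox)` is a hypothetical naming convention.

```python
# Nominal residue masses (Da) for the 20 regular amino acids.
BASE_ALPHABET = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

# Variable modifications are just extra "residues" with shifted masses.
VARIABLE_PTMS = {"M(ox)": BASE_ALPHABET["M"] + 16}   # oxidation adds one oxygen

extended = {**BASE_ALPHABET, **VARIABLE_PTMS}

print(len(BASE_ALPHABET), "->", len(extended), "choices per table cell")
# The larger alphabet also creates new ambiguities at nominal mass:
print([name for name, mass in extended.items() if mass == extended["M(ox)"]])
```

The second line of output shows why accuracy drops as more modifications are allowed: the modified residue shares its nominal mass with phenylalanine, so more sequences explain the same peaks.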

Page 41: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Accounting for Other Ion Types

• When internal cleavage ions are considered in the scoring function, it becomes difficult to design an efficient algorithm to find the optimal sequence.

• A compromise between efficiency and accuracy is to employ a two-stage approach.
– First, compute many (e.g. 10,000) sequences using an efficient score function that uses only a few of the most important ions.
– Then, evaluate these candidates using a more sophisticated scoring function that considers additional ions.

• This two-round approach is a tradeoff between the algorithm speed and accuracy.
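The two-stage idea can be sketched as "shortlist with a cheap score, then rescore". Note the simplification: here the first stage filters a supplied pool of candidates, whereas a real first stage would generate its candidates with the efficient algorithm. The function `two_stage` and both toy scoring functions are hypothetical.

```python
import heapq


def two_stage(candidates, fast_score, full_score, keep=10_000):
    """Shortlist the `keep` best candidates by fast_score, then pick the
    shortlisted candidate with the highest full_score."""
    shortlist = heapq.nlargest(keep, candidates, key=fast_score)
    return max(shortlist, key=full_score)


# Toy example: strings stand in for candidate peptides.
pool = ["AAB", "ABB", "BBB", "AAA"]
cheap = lambda s: s.count("A")       # stand-in for a score using few ion types
expensive = lambda s: s.count("B")   # stand-in for a score using more ion types

print(two_stage(pool, cheap, expensive, keep=2))
```

In the toy run the candidate with the best second-stage score never reaches the shortlist, which is exactly the accuracy that is traded away for speed.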

Page 42: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Mass Segment Error

• Most errors are due to incomplete ion ladders in the spectrum.
– Thus, a segment of amino acids cannot be determined.
– However, the total mass of the segment is fixed.
– E.g. [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E

• The first two or three residues often have low confidence, because of a lack of fragment ions.

• Most de novo sequencing software uses the precursor mass as a constraint (thus the mass of the derived sequence is usually correct).

Page 43: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Outline

• Basics
• Manual De Novo Sequencing
• De Novo Sequencing Algorithm
• Homology Search with De Novo Tags

Page 44: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Why Homology Search with De Novo Sequence

• Advantages:
– The database may not contain the exact peptide sequence, but a homologous one is there.
– De novo + homology search is well suited to using the database of one organism to study a similar organism.

• Disadvantages:
– De novo sequencing can only provide partially correct sequence tags.
– Conventional homology search may fail due to de novo sequencing errors.

Page 45: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Traditional Sequence Alignment

• Two peptide sequences are aligned by inserting spaces at appropriate positions. E.g.

FVEVTKL-TDLTK
| || || |||||
FAEV-KLVTDLTK

• The pair of symbols in each column (residues or gaps, ‘-’) has a similarity score that can be looked up in a pre-defined amino acid substitution matrix, such as BLOSUM or PAM.

• The alignment score is equal to the sum of the column-wise scores.

• There are algorithms to compute the optimal alignment that maximizes the alignment score.
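One such algorithm is the classic dynamic-programming global alignment (Needleman-Wunsch). The sketch below computes only the optimal score and, as a simplifying assumption, replaces the BLOSUM/PAM lookup with a flat scheme of +1 for a match, -1 for a mismatch and -1 per gap.

```python
def global_alignment_score(s: str, t: str, match=1, mismatch=-1, gap=-1) -> int:
    """Optimal global alignment score of s and t under a flat scoring scheme."""
    # prev[j] = best score of aligning the processed prefix of s with t[:j]
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            substitution = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(
                prev[j - 1] + substitution,  # align s[i-1] with t[j-1]
                prev[j] + gap,               # s[i-1] against a gap
                cur[j - 1] + gap,            # t[j-1] against a gap
            ))
        prev = cur
    return prev[-1]


print(global_alignment_score("FVEVTKLTDLTK", "FAEVKLVTDLTK"))
```

For the two example peptides the alignment shown above has 10 matches, 1 mismatch and 2 gaps, giving a score of 7 under this flat scheme, compared with 4 for the ungapped alignment.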

Page 46: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Limitations of Conventional Homology Search

• Conventional search ignores the possible errors in de novo sequencing.

• Suppose the true sequence is SLCAFK, the de novo sequence is LSCFAK, and the homolog is SLAAFK.

(denovo)  X: LSCFAK
                  |
(homolog) Z: SLAAFK

Conventional search, which uses only evolutionary similarities to explain the mismatches, results in a poor match.

(denovo)  X: [LS]C[FA]K
(real)    Y: [SL]C[AF]K
              ||   || |
(homolog) Z: [SL]A[AF]K

If de novo sequencing errors are considered, the match becomes more significant.

Page 47: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

A Simple Approach

• We can enumerate all possible combinations of a mass segment, and search all of them together.
– MS BLAST will do this.

• Difficulties:
– It is not known which portion of the sequence is in error.
– Exponential growth of possibilities.

[LS]C[FA]K expands to:

LSCFAK
SLCFAK
TVCFAK
VTCFAK
LSCAFK
SLCAFK
TVCAFK
VTCAFK
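The expansion of [LS]C[FA]K can be reproduced by replacing each bracketed segment with every ordered residue pair of the same mass. This is a sketch under several assumptions: segments are exactly two residues long, masses are monoisotopic with a 0.01 Da tolerance, isoleucine is folded into leucine, and the helper names are hypothetical. It is not a description of how MS BLAST works internally.

```python
import itertools
import re

# Monoisotopic residue masses (Da). I is omitted because it equals L.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def same_mass_pairs(segment: str, tol: float = 0.01) -> list[str]:
    """All ordered residue pairs whose mass matches a two-residue segment."""
    target = sum(MONO[r] for r in segment)
    return [a + b for a in MONO for b in MONO
            if abs(MONO[a] + MONO[b] - target) <= tol]


def expand(tag: str) -> list[str]:
    """Expand a tag such as '[LS]C[FA]K' into all same-mass sequences."""
    parts = re.findall(r"\[([A-Z]+)\]|([A-Z])", tag)
    options = [same_mass_pairs(seg) if seg else [single] for seg, single in parts]
    return ["".join(choice) for choice in itertools.product(*options)]


candidates = expand("[LS]C[FA]K")
print(len(candidates), sorted(candidates))
```

Each uncertain segment multiplies the number of query sequences, so the list grows exponentially with the number of segments, and it would grow further with a looser mass tolerance (for example, E+A is within 0.04 Da of L+S).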

Page 48: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

SPIDER Model

• Given a de novo sequence X and a database sequence Z, try to reconstruct the real sequence Y.
– The difference between X and Y is explained by de novo sequencing errors.
– The difference between Y and Z is explained by homology mutations.

• The real Y should minimize the de novo errors and the homology mutations needed in the above explanation.

(de novo) X: [LS]C[FA]K
(real)    Y: [SL]C[AF]K
              ||   || |
(homolog) Z: [SL]A[AF]K
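The "minimize both kinds of difference" idea can be illustrated with a brute-force toy. This is not the SPIDER algorithm: it assumes the uncertain segments and their same-mass alternatives are already known, handles only equal-length sequences without gaps, and uses made-up unit costs for a de novo error and for a mutation.

```python
import itertools

# Same-mass alternatives for each uncertain segment (taken from the example above).
SEGMENT_OPTIONS = {
    "LS": ["LS", "SL", "TV", "VT"],
    "FA": ["FA", "AF"],
}


def reconstruct(x_parts, z, error_cost=1, mutation_cost=1):
    """Pick the Y that minimizes de novo errors (X vs Y) plus mutations (Y vs Z).

    x_parts: the de novo result split into parts, e.g. ["LS", "C", "FA", "K"];
             parts listed in SEGMENT_OPTIONS are treated as uncertain.
    z:       the database (homolog) sequence, same length as the joined parts.
    """
    options = [SEGMENT_OPTIONS.get(part, [part]) for part in x_parts]
    best = None
    for choice in itertools.product(*options):
        y = "".join(choice)
        errors = sum(c != p for c, p in zip(choice, x_parts))   # segments changed
        mutations = sum(a != b for a, b in zip(y, z))           # residue mismatches
        cost = error_cost * errors + mutation_cost * mutations
        if best is None or cost < best[0]:
            best = (cost, y)
    return best


print(reconstruct(["LS", "C", "FA", "K"], "SLAAFK"))
```

On the example, taking X at face value costs five mutations against Z, while explaining two segments as de novo errors leaves a single C-to-A mutation.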

Page 49: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Two exercises

(denovo)  X: LSCFV
(real)    Y: EACFV
(homolog) Z: DACFV

m(LS) = m(EA) = 200.1 Da

(denovo)  X: LSCFAV
(real)    Y: SLCFAV
(homolog) Z: SLCF-V

[Figure label: blosum62]

• The swap of L and S is more likely a de novo error than a mutation.
• The deletion of A is unlikely to be a de novo error (de novo sequencing does not change the peptide mass).

• Mutations and de novo errors can overlap. This is hard to interpret manually; an algorithm is needed.

Page 50: De Novo Sequencing and Homology  Searching  with De Novo Sequence  Tags

Conclusion

• When the target peptides are not in a database
– De novo sequencing

• When the homologous peptides are in a database
– Homology search with the de novo tags can find them
– Some de novo errors can be corrected by combining the homolog information