Post on 30-Sep-2020

The Bioinformatics of Protein Modification

(Part 2)

Lecture 4610, Universität Basel

http://www.biozentrum.unibas.ch/lectures.html

Dr. Michael Rebhan, Friedrich Miescher Institute,

Basel, January 2006

www.fmi.ch

The Bioinformatics of Protein Modification, Michael Rebhan, FMI, 2006

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects

5. Online Materials: Exercises, Links

Part 2

Predicting modification sites:

Building Your Own Motif:

1. Building the data set

2. Alignment

3. Analysis of the alignment

4. Motif building & search

Collect all relevant sequences: your own + public

- SRS @ ExPASy: SWISSPROT

- Specialized datasets? → online materials

(PubMed, Google)

Keep in mind:
- how reliable is the data? (direct evidence?)
- importance of the sequence environment around the main motif (see part 1)
→ can reduce false positive rate

Eisenhaber et al. (2004) Proteomics 4, 1614-1625. Prediction of sequence signals for lipid post-translational modifications: Insights from case studies

Example: “C-linked (man)” in the “feature descriptions” (= C-mannosylation)

→ only those with direct experimental evidence! (is the dataset large & diverse enough?)

Features look OK → query is OK (no predictions etc.)

Now get more info, incl. sequence environment

Back to the query form:

Retrieve entry instead of feature, and display key fields in output.

Why 11? We had 49 features before?

(each entry (= protein) can carry a number of features (= modifications))

Click on the entry link… (if you’d like to include this protein)

1. Find the features you’d like to include in the data set (“training set”)

2. Click on its position to get the sequence context

3. Build the alignment in FASTA format (by copy & paste, if it’s a small set)

4. Import into alignment viewers (like Jalview, www.jalview.org)
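
For larger sets, the copy & paste in steps 2 and 3 could be scripted. The following is only a minimal sketch: the function names, the flank size, and the record layout are assumptions, and the two tryptophan positions (29 and 32) in the demo sequence used later in this lecture are picked for illustration only, not taken from a database annotation.

```python
def site_window(sequence, position, flank=5, pad="-"):
    """Return the residues around a 1-based position, padded at the termini."""
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("position outside of sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    # Pad so that every window has the same length (needed for an alignment).
    return left.rjust(flank, pad) + sequence[idx] + right.ljust(flank, pad)


def to_fasta(records, flank=5):
    """records: iterable of (name, sequence, 1-based site position)."""
    lines = []
    for name, sequence, position in records:
        lines.append(f">{name}_pos{position}")
        lines.append(site_window(sequence, position, flank))
    return "\n".join(lines) + "\n"


# Demo sequence from this lecture; site positions are illustrative assumptions.
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"

if __name__ == "__main__":
    print(to_fasta([("demo", DEMO, 29), ("demo", DEMO, 32)], flank=3))
```

The resulting text can be saved to a file and opened in an alignment viewer; because all windows are centered on the site and have equal length, no gaps need to be introduced.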

Analysis of the alignment / data set:
- any corrections needed, esp. gaps?
- is it large/diverse enough?
- sorting, try different color views

In Jalview, by conservation:
- which positions show clear constraints?
→ motif boundaries

Other constraints:
- conserved? (“BLAST”)
- secondary structure, accessibility? (Quick2D, SABLE)
… see part 1

Color: Zappo
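
The question "which positions show clear constraints?" can also be checked numerically. The sketch below uses Shannon entropy per alignment column as a simple stand-in; it is not Jalview's own conservation score, and the toy alignment, the function names, and the entropy cutoff are assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conserved_columns(alignment, max_entropy=0.5):
    """Return 0-based indices of columns whose entropy is at most max_entropy."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [i for i, column in enumerate(zip(*alignment))
            if column_entropy(column) <= max_entropy]


# Hypothetical toy alignment of site-centered windows (not real training data).
TOY_ALIGNMENT = ["HYAWTQW", "WTQWTSC", "GSEWSPC", "ATSWSEW"]

if __name__ == "__main__":
    for i, column in enumerate(zip(*TOY_ALIGNMENT)):
        print(i, "".join(column), round(column_entropy(column), 2))
    print("conserved:", conserved_columns(TOY_ALIGNMENT))
```

Columns with low entropy on both sides of the modified residue give a first idea of where the motif starts and ends.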

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

Regular expressions:

[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-[SCGPTVEDQ]-[CW]-[SGEDRANTF]

or: W-X-X-[CW] (in S-rich environment)

→ could be useful, but doesn’t impose a lot of constraints (and no scoring…)

If you’d like to use it anyway, you can scan protein databases with this motif at ScanProsite (ExPASy)…
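
To try such a pattern on a few local sequences before going to ScanProsite, it can be translated into an ordinary regular expression. This is a rough sketch and not the ScanProsite implementation: the converter below only handles the pattern elements used on this slide plus x(n) repeats and {..} exclusions, and the function names are assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        core, _, repeat = element.partition("(")
        if core.upper() == "X":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        if repeat:
            core += "{" + repeat.rstrip(")") + "}"  # x(2) -> .{2}
        parts.append(core)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Demo sequence from this lecture.
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
LONG_PATTERN = ("[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-"
                "[SCGPTVEDQ]-[CW]-[SGEDRANTF]")

if __name__ == "__main__":
    print(find_sites("W-X-X-[CW]", DEMO))
    print(find_sites(LONG_PATTERN, DEMO))
```

The lookahead is used so that overlapping matches, such as two sites next to each other, are both reported. As on the slide, a match is a yes/no answer without any score.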

ScanProsite:

→ enter pattern, options

ScanProsite results:

More: online materials

Search with the alignment using PSI-BLAST, e.g. at the Bioinformatics Toolkit (MPI Tuebingen)
→ PSSM profile (see part 1)

First against SWISSPROT to check which proteins get the highest scores

→ E value: 1000, ungapped alignment

“Validation” / filtering:
- Quick2D: secondary structure, disorder
- conservation (?)

Also: ScanSite (MIT)! (enhanced regular expressions and PSSM search)
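
To make the idea of a PSSM search more concrete, here is a simplified sketch of how a position-specific scoring matrix can be derived from an ungapped alignment and used to rank windows of a sequence. It is not what PSI-BLAST does internally (PSI-BLAST uses sequence weighting, its own pseudocounts, and real background frequencies); the uniform background, the pseudocount of 1, the toy alignment, and all names are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds scores per column, assuming a uniform background (simplification)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum of column scores; unknown characters contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))


def best_hits(pssm, sequence, top=3):
    """Return the top-scoring windows as (score, 1-based start), best first."""
    width = len(pssm)
    hits = [(score_window(pssm, sequence[i:i + width]), i + 1)
            for i in range(len(sequence) - width + 1)]
    return sorted(hits, reverse=True)[:top]


# Hypothetical toy alignment and the demo sequence from this lecture.
TOY_ALIGNMENT = ["HYAWTQW", "WTQWTSC", "GSEWSPC", "ATSWSEW"]
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"

if __name__ == "__main__":
    for score, start in best_hits(build_pssm(TOY_ALIGNMENT), DEMO):
        print(start, round(score, 2))
```

Unlike a regular expression, every window gets a score, so hits can be ranked and a cutoff can be chosen after looking at which known proteins score highest.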

SVMs: training data → function (classification / regression)

AutoMotif server (using SVMs)

Need:
- reformat sequences (with a simple replace, e.g. in WordPad)
- register at the AutoMotif site (immediate)
- submit reformatted alignment & search

For classification, SVMs operate by finding a hypersurface in the space of possible inputs. This hypersurface will attempt to split the positive examples from the negative examples. The split will be chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples.
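
Written as a formula, the largest-distance idea looks as follows for the simplest case, a linear separator on perfectly separable data (the "hard-margin" SVM). The symbols are assumptions introduced here, not taken from the slides: x_i is the feature vector of training example i (e.g. an encoded sequence window), y_i is +1 for modified and -1 for unmodified examples, and w and b define the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \ \ge\ 1, \qquad i = 1,\dots,n

% distance from the hyperplane to the nearest example (the margin): 1 / \lVert w \rVert
% classification of a new window x: f(x) = \operatorname{sign}(w \cdot x + b)
```

Minimizing the norm of w is the same as maximizing the margin 1/||w||. With a kernel function, the same construction yields a curved hypersurface in the original input space; whether and how AutoMotif does this is not covered here.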

My dataset is very small and not very diverse – anything I can do?

Collecting & aligning orthologs:

1. Check SWISSPROT for “by similarity” features, and, if that’s not enough, use myHits (SIB) to collect orthologs with considerable variation

(lots of flanking sequence, use 90% identity clustering, against SWISSPROT [and Ensembl], E values 1e-6 and 0.01 select clear hits, then “next cycle”, then align trustworthy hits)

2. Trim the alignment in Jalview (e.g. in myHits), sort by pairwise identity.

Demo with MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF
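
The 90% identity clustering and the sorting by pairwise identity can be illustrated with a small sketch. This is a simple greedy filter, not the method used by myHits or Jalview; the sequences in the example are hypothetical, and the function names and the way gaps are ignored are assumptions.

```python
def percent_identity(a, b):
    """Percent identical residues between two aligned sequences, ignoring gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def sort_by_identity(reference, others):
    """Order sequences from most to least similar to the reference."""
    return sorted(others, key=lambda s: percent_identity(reference, s),
                  reverse=True)


def drop_redundant(sequences, threshold=90.0):
    """Greedy filter: keep a sequence only if it is below the identity
    threshold to every sequence kept so far."""
    kept = []
    for seq in sequences:
        if all(percent_identity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Hypothetical aligned fragments, for illustration only.
    fragments = ["HYAWTQWTSC", "HYAWTQWTSC", "HYAWTQWTSA", "GSEWSPCAAA"]
    print(drop_redundant(fragments))
```

Removing near-identical orthologs keeps a few very close species from dominating the alignment, so that the remaining variation is more informative.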

Do all these orthologs still carry the same modification?

→ experiments!

Search: PSI-BLAST at MPI (as before)

(this example: 2 C-mannosylation sites next to each other)

Which residues are conserved?

If there are no substrates at all – anything I can do?

You have a kinase, by chance?

→ PREDIKIN: potential substrates for different kinds of kinases, based on sequence and type

→ ideas for experiments …

Brinkworth et al. (2003) PNAS 100:74

Need advice?

Ask a protein sequence analysis expert

SUMMARY: Building your own motif

• Building your own motif is not as hard as you may think

• The main issue: building a good and informative alignment!

• Motif building & search:

• Regular expressions: ScanProsite

• PSSMs: PSI-BLAST at MPI

• SVMs: AutoMotif


Protein Function Prediction:

Predicting modifications in the context of function prediction

Also:

- Protein isoforms and the prediction of modifications

- Interpretation of potential modifications, e.g. phospho-sites

What can be (reliably) predicted from the sequence alone?

• Domain architecture (and signal peptides):
→ potential molecular interactions
→ proteins with similar domain architecture

• Tertiary or secondary structure, disorder & accessibility

• Small motifs: targeting, modifications, transmembrane regions, coiled coils

• Genomic context & phylogenetic occurrence: hints on “functional interactions”

• New predictions are coming out all the time …

MARRSVLYFI LLNALINKGQ ACFCDHYAWT QWTSCSKTCN SGTQSRHRQI VVDKYYQENF CEQICSKQET RECNWQRCPI NCLLGDFGPW SDCDPCIEKQ SKVRSVLRPS QFGGQPCTEP

Protein function prediction: our sequence, alternative transcripts

How good/complete is the protein sequence we want to check?

- is the sequence itself reliable?
- is it as complete as we think?
- alternative transcripts?

→ Quick check: BLAT at UCSC

In this example (translated ORF):
- some exons are missing! (alternatively spliced)
- alternative TSS exists

→ pick a better sequence! (maybe run the predictions on both & compare)

Domain architecture, signal peptide & low complexity regions: PFAM, InterPro
→ molecular interactions (if you’re lucky), e.g. RNA-binding
→ proteins with similar domain architecture (or composition): PFAM, SMART

[Figure labels: low complexity region, signal peptide]

Small motifs: targeting, modifications, transmembrane regions

• Modifications → part 1

• Targeting: TargetP (part of ProtFun, see part 1)

• Disorder, secondary structure, coiled coils etc: Quick2D (at MPI)

• Transmembrane regions: TMHMM, also: Quick2D, SABLE

Quick2D output

Transmembrane Regions: TMHMM (at CBS), in ProtFun

Genomic context & phylogenetic occurrence:

STRING at EMBL:

Which interactions are supported by different methods?

Protein isoforms and the prediction of modifications

BLAT at UCSC → alternative transcripts → protein isoforms

Also: check SWISSPROT!

Do they show differences in their potential modification sites? (How could that affect function?)

e.g. SWISSPROT:TAU_HUMAN (pos. 30-120)

Interpretation of potential modifications

Predicted phosphorylation sites → protein-protein interactions?

→ ScanSite at MIT (see part 1)

SUMMARY: Prediction of modification sites in the context of protein function prediction

• Prediction of protein modifications is often/best done in the context of protein function prediction (comprehensive protein annotation)

• Many kinds of signals can be found in such sequences, and often they can provide interesting hypotheses

• Any isoform-specific things? (modifications?)

• Functional consequences of the modification? (e.g. phospho-sites)

• Synergy between analyses! (e.g. structure → modification sites → evolution)

Reviews:
- F. Eisenhaber (2005) Eurekah Bioscience Collection (at NCBI Books) and the online “recipe” at http://mendel.imp.univie.ac.at/RECIPE/
- J. Bienkowska (2005) Expert Rev. Proteomics 2:129
- B. Rost (2003) Cell. Mol. Life Sci. 60:2637

Mutation Effects:

Will a mutation / polymorphism (e.g. SNP) weaken/destroy the potential modification site, or even create a new one?

Example: NetPhosK analysis of p53_HUMAN cancer variants (pos. 151)
→ some modification sites disappear, others appear!

Blom et al. (2004) Proteomics 4:1633
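
The same question can be asked for any motif model: run the prediction on the wild-type and on the variant sequence and compare the two site lists. The sketch below does this with the simple W-X-X-[CW] pattern and the demo sequence from the motif-building part, not with NetPhosK or p53; the two substitutions are hypothetical and chosen only to show a lost and a gained site, and the function names are assumptions.

```python
import re


def apply_substitution(sequence, position, new_residue):
    """Return the sequence with the residue at a 1-based position replaced."""
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("position outside of sequence")
    return sequence[:idx] + new_residue + sequence[idx + 1:]


def motif_positions(regex, sequence):
    """1-based start positions of all (possibly overlapping) motif matches."""
    return {m.start() + 1 for m in re.finditer("(?=" + regex + ")", sequence)}


def compare_sites(regex, wild_type, variant):
    """Report motif start positions lost and gained in the variant."""
    wt_sites = motif_positions(regex, wild_type)
    variant_sites = motif_positions(regex, variant)
    return {"lost": sorted(wt_sites - variant_sites),
            "gained": sorted(variant_sites - wt_sites)}


# Demo sequence and the simple W-X-X-[CW] pattern from this lecture.
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
MOTIF = "W..[CW]"

if __name__ == "__main__":
    # Hypothetical substitutions, for illustration only.
    print(compare_sites(MOTIF, DEMO, apply_substitution(DEMO, 32, "A")))
    print(compare_sites(MOTIF, DEMO, apply_substitution(DEMO, 26, "W")))
```

A single substitution can remove more than one site when sites overlap, and a substitution outside any known site can create a new match, which is the effect described on the slide.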

Overview

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects
- analysis of mass spectrometry data

5. Online Materials: Exercises, Links

Online Materials: Exercises, Links

1. Protein Function & Structure

2. Modifications: Generic Tools

3. Modification-specific Tools

4. Building Your Own Motif

5. Recommended Materials

6. Exercises

http://www.fmi.ch/groups/bioinformatics/ptm/bioinfo.ptm.htm