
The Bioinformatics of Protein Modification

(Part 2)

Lecture 4610, Universität Basel

http://www.biozentrum.unibas.ch/lectures.html

Dr. Michael Rebhan, Friedrich Miescher Institute,

Basel, January 2006

www.fmi.ch

The Bioinformatics of Protein Modification, Michael Rebhan, FMI, 2006

1. Introduction: what role does bioinformatics play?

2. Mining information related to protein modifications
- known modifications
- finding proteins with particular modifications

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects

5. Online Materials: Exercises, Links

Part 2

Predicting modification sites:

Building Your Own Motif:

1. Building the data set

2. Alignment

3. Analysis of the alignment

4. Motif building & search

Predicting modification sites: Building your own motif

1. Building the data set
2. Alignment
3. Analysis of the alignment
4. Motif building & search

Collect all relevant sequences: your own + public

- SRS @ ExPASy: SWISSPROT

- Specialized datasets? → online materials

(PubMed, Google)

Keep in mind:
- how reliable is the data? (direct evidence?)
- importance of the sequence environment around the main motif (see part 1)
→ can reduce false positive rate

Eisenhaber et al. (2004) Proteomics 4, 1614-1625. Prediction of sequence signals for lipid post-translational modifications: Insights from case studies

Example: “C-linked (man)” in the “feature descriptions” (= C-mannosylation)

→ only those with direct experimental evidence! (is the dataset large & diverse enough?)

Example: “C-linked (man)” in the “feature descriptions”

Features look OK → query is OK (no predictions etc.)

Now get more info, incl. sequence environment

Back to the query form:

Retrieve entry instead of feature, and display key fields in output.

Why 11? We had 49 features before?

(each entry (= protein) can carry a number of features (= modifications))

Click on the entry link…(if you’d like to include this protein)

1. Find the features you’d like to include in the data set (“training set”)

2. Click on each feature's position to get the sequence context

3. Build the alignment in FASTA format (by copy & paste, if it’s a small set)

4. Import into alignment viewers (like Jalview, www.jalview.org)
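For larger sets, copy & paste can be replaced by a few lines of code. The following is a minimal sketch, assuming you already have the protein sequence and the 1-based positions of the modified residues. The function names, the record names and the flank size of 5 are illustrative choices, not part of SRS or SWISSPROT. The two tryptophans of the demo sequence used later in these slides serve as example positions.

```python
def extract_window(sequence, position, flank, pad="-"):
    """Return the residues from position-flank to position+flank (1-based position).

    Windows that run past either end of the protein are padded, so every
    window has the same length and the modified residue sits in the center.
    """
    index = position - 1  # convert the 1-based feature position to a 0-based index
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    start, end = index - flank, index + flank + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(sequence))
    return left_pad + sequence[max(0, start):end] + right_pad


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Demo sequence from these slides; positions and record names are illustrative.
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
sites = [("demo_W29", 29), ("demo_W32", 32)]  # hypothetical record names

windows = [(name, extract_window(DEMO, pos, flank=5)) for name, pos in sites]
print(to_fasta(windows), end="")
```

Because all windows have the same length, the resulting FASTA file is already an ungapped alignment centered on the modified residue and can be opened directly in an alignment viewer.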

Analysis of the alignment / data set:
- any corrections needed, esp. gaps?
- is it large/diverse enough?
- sorting, try different color views:

In Jalview: By conservation:
- which positions show clear constraints?

→ motif boundaries
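The same question (which positions show clear constraints?) can be asked numerically. The sketch below uses per-column Shannon entropy as a simple stand-in; Jalview computes its own conservation score differently. The toy alignment, the function names and the entropy cutoff are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return None  # the column consists of gaps only
    counts = Counter(residues)
    total = len(residues)
    return sum((n / total) * log2(total / n) for n in counts.values())


def conserved_columns(alignment, max_entropy=0.5):
    """Return 0-based indices of columns whose entropy is at most max_entropy."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i, column in enumerate(zip(*alignment)):
        entropy = column_entropy(column)
        if entropy is not None and entropy <= max_entropy:
            result.append(i)
    return result


# Toy alignment, invented for illustration (not a curated training set).
toy = ["CDHYAWTQWTS",
       "SGEFAWSPCSV",
       "TAPYSWGECSR"]
print(conserved_columns(toy))        # strictly conserved columns
print(conserved_columns(toy, 1.0))   # looser cutoff: candidate motif region
```

Low entropy means strong constraint. The first and last constrained columns give a rough idea of where the motif boundaries lie, but with only a handful of sequences the values are very noisy.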

Other constraints:

- conserved? (“BLAST”)
- secondary structure, accessibility? (Quick2D, SABLE)

… see part 1

Color: Zappo

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

Regular expressions:

[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-[SCGPTVEDQ]-[CW]-[SGEDRANTF]

or: W-X-X-[CW] (in S-rich env.)

→ could be useful, but doesn’t impose a lot of constraints (and no scoring…)

If you’d like to use it anyway, you can scan protein databases with this motif at ScanProsite (ExPASy)…
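For a quick local check before going to ScanProsite, the patterns above can be translated into an ordinary regular expression. This is a minimal sketch that covers only the basic PROSITE elements (single residues, x, [..], {..} and repeat counts); anchors such as < and > are not handled, and the function names are made up for this example. It does not reproduce ScanProsite's options.

```python
import re

# One pattern element: [ABC], {ABC} or a single letter, optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern (elements joined by '-') to a regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            core = core.upper()
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (1-based start, matched text) for all matches, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
LONG = "[WDMLYSFHQ]-[TGSAYF]-[QSGCTNEPA]-W-[TGSAI]-[SCGPTVEDQ]-[CW]-[SGEDRANTF]"

print(find_sites(DEMO, "W-X-X-[CW]"))
print(find_sites(DEMO, LONG))
```

The lookahead group is what allows overlapping hits to be reported, which matters here because the two sites in the demo sequence sit next to each other.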

Which kind of model to use?

- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST
- support vector machines (SVMs)

ScanProsite:

→ enter pattern, options

ScanProsite results:

More: online materials

Which kind of model to use?
- regular expressions (PROSITE patterns)
- profiles, like PSI-BLAST!
- support vector machines (SVMs)

Search with the alignment using PSI-BLAST, e.g. at the Bioinformatics Toolkit (MPI Tuebingen) → PSSM profile (see part 1)

First against SWISSPROT to check which proteins get the highest scores

→ E value: 1000, ungapped alignment
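To see what a PSSM adds over a regular expression (a score per position and residue), here is a minimal sketch of a log-odds matrix built from an ungapped alignment. It is not PSI-BLAST's procedure: the uniform background, the flat pseudocount of 1 and the toy alignment are simplifying assumptions, and all function names are invented for the example.

```python
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from an ungapped alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: log2((column.count(aa) + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_window(pssm, window):
    """Sum of per-column log-odds; non-standard residues contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))


def scan(pssm, sequence):
    """Score every window of the sequence; return (1-based start, score) pairs."""
    width = len(pssm)
    return [(i + 1, score_window(pssm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]


# Toy alignment, invented for illustration (not a curated training set).
toy = ["CDHYAWTQWTS",
       "SGEFAWSPCSV",
       "TAPYSWGECSR"]
DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"

pssm = build_pssm(toy)
best_start, best_score = max(scan(pssm, DEMO), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Unlike the regular expression, every window gets a score, so hits can be ranked and a cutoff chosen after looking at which known proteins score highest.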

“Validation” / filtering:
- Quick2D: secondary structure, disorder
- conservation (?)

Also: ScanSite (MIT)! (enhanced regular expressions and PSSM search)

SVMs: training data → function (classification / regression)

AutoMotif server (using SVMs)

Need:
- reformat sequences (with a simple replace, e.g. in WordPad)
- register at the AutoMotif site (immediate)
- submit reformatted alignment & search

For classification, SVMs operate by finding a hypersurface in the space of possible inputs. This hypersurface will attempt to split the positive examples from the negative examples. The split will be chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples.
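For the simplest case, a linear separator (a hyperplane) and training examples that can be separated without errors, the "largest distance" idea can be written as an optimization problem. The symbols are introduced here and do not appear in the slides: x_i is the feature vector of example i, y_i is its label (+1 or -1), and w and b define the hyperplane. This is the standard hard-margin formulation, not necessarily the exact variant used by AutoMotif.

```latex
\begin{aligned}
&\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n, \\
&\text{distance from the hyperplane to the nearest example} = \frac{1}{\lVert \mathbf{w} \rVert}.
\end{aligned}
```

Minimizing the norm of w therefore maximizes the distance to the nearest positive and negative examples. Kernels and soft margins extend the same idea to curved hypersurfaces and to data with overlap.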

My dataset is very small and not very diverse – anything I can do?

Collecting & aligning orthologs:

1. Check SWISSPROT for “by similarity” features, and, if that’s not enough, use myHits (SIB) to collect orthologs with considerable variation

(lots of flanking sequence, use 90% identity clustering, against SWISSPROT [and Ensembl], E values 1e-6 and 0.01 select clear hits, then “next cycle”, then align trustworthy hits)

2. Trim the alignment in Jalview (e.g. in myHits), sort by pairwise identity.
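What identity clustering and sorting by pairwise identity do can be sketched in a few lines. This is a simplified greedy filter for already aligned sequences, written for illustration; it is an assumption about the general idea, not the algorithm implemented in myHits or Jalview, and the toy sequences are invented.

```python
def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def filter_redundant(alignment, threshold=0.9):
    """Greedy filter: keep a sequence only if its identity to every
    sequence kept so far is below the threshold."""
    kept = []
    for seq in alignment:
        if all(pairwise_identity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


def sort_by_identity(alignment, reference):
    """Sort aligned sequences by decreasing identity to a reference sequence."""
    return sorted(alignment, key=lambda seq: pairwise_identity(seq, reference),
                  reverse=True)


# Toy aligned sequences, invented for illustration.
a = "ACDEFGHIKL"
b = "ACDEFGHIKV"   # 90% identical to a
c = "ACDWWGHIKL"   # 80% identical to a

print(filter_redundant([a, b, c]))     # b is dropped as redundant with a
print(sort_by_identity([c, b, a], a))
```

Removing near-identical sequences keeps a few closely related orthologs from dominating the motif, which is the point of collecting orthologs "with considerable variation".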

Demo with MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF

Do all these orthologs still carry the same modification?

→ experiments!

Search: PSI-BLAST at MPI (as before)

(this example: 2 C-mannosylation sites next to each other)

Which residues are conserved?

If there are no substrates at all – anything I can do?

You have a kinase, by chance?

→ PREDIKIN: potential substrates for different kinds of kinases, based on sequence and type

→ ideas for experiments …

Brinkworth et al. (2003) PNAS 100:74

Need advice?

Ask a protein sequence analysis expert

SUMMARY: Building your own motif

• Building your own motif is not as hard as you may think

• The main issue: building a good and informative alignment!

• Motif building & search:

• Regular expressions: ScanProsite

• PSSMs: PSI-BLAST at MPI

• SVMs: AutoMotif

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function prediction
- mutation effects

Overview

Protein Function Prediction:

Predicting modifications in the context of function prediction

- Protein isoforms and the prediction of modifications

- Interpretation of potential modifications, e.g. phospho-sites

Protein function prediction:

Predicting modifications in the context of function prediction

What can be (reliably) predicted from the sequence alone?

• Domain architecture (and signal peptides):
→ potential molecular interactions
→ proteins with similar domain architecture

• Tertiary or secondary structure, disorder & accessibility

• Small motifs: targeting, modifications, transmembrane regions, coiled coils

• Genomic context & phylogenetic occurrence: hints on “functional interactions”

• New predictions are coming out all the time …

MARRSVLYFI LLNALINKGQ ACFCDHYAWT QWTSCSKTCN SGTQSRHRQI VVDKYYQENF CEQICSKQET RECNWQRCPI NCLLGDFGPW SDCDPCIEKQ SKVRSVLRPS QFGGQPCTEP

Protein function prediction: our sequence, alternative transcripts

How good/complete is the protein sequence we want to check?

- is the sequence itself reliable?
- is it as complete as we think?
- alternative transcripts?

→ Quick check: BLAT at UCSC

In this example (translated ORF):
- some exons are missing! (alternatively spliced)
- alternative TSS exists

→ pick a better sequence! (maybe run the predictions on both & compare)

Predicting modifications in the context of function prediction

Domain architecture, signal peptide & low complexity regions: PFAM, Interpro
→ molecular interactions (if you’re lucky), e.g. RNA-binding
→ proteins with similar domain architecture (or composition): PFAM, SMART

Low complexity

Signal peptide

Protein function prediction: Predicting modifications in the context of function prediction

Small motifs: targeting, modifications, transmembrane regions

• Modifications → part 1

• Targeting: TargetP (part of ProtFun, see part 1)

• Disorder, secondary structure, coiled coils etc: Quick2D (at MPI)

• Transmembrane regions: TMHMM, also: Quick2D, SABLE

Quick2D output

Transmembrane Regions: TMHMM (at CBS), in ProtFun

Genomic context & phylogenetic occurrence:

STRING at EMBL:

Which interactions are supported by different methods?

Protein isoforms and the prediction of modifications

BLAT at UCSC → alternative transcripts → protein isoforms

Also: check SWISSPROT!

Do they show differences in their potential modification sites? (How could that affect function?)

e.g. SWISSPROT: TAU_HUMAN (pos. 30-120)

Interpretation of potential modifications

Predicted phosphorylation sites → protein-protein interactions?

→ ScanSite at MIT (see part 1)

SUMMARY: Prediction of modification sites in the context of protein function prediction

• Prediction of protein modifications is often/best done in the context of protein function prediction (comprehensive protein annotation)

• Many kinds of signals can be found in such sequences, and often they can provide interesting hypotheses

• Any isoform-specific things? (modifications?)

• Functional consequences of the modification? (e.g. phospho-sites)

• Synergy between analyses! (e.g. structure → modification sites → evolution)

Reviews:
- F. Eisenhaber (2005) Eurekah Bioscience Collection (at NCBI Books) and the online “recipe” at http://mendel.imp.univie.ac.at/RECIPE/
- J. Bienkowska (2005) Expert Rev. Proteomics 2:129
- B. Rost (2003) Cell. Mol. Life Sci. 60:2637

Mutation Effects:

Will a mutation / polymorphism (e.g. SNP) weaken/destroy the potential modification site, or even create a new one?

Example: NetPhosK analysis of p53_HUMAN cancer variants (pos. 151) → some modification sites disappear, others appear!

Blom et al. (2004) Proteomics 4:1633
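The disappear/appear effect can be illustrated with any site predictor by running it on the wild-type and the mutant sequence and comparing the hits. NetPhosK itself is a trained predictor and is not reproduced here. As a stand-in, the sketch below uses the simple W-X-X-[CW] expression and the demo sequence from the motif-building section; the function names and the chosen substitutions are hypothetical.

```python
import re


def motif_starts(sequence, regex):
    """1-based start positions of all (possibly overlapping) motif matches."""
    return {m.start() + 1 for m in re.finditer("(?=" + regex + ")", sequence)}


def mutation_effect(sequence, position, new_residue, regex):
    """Compare motif matches before and after a single substitution.

    Returns (lost, gained) as sorted lists of 1-based start positions.
    """
    index = position - 1  # convert the 1-based position to a 0-based index
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    mutant = sequence[:index] + new_residue + sequence[index + 1:]
    before = motif_starts(sequence, regex)
    after = motif_starts(mutant, regex)
    return sorted(before - after), sorted(after - before)


DEMO = "MARRSVLYFILLNALINKGQACFCDHYAWTQWTSCSKTCNSGTQSRHRQIVVDKYYQENF"
MOTIF = "W..[CW]"  # regex form of W-X-X-[CW]

print(mutation_effect(DEMO, 29, "A", MOTIF))  # hypothetical W29A: a site is lost
print(mutation_effect(DEMO, 26, "W", MOTIF))  # hypothetical H26W: a site is gained
```

The same before/after comparison works with scores instead of matches, which is closer to what a predictor such as NetPhosK reports.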

3. Predicting modification sites in proteins:
- general concepts
- filtering and interpretation
- generic tools
- modification-specific tools and issues
- building your own motif

4. Related topics:
- protein function
- mutation effects
- analysis of mass spectrometry data

Overview

Online Materials: Exercises, Links

1. Protein Function & Structure

2. Modifications: Generic Tools

3. Modification-specific Tools

4. Building Your Own Motif

5. Recommended Materials

6. Exercises

http://www.fmi.ch/groups/bioinformatics/ptm/bioinfo.ptm.htm
