Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004.


Identification of Protein Domains

Eden Dror, Menachem Schechter

Computational Biology Seminar 2004


Overview

• Introduction to protein domains.
– Classification of homologs.

• Representing a domain.
– PSSM
– HMM

• Internet resources
– Pfam
– SMART
– PROSITE
– InterPro

• Research example.


Protein domains

• A discrete portion of a protein assumed to fold independently, and possessing its own function.

• Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.


Protein domains

• The assumption: The domain is the fundamental unit of protein structure and function.

• Protein family – all proteins containing a specific domain.


What can we learn from them?

• Common ancestry and homology relationships within a set of proteins.

• Homology can imply properties of a protein, such as function and localization.

• Therefore, domains can be used to assign a new protein to a family and so infer its function.


Classification of homologs

• Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.

• Homologous genes arise in two major ways:
– Gene duplication (within the same species).
– Speciation (the splitting of one species into two).


Classification of homologs

• Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.

• Paralogs – Two genes that derive from a single gene that was duplicated within a genome.


Classification of homologs

[Figure: gene tree with paralog (para) and ortholog (ortho) pairs labelled.]


Classification of homologs

• Inparalogs - paralogs that evolved by gene duplication after the speciation event.

• Outparalogs - paralogs that evolved by gene duplication before the speciation event.


Classification of homologs

[Figure: gene tree with out-paralogs (out-para) and in-paralogs (in-para) labelled, when comparing human with worm.]


What can we learn from them?

• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.

• Paralogous proteins are important for detecting lineage-specific adaptations.

• Both can reveal information about a specific species or a set of species.


Protein domains – summary

• By identifying domains we can:
– Infer the function and localization of a protein.
– Learn about a specific species.
– Learn about a set of species as a group.


Domain representation

• Different methods to represent (model) domains:
– Patterns (regular expressions).
– PSSM (position-specific score matrix).
– HMM (hidden Markov model).
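To make the "pattern" representation concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. The pattern string is written in the style of a C2H2 zinc finger signature and the sequence is made up; both are illustrative assumptions, and the converter handles only the basic syntax (residues, `x`, `[...]`, `{...}` and repeat counts), not anchors.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as "(3)" or "(2,4)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {ABC} means anything except A, B, C
        # [ABC] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

# Illustrative pattern and toy sequence (assumptions, not taken from the slides).
C2H2_LIKE = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(C2H2_LIKE))
toy_sequence = "MKPYECAECGKAFSRSSNLTQHQRIHTGEK"
match = regex.search(toy_sequence)
print(match.span() if match else "no match")
```

A pattern either matches or it does not, which is why the score-based models that follow are usually more sensitive for divergent family members.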


PSSM

• Position-specific score matrix

• A matrix giving the score for observing each amino acid at each position of the modelled sequence region.

• Based on the independent probabilities P(a|i) of observing amino acid a in position i.
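One common way to turn the per-position probabilities P(a|i) into scores is a log-odds ratio against a background frequency. The sketch below builds such a matrix from an ungapped alignment; the uniform background, the pseudocount of 1 and the toy alignment are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return a list of dicts with pssm[i][a] = log2(P(a|i) / q_a)."""
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("alignment rows must have equal length")
    pssm = []
    for i in range(length):
        # Start every count at the pseudocount so unseen residues stay finite.
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for seq in alignment:
            counts[seq[i]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2((counts[a] / total) / background[a])
                     for a in AMINO_ACIDS})
    return pssm

# Toy ungapped alignment (hypothetical data).
toy_alignment = ["ACDE", "ACDF", "GCDE", "ACEE"]
pssm = build_pssm(toy_alignment)
print(round(pssm[1]["C"], 3), round(pssm[1]["W"], 3))
```

Conserved residues receive positive scores and residues never seen at a position receive negative ones, which is the behaviour a score matrix needs.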


PSSM: Example


PSSM: Identifying a domain

• Given a sequence and a PSSM:
– Run over all positions.
– Score each sub-sequence according to the matrix.
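The scan described above can be sketched as a sliding window. The three-column matrix, the default score for unlisted residues and the sequence are hypothetical values chosen only to make the example small.

```python
def scan(sequence, pssm, default=-4.0):
    """Score every window of len(pssm) residues; return (start, score) pairs."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the column score of each residue; unlisted residues get `default`.
        score = sum(column.get(residue, default)
                    for column, residue in zip(pssm, window))
        hits.append((start, score))
    return hits

# Hypothetical 3-column log-odds matrix.
toy_pssm = [
    {"C": 2.0, "A": 0.5},
    {"D": 1.5, "E": 1.0},
    {"H": 2.5},
]
hits = scan("MACDHCEHK", toy_pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, best_score)
```

In practice a window would be reported as a domain match only if its score exceeds a chosen threshold; how that threshold is set is outside this sketch.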


HMM: Hidden Markov Model

• Markov model: a way of describing a process that goes through a series of states.

• Each state has a probability of transitioning to the other states.

• x_i is a random variable denoting the state at step i.

[Diagram: states x1, x2, x3, x4 in sequence]


HMM: Markov Model

• Example: states are {0,1}

x1=0 x2=0 x3=0 x4=0

x1=1 x2=0 x3=0 x4=1

x1=0 x2=1 x3=1 x4=1


HMM: Markov Model

• Transition matrix:

\[
A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix},
\qquad a_{ij} = P(x_{k+1} = j \mid x_k = i)
\]
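As a small worked example, the probability of a particular state sequence is the initial probability times the product of the transition probabilities along the chain. The sketch assumes the transition matrix on this slide reads A = [[0.6, 0.4], [0.2, 0.8]] and adds a uniform initial distribution, which the slides do not specify.

```python
# Assumed reading of the slide's matrix: A[i][j] = P(x_{k+1} = j | x_k = i).
A = [[0.6, 0.4],
     [0.2, 0.8]]

def chain_probability(states, transition, initial):
    """Probability of a state sequence under a first-order Markov chain."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= transition[prev][nxt]
    return p

initial = [0.5, 0.5]  # assumption: both start states equally likely
p = chain_probability([1, 0, 0, 1], A, initial)
print(p)  # 0.5 * 0.2 * 0.6 * 0.4
```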


HMM: Markov Model

• State transition example:
– States are the nucleotides A, T, G, C.


HMM: Hidden Markov Model

• Hidden Markov model:
– Each state x emits an output y with a specific probability.
– We only know the outputs (observations).
– Thus, the states are hidden.

[Diagram: each hidden state x1, x2, x3, x4 emits an observed output y1, y2, y3, y4]


HMM: Hidden Markov Model

• Example: states are {0,1}, output {0,1}

y1 =1 y2 =1 y3 =0 y4 =0

x1 =0 x2 =1 x3 =1 x4 =1

y1 =1 y2 =0 y3 =1 y4 =0

x1 =1 x2 =0 x3 =0 x4 =1


HMM: Hidden Markov Model

• Emission matrix:

\[
B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix},
\qquad b_{ij} = P(y_k = j \mid x_k = i)
\]


HMM: What can we do with it?

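• Given (A, B), we can compute:

– The probability of given states and outputs:

\[
P(x_1 x_2 \cdots x_n,\; y_1 y_2 \cdots y_n) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2) \cdots P(x_n \mid x_{n-1})\,P(y_n \mid x_n)
\]

– The probability of a given output sequence:

\[
P(y_1 y_2 \cdots y_n) = \sum_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n,\; y_1 y_2 \cdots y_n)
\]

– The most likely sequence of states that generated a given output sequence:

\[
\arg\max_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n \mid y_1 y_2 \cdots y_n)
\]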
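The three quantities on this slide can be computed without enumerating every state path. The sketch below uses the forward algorithm for the output probability and the Viterbi algorithm for the most likely path, and keeps a brute-force sum for comparison. The matrices A and B are an assumed reading of the earlier slides, and the uniform initial distribution PI is an assumption. Maximising P(x | y) over x gives the same path as maximising the joint P(x, y), which is what the code does.

```python
from itertools import product

A = [[0.6, 0.4], [0.2, 0.8]]      # transitions, assumed from the earlier slide
B = [[0.1, 0.9], [0.85, 0.15]]    # emissions, assumed from the earlier slide
PI = [0.5, 0.5]                   # initial distribution (assumption)

def joint(states, outputs):
    """P(x_1..x_n, y_1..y_n) for one explicit state path."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p

def brute_force_total(outputs):
    """Sum the joint probability over every state path (exponential cost)."""
    return sum(joint(path, outputs)
               for path in product(range(len(PI)), repeat=len(outputs)))

def forward(outputs):
    """P(y_1..y_n) by dynamic programming (linear in sequence length)."""
    f = [PI[s] * B[s][outputs[0]] for s in range(len(PI))]
    for y in outputs[1:]:
        f = [sum(f[r] * A[r][s] for r in range(len(f))) * B[s][y]
             for s in range(len(f))]
    return sum(f)

def viterbi(outputs):
    """Most likely state path and its joint probability."""
    n = len(PI)
    v = [PI[s] * B[s][outputs[0]] for s in range(n)]
    back = []
    for y in outputs[1:]:
        # Best predecessor for each state, then extend the best path by one step.
        ptr = [max(range(n), key=lambda r: v[r] * A[r][s]) for s in range(n)]
        v = [v[ptr[s]] * A[ptr[s]][s] * B[s][y] for s in range(n)]
        back.append(ptr)
    last = max(range(n), key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

outputs = [1, 1, 0, 0]
print(forward(outputs), brute_force_total(outputs))
print(viterbi(outputs))
```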


HMM: What can we do with it?

• Learning:
– Given state and output sequences, calculate the most probable (A, B).
– Easy when the states are known: count the observed transitions and emissions.
– Otherwise, use an iterative training algorithm such as Baum-Welch.
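When the state paths are known, the most probable (A, B) can be obtained by counting. A minimal sketch, using the two labelled example sequences from the earlier slide as training data; the function name and the optional pseudocount are assumptions.

```python
def estimate(state_seqs, output_seqs, n_states, n_outputs, pseudocount=0.0):
    """Maximum-likelihood (A, B) from sequences with known state paths."""
    trans = [[pseudocount] * n_states for _ in range(n_states)]
    emit = [[pseudocount] * n_outputs for _ in range(n_states)]
    for xs, ys in zip(state_seqs, output_seqs):
        for x, y in zip(xs, ys):
            emit[x][y] += 1            # state x emitted output y
        for prev, nxt in zip(xs, xs[1:]):
            trans[prev][nxt] += 1      # transition prev -> nxt was observed

    def normalise(rows):
        # Turn each row of counts into a probability distribution.
        return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in rows]

    return normalise(trans), normalise(emit)

states = [[0, 1, 1, 1], [1, 0, 0, 1]]
outputs = [[1, 1, 0, 0], [1, 0, 1, 0]]
A_hat, B_hat = estimate(states, outputs, n_states=2, n_outputs=2)
print(A_hat)
print(B_hat)
```

With so little data the estimates are far from the matrices used to illustrate the model, which is the usual argument for pseudocounts and larger training sets.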


HMM: Profile HMM

• Use HMM to represent sequence families.

• A particular type of HMM suited to modeling multiple alignments.

• (Assume we have a multiple alignment).


HMM: Trivial profile HMM

• We begin with ungapped regions.

• Each position corresponds to a state.

• Transitions are of probability 1.


HMM: Trivial profile HMM

• Let ei(a) be the independent probability of observing amino acid a in position i.

• The probability of a new sequence x, according to the model:

\[
P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)
\]


HMM: Trivial profile HMM

• We can score the sequence x:

\[
S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}
\]

• Here q_a is the probability of amino acid a under a random model.


HMM: Trivial profile HMM

• Consider the values

\[
\log \frac{e_i(x_i)}{q_{x_i}}
\]

• They behave like elements in a score matrix.

• The trivial profile HMM is equivalent to a PSSM.


HMM: profile HMM

• Let’s untrivialize by allowing for gaps: insertions and deletions.

• Start off with the PSSM HMM.


HMM: profile HMM

• Handling insertions:

• Introduce new states I_j, which model insertions after position j.

• These states emit amino acids with the probabilities of the random model.


HMM: profile HMM

• The score of a gap of length k:

\[
\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}
\]
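Read as log a(M_j→I_j) + log a(I_j→M_{j+1}) + (k-1) log a(I_j→I_j), the gap score is an affine penalty: a fixed cost for opening the insertion plus a constant cost for each extra inserted residue. A minimal sketch with hypothetical transition probabilities:

```python
import math

def insertion_gap_score(k, a_mi, a_ii, a_im):
    """Log score of an insertion of length k >= 1 after a match state.

    a_mi: P(M_j -> I_j), a_ii: P(I_j -> I_j), a_im: P(I_j -> M_{j+1}).
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return math.log(a_mi) + math.log(a_im) + (k - 1) * math.log(a_ii)

# Hypothetical transition probabilities, chosen only for illustration.
a_mi, a_ii, a_im = 0.05, 0.4, 0.6
for k in (1, 2, 3):
    print(k, round(insertion_gap_score(k, a_mi, a_ii, a_im), 3))
```

Each additional residue changes the score by log a_ii, so a small self-transition probability penalises long insertions heavily.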


HMM: profile HMM

• Handling deletions:

• Introduce silent states D_j.

• These states do not emit.


HMM: profile HMM

• The complete profile HMM:
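The slide's figure is not preserved here, so the sketch below enumerates the transitions of one common textbook layout: each column has a match, an insert and a delete state, with Begin written as M0 and End as M(n+1). The naming and the exact set of allowed transitions are assumptions; some implementations omit the insert-to-delete and delete-to-insert transitions.

```python
def profile_hmm_transitions(n):
    """List allowed (source, target) transitions for n match columns."""
    edges = []
    for j in range(n + 1):
        # D0 does not exist: Begin (M0) and I0 are the only sources at j = 0.
        sources = [f"M{j}", f"I{j}"] + ([f"D{j}"] if j >= 1 else [])
        for src in sources:
            edges.append((src, f"M{j + 1}"))      # move on to the next match column
            edges.append((src, f"I{j}"))          # insert residues after column j
            if j + 1 <= n:
                edges.append((src, f"D{j + 1}"))  # skip column j + 1 silently
    return edges

edges = profile_hmm_transitions(3)
print(len(edges))
print(edges[:6])
```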


Internet resources

• Databases of protein families.
• Family information and identification.

• Considerations:
– Type of representation (pattern, PSSM, HMM).
– Choice of seed multiple alignment proteins.
– Quality control.
– Database features (links, annotations, views).
– Database specificity (organism, functions).


Pfam: Home


Pfam

• Protein families database of alignments and HMMs

• Uses profile-HMMs to represent families.

• For each family in Pfam you can:
– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures


Pfam: Databases

Two databases:

• Pfam-A – curated multiple alignments.
– Grows slowly.
– Quality controlled by experts.

• Pfam-B – automatic clustering (ProDom derived).
– Complements Pfam-A.
– New sequences instantly incorporated.
– Unchecked: false positives, etc.


Pfam: Features

• Search by: Sequence, keyword, domain, taxonomy.

• Browsing by family or genome.

• Evolutionary tree


Pfam: Construction

• Source of seed alignments:
– Pfam-B families.
– Published articles.
– 'Domain hunting' studies.
– Occasionally, entries from other databases (e.g. MEROPS for peptidases).


Pfam: Domain information


Pfam: Domain organization


Pfam: Multiple alignment


Pfam: HMM logo


Pfam: Species distribution


Pfam: Genome comparison


PROSITE

• Database of protein families.

• Matching according to simple patterns or PSSM profiles.

• Browsing all proteins of a specific family.

• The latest release contains 1696 protein families.


PROSITE: Features

• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
– Specificity: true-pos / all-pos, where all-pos = true-pos + false-pos
– Sensitivity: true-pos / (true-pos + false-neg)
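The two measures as defined on this slide can be sketched directly. The counts are hypothetical. The quantity called specificity here, true positives over all reported positives, is often called precision elsewhere.

```python
def specificity(true_pos, false_pos):
    """Fraction of reported matches that are real family members."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos, false_neg):
    """Fraction of real family members that were reported."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for one pattern scanned against a database.
tp, fp, fn = 90, 10, 30
print(specificity(tp, fp), sensitivity(tp, fn))
```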


PROSITE: Example

• Specificity of Zinc finger C2H2 type domain



SMART

• Simple Modular Architecture Research Tool

• Identification and annotation of genetically mobile domains and the analysis of domain architectures.

• SMART consists of a library of HMMs.

• Contains 665 HMMs to date.


SMART: Features

• Finding proteins containing specific domains, i.e. of the same family
• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information


SMART: Domain selection example

Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
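A domain selection with AND amounts to keeping the proteins whose domain set contains every requested domain. The sketch below shows that logic on made-up annotations; the protein names and their domain sets are hypothetical, and this is not SMART's own interface.

```python
# Hypothetical annotations: protein name -> set of domain names.
annotations = {
    "protA": {"TyrKc", "TRANS", "SH2"},
    "protB": {"TyrKc", "SH2", "SH3"},
    "protC": {"TRANS"},
}

def select(annotations, required):
    """Return proteins that contain all of the required domains (logical AND)."""
    required = set(required)
    return sorted(name for name, domains in annotations.items()
                  if required <= domains)

hits = select(annotations, ["TyrKc", "TRANS"])
print(hits)
```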


InterPro

• InterPro combines 9 other databases, including SMART, Pfam and ProDom.

• Queries can use many different methods, since the member databases each use their own.

• However, thresholds are predefined and cannot be changed for those methods.


InterPro

• Provides more results, but can sometimes be redundant.

• Coverage statistics:
– 93% of Swiss-Prot v42.5: 128540 out of 138922 proteins
– 81% of TrEMBL v25.5: 819966 out of 1013263 proteins


InterPro: Features

• Searching by Protein/DNA sequences

• Finding domains & homologs

• List of InterPro entries of type:
– Family
– Domain
– Repeat
– PTM (post-translational modification)
– Binding site
– Active site
– Keyword


InterPro: Example

• Kringle domain


Research Example: Introduction

• Goal: The systematic identification of novel protein domain families.

• Using computational methods.


Research Example: Method

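• Pipeline steps, in order:
– Derive a set of 107 nuclear domains.
– Extract the proteins containing them.
– Extract the unannotated regions.
– Cluster the sequences.
– Take the longest member of each cluster.
– Run PSI-BLAST.
– Investigate the homologous regions.
– Confirm manually.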
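The "extract unannotated regions" step can be sketched as finding the stretches of a protein not covered by any known domain. The coordinates and the minimum region length below are hypothetical; the slides do not state which threshold the study used.

```python
def unannotated_regions(length, domains, min_length=50):
    """Return 1-based inclusive (start, end) stretches not covered by a domain.

    `domains` is a list of (start, end) pairs; `min_length` is an assumed
    cut-off that drops stretches too short to hold a domain.
    """
    gaps = []
    cursor = 1  # first residue not yet covered by a domain seen so far
    for start, end in sorted(domains):
        if start - cursor >= min_length:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)  # overlapping domains never move it back
    if length - cursor + 1 >= min_length:
        gaps.append((cursor, length))
    return gaps

# Hypothetical protein of 500 residues with three annotated domains.
regions = unannotated_regions(500, [(60, 150), (140, 200), (320, 400)])
print(regions)
```

The resulting stretches would then be clustered and used as PSI-BLAST queries in the later steps of the pipeline.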


Research Example: Results

• 28 new domains identified:
– 15 domains in diverse contexts, in different species.
– 3 species-specific domains.
– 7 domains with weak similarity to previously described domains.
– 3 extension domains.


Predictions of Function

• On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families.

• Examples:
– Chromatin binding
– Protein interaction
– Predicted sub-cellular localization


Predictions of Function: Chromatin-binding example

• The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification.

• SPT6 has a histone-binding capability, experimentally confirmed.

• Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.

• Conclusion: CSZ has a predicted histone binding function.


Predictions of Function: Localization example

• Some of the novel domains are found only within proteins from the initial set, i.e. those containing known nuclear domains.

• This predicts that these domains have a nuclear function.

• The other domains are likely to have roles in both nucleus and cytoplasm.


Conclusion

• Domains are the functional units of proteins.

• Identifying a domain within a new protein may teach us much about it.

• There are several types of models to represent domains.

• These models can also be used to identify the domain they represent.

• Many Internet databases are available to catalogue and identify families.

• A protocol for identifying new domains using known ones.


Resources

• Pfam: http://www.sanger.ac.uk/Software/Pfam/

• SMART: http://smart.embl-heidelberg.de/

• PROSITE: http://www.expasy.org/prosite/

• InterPro: http://www.ebi.ac.uk/interpro/


The End