Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004.


Overview

• Introduction to protein domains.
– Classification of homologs.

• Representing a domain.
– PSSM
– HMM

• Internet resources
– Pfam
– SMART
– PROSITE
– InterPro

• Research example.

Protein domains

• A discrete portion of a protein assumed to fold independently, and possessing its own function.

• Mobile domain (“module”): a domain that can be found associated with different domain combinations in different proteins.

Protein domains

• The assumption: The domain is the fundamental unit of protein structure and function.

• Protein family – all proteins containing a specific domain.

What can we learn from them?

• Common ancestors & homology information of a set of proteins.

• Homology lets us infer properties of a protein, such as function and localization.

• Therefore, domains can be used to assign a new protein to a family and thereby infer its function.

Classification of homologs

• Homology is not a sufficiently well-defined term to describe the evolutionary relationships between genes.

• Homologous genes can arise in two major ways:
– Gene duplication (within the same species).
– Speciation (splitting of one species into two).


• Orthologs – Two genes from two different species that derive from a single gene in the last common ancestor of the species.

• Paralogs – Two genes that derive from a single gene that was duplicated within a genome.


[Figure: gene tree with paralogous (para) and orthologous (ortho) gene pairs labeled]


• Inparalogs - paralogs that evolved by gene duplication after the speciation event.

• Outparalogs - paralogs that evolved by gene duplication before the speciation event.


[Figure: out-paralogs (out-para) and in-paralogs (in-para) when comparing human with worm]

What can we learn from them?

• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.

• Paralogous proteins are important for detecting lineage-specific adaptations.

• Both of them can reveal information on a specific species or a set of species.

Protein domains – summary

• By identifying domains we can:

– Infer the function and localization of a protein.

– Learn about a specific species.
– Learn about a set of species as a group.

Domain representation

• Different methods to represent (model) domains:

• Patterns (regular expressions).
• PSSM (Position specific score matrix).
• HMM (Hidden Markov model).

PSSM

• Position specific score matrix

• A matrix whose entries give the score for observing each amino acid at each position of the modeled sequence region.

• Based on the independent probabilities P(a|i) of observing amino acid a in position i.

PSSM: Example

PSSM: Identifying a domain

• Given a sequence and a PSSM:

• Run over all positions.
• Score each sub-sequence according to the matrix.
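The scan over all positions can be sketched in a few lines. This is a minimal illustration, not a production scanner: the toy matrix `PSSM`, the `DEFAULT_SCORE` penalty for residues missing from a column, and the function names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy PSSM of width 3: one dict of residue scores per position.
PSSM: List[Dict[str, float]] = [
    {"C": 2.0, "A": -1.0},
    {"H": 1.5, "A": 0.5},
    {"C": 2.0, "G": -0.5},
]
# Assumption: residues not listed at a position receive a fixed penalty.
DEFAULT_SCORE = -2.0


def score_window(window: str, pssm: List[Dict[str, float]]) -> float:
    """Sum the position-specific scores of one sub-sequence."""
    return sum(col.get(aa, DEFAULT_SCORE) for aa, col in zip(window, pssm))


def scan(sequence: str, pssm: List[Dict[str, float]]) -> List[Tuple[int, float]]:
    """Score every window of len(pssm) residues; return (start, score) pairs."""
    width = len(pssm)
    return [
        (start, score_window(sequence[start:start + width], pssm))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence: str, pssm: List[Dict[str, float]]) -> Tuple[int, float]:
    """Return the highest-scoring (start, score) pair."""
    return max(scan(sequence, pssm), key=lambda hit: hit[1])


if __name__ == "__main__":
    print(best_hit("GACHCA", PSSM))  # (2, 5.5): the window "CHC"
```

In practice a hit would only be reported when its score exceeds some threshold; choosing that threshold is outside the scope of this sketch.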

HMM: Hidden Markov Model

• Markov model: a way of describing a process that goes through a series of states.

• Each state has a probability of transitioning to the other states.

• x_i is a random variable representing the state at step i.

[Figure: chain of states x1 → x2 → x3 → x4]

HMM: Markov Model

• Example: states are {0,1}

x1 =0 x2 =0 x3 =0 x4 =0

x1 =1 x2 =0 x3 =0 x4 =1

x1=0 x2=1 x3 =1 x4 =1

HMM: Markov Model

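• Transition matrix:

$$A = (a_{ij}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.2 & 0.8 \end{pmatrix}, \qquad a_{ij} = P(x_{k+1} = j \mid x_k = i)$$

[Figure: chain of states x1 → x2 → x3 → x4]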
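As a worked example, the probability of a whole state sequence under a Markov model is the probability of the first state times the product of the transition probabilities along the chain. The sketch below uses the transition values 0.6/0.4 and 0.2/0.8 from this slide; the uniform initial distribution `INITIAL` is an assumption, since the slides do not give one.

```python
from typing import List, Sequence

# Transition matrix as read from the slide: A[i][j] = P(x_{k+1} = j | x_k = i).
A: List[List[float]] = [
    [0.6, 0.4],
    [0.2, 0.8],
]
# Assumption: no initial distribution is given, so a uniform one is used.
INITIAL: List[float] = [0.5, 0.5]


def chain_probability(states: Sequence[int]) -> float:
    """P(x_1 ... x_n) = P(x_1) * product over k of a[x_k][x_{k+1}]."""
    prob = INITIAL[states[0]]
    for current, nxt in zip(states, states[1:]):
        prob *= A[current][nxt]
    return prob


if __name__ == "__main__":
    # The three example state sequences from the previous slide.
    for seq in ([0, 0, 0, 0], [1, 0, 0, 1], [0, 1, 1, 1]):
        print(seq, chain_probability(seq))
```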

HMM: Markov Model

• State transition example:
• States are the nucleotides A, T, G, C.


• Hidden Markov model:
• Each state x emits an output y, at a specific probability.
• We only know the output (observations).
• Thus, the states are hidden.

[Figure: hidden states x1 → x2 → x3 → x4, each state x_k emitting an output y_k]


• Example: states are {0,1}, output {0,1}

y1 =1 y2 =1 y3 =0 y4 =0

x1 =0 x2 =1 x3 =1 x4 =1

y1 =1 y2 =0 y3 =1 y4 =0

x1 =1 x2 =0 x3 =0 x4 =1


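• Emission matrix:

$$B = (b_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.85 & 0.15 \end{pmatrix}, \qquad b_{ij} = P(y_k = j \mid x_k = i)$$

[Figure: hidden states x1 → x2 → x3 → x4, each state x_k emitting an output y_k]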

HMM: What can we do with it?

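• Given (A, B):

• Probability of given states and outputs:

$$P(x_1 x_2 \cdots x_n,\, y_1 y_2 \cdots y_n) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2) \cdots P(x_n \mid x_{n-1})\,P(y_n \mid x_n)$$

• Probability of a given output sequence:

$$P(y_1 y_2 \cdots y_n) = \sum_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n,\, y_1 y_2 \cdots y_n)$$

• Most likely sequence of states that generated a given output sequence:

$$\operatorname*{arg\,max}_{x_1 \cdots x_n} P(x_1 x_2 \cdots x_n \mid y_1 y_2 \cdots y_n)$$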
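The three quantities on this slide can be computed directly for the two-state example. The sketch below is illustrative only: the matrices use the transition values (0.6/0.4, 0.2/0.8) and emission values (0.1/0.9, 0.85/0.15) from the earlier slides, the uniform initial distribution `PI` is an assumption, and the forward and Viterbi recursions are the standard dynamic-programming algorithms for these tasks, which the slides do not name explicitly.

```python
from itertools import product
from typing import List, Sequence, Tuple

A = [[0.6, 0.4], [0.2, 0.8]]      # A[i][j] = P(x_{k+1} = j | x_k = i)
B = [[0.1, 0.9], [0.85, 0.15]]    # B[i][j] = P(y_k = j | x_k = i)
PI = [0.5, 0.5]                   # assumption: uniform initial distribution
N_STATES = len(A)


def joint(states: Sequence[int], outputs: Sequence[int]) -> float:
    """P(x_1..x_n, y_1..y_n): product of transition and emission terms."""
    p = PI[states[0]] * B[states[0]][outputs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][outputs[k]]
    return p


def likelihood_brute_force(outputs: Sequence[int]) -> float:
    """P(y_1..y_n) as the literal sum over all state sequences."""
    return sum(joint(xs, outputs)
               for xs in product(range(N_STATES), repeat=len(outputs)))


def forward(outputs: Sequence[int]) -> float:
    """P(y_1..y_n) via the forward recursion (linear in n)."""
    alpha = [PI[i] * B[i][outputs[0]] for i in range(N_STATES)]
    for y in outputs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N_STATES)) * B[j][y]
                 for j in range(N_STATES)]
    return sum(alpha)


def viterbi(outputs: Sequence[int]) -> Tuple[List[int], float]:
    """Most likely state path and its joint probability with the outputs."""
    best = [PI[i] * B[i][outputs[0]] for i in range(N_STATES)]
    back = []  # back[k][j]: best predecessor of state j at step k + 1
    for y in outputs[1:]:
        pointers, nxt = [], []
        for j in range(N_STATES):
            i_best = max(range(N_STATES), key=lambda i: best[i] * A[i][j])
            pointers.append(i_best)
            nxt.append(best[i_best] * A[i_best][j] * B[j][y])
        back.append(pointers)
        best = nxt
    state = max(range(N_STATES), key=lambda i: best[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1], max(best)


if __name__ == "__main__":
    ys = [1, 1, 0, 0]  # first output sequence from the example slide
    print(forward(ys), likelihood_brute_force(ys))
    print(viterbi(ys))
```

Maximizing the joint probability over state paths gives the same path as maximizing the conditional probability, because the two differ only by the constant factor P(y_1 ... y_n).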

HMM: What can we do with it?

• Learning:

• Given state and output sequences, calculate the most probable (A, B).

• Easy when the states are known.

• Otherwise: use a training algorithm.
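When the states are known, the most probable (A, B) are simply the observed relative frequencies of transitions and emissions. The sketch below illustrates that easy case on one of the example state/output pairs from the earlier slide; the function name, the optional `pseudocount` parameter, and the uniform fallback for states that never occur are assumptions of this example. The hidden-state case, which needs an iterative training algorithm, is not covered.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def estimate(states: Sequence[int], outputs: Sequence[int],
             n_states: int = 2, n_outputs: int = 2,
             pseudocount: float = 0.0) -> Tuple[Matrix, Matrix]:
    """Estimate (A, B) from a labeled sequence by counting and normalizing."""
    transitions = Counter(zip(states, states[1:]))   # (from, to) -> count
    emissions = Counter(zip(states, outputs))        # (state, output) -> count

    def normalize(counts: Counter, n_rows: int, n_cols: int) -> Matrix:
        matrix = []
        for i in range(n_rows):
            row = [counts[(i, j)] + pseudocount for j in range(n_cols)]
            total = sum(row)
            # Assumption: a state that was never seen gets a uniform row.
            matrix.append([c / total if total else 1.0 / n_cols for c in row])
        return matrix

    return (normalize(transitions, n_states, n_states),
            normalize(emissions, n_states, n_outputs))


if __name__ == "__main__":
    a_hat, b_hat = estimate(states=[0, 1, 1, 1], outputs=[1, 1, 0, 0])
    print(a_hat)
    print(b_hat)
```

With such a short training sequence many entries come out as exactly 0 or 1, which is why a small pseudocount is often added in practice.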

HMM: Profile HMM

• Use HMM to represent sequence families.

• A particular type of HMM suited to modeling multiple alignments.

• (Assume we have a multiple alignment).

HMM: Trivial profile HMM

• We begin with ungapped regions.

• Each position corresponds to a state.
• Transitions are of probability 1.


• Let ei(a) be the independent probability of observing amino acid a in position i.

• The probability of a new sequence x, according to the model:

$$P(x \mid M) = \prod_{i=1}^{N} e_i(x_i)$$


• We can score the sequence x:

$$S = \sum_{i=1}^{N} \log \frac{e_i(x_i)}{q_{x_i}}$$

• Where q indicates the probability under a random model.


• Consider the values $\log \frac{e_i(x_i)}{q_{x_i}}$.

• They behave like elements in a score matrix.

• The trivial profile HMM is equivalent to a PSSM.
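To make the equivalence concrete, the log-odds values log(e_i(a) / q_a) can be tabulated from an ungapped alignment and then used exactly like a PSSM. Everything specific in this sketch is an assumption made for illustration: the toy `ALIGNMENT`, the four-letter `ALPHABET`, the uniform background `BACKGROUND`, and the add-one `PSEUDOCOUNT` used to avoid taking the log of zero.

```python
import math
from collections import Counter
from typing import Dict, List

ALIGNMENT = ["CHC", "CHC", "CAC", "SHC"]   # hypothetical ungapped alignment
ALPHABET = "ACHS"                          # toy alphabet; proteins use 20 residues
BACKGROUND = {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}  # uniform q (assumption)
PSEUDOCOUNT = 1.0                          # add-one smoothing (assumption)


def emission_probabilities(alignment: List[str]) -> List[Dict[str, float]]:
    """e_i(a): smoothed frequency of residue a in alignment column i."""
    columns = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        columns.append({aa: (counts[aa] + PSEUDOCOUNT) / total
                        for aa in ALPHABET})
    return columns


def log_odds_matrix(alignment: List[str]) -> List[Dict[str, float]]:
    """Score matrix with entries log(e_i(a) / q_a)."""
    return [{aa: math.log(e[aa] / BACKGROUND[aa]) for aa in ALPHABET}
            for e in emission_probabilities(alignment)]


def score(sequence: str, matrix: List[Dict[str, float]]) -> float:
    """S = sum over positions i of log(e_i(x_i) / q_{x_i})."""
    return sum(col[aa] for aa, col in zip(sequence, matrix))


if __name__ == "__main__":
    m = log_odds_matrix(ALIGNMENT)
    print(score("CHC", m), score("AAA", m))
```

A positive score means the sequence is more probable under the model than under the random model; a negative score means the opposite.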

HMM: profile HMM

• Let’s untrivialize by allowing for gaps: insertions and deletions.

• Start off with the PSSM HMM.

HMM: profile HMM

• Handling insertions:

• Introduce new states I_j, which match insertions after position j.

• These states emit residues with the probabilities of the random model.

HMM: profile HMM

• The score of a gap of length k:

$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1) \log a_{I_j I_j}$$
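A quick numeric check of this gap score: entering the insert state costs log a(M_j → I_j), each additional inserted residue costs log a(I_j → I_j), and leaving costs log a(I_j → M_{j+1}). The transition probabilities used below are hypothetical values chosen only to illustrate the arithmetic.

```python
import math


def insertion_score(k: int, a_match_to_insert: float,
                    a_insert_to_insert: float,
                    a_insert_to_match: float) -> float:
    """Log-score of an insertion of length k after match state M_j."""
    if k < 1:
        raise ValueError("insertion length k must be at least 1")
    return (math.log(a_match_to_insert)
            + (k - 1) * math.log(a_insert_to_insert)
            + math.log(a_insert_to_match))


if __name__ == "__main__":
    # Hypothetical transition probabilities, for illustration only.
    for k in (1, 2, 3):
        print(k, insertion_score(k, 0.1, 0.5, 0.5))
```

The score is linear in k beyond the first residue, so the profile HMM effectively uses an affine gap penalty: a gap-open term plus a per-residue extension term log a(I_j → I_j).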

HMM: profile HMM

• Handling deletions:

• Introduce silent states Dj.

• These states do not emit.

HMM: profile HMM

• The complete profile HMM:

Internet resources

• Databases of protein families.
• Family information and identification.

• Considerations:
– Type of representation (pattern, PSSM, HMM).
– Choice of seed multiple alignment proteins.
– Quality control.
– Database features (links, annotations, views).
– Database specificity (organism, functions).

Pfam: Home

Pfam

• Protein families database of alignments and HMMs

• Uses profile-HMMs to represent families.

• For each family in Pfam you can:
– Look at multiple alignments
– View protein domain architectures
– Examine species distribution
– Follow links to other databases
– View known protein structures

Pfam: Databases

2 databases:
• Pfam-A – curated multiple alignments.
– Grows slowly.
– Quality controlled by experts.

• Pfam-B – automatic clustering (ProDom derived).
– Complements Pfam-A.
– New sequences instantly incorporated.
– Unchecked: false positives, etc.

Pfam: Features

• Search by: Sequence, keyword, domain, taxonomy.

• Browsing by family or genome.

• Evolutionary tree

Pfam: Construction

• Source of seed alignments:
– Pfam-B families.
– Published articles.
– 'Domain hunting' studies.
– Occasionally, entries from other databases (e.g. MEROPS for peptidases).

Pfam: Domain information

Pfam: Domain organization

Pfam: Multiple alignment

Pfam: HMM logo

Pfam: Species distribution

Pfam: Genome comparison

PROSITE

• Database of protein families.

• Matching according to simple patterns or PSSM profiles.

• Browsing all proteins of a specific family.

• The latest release contains 1696 protein families.
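Pattern matching of this kind maps naturally onto regular expressions. The sketch below converts a pattern written in PROSITE-style syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats) into a Python regular expression. The C2H2-style pattern and the test sequence are illustrative assumptions and should be checked against the database entry before any real use; the converter handles only the syntax elements listed here.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # x is the wildcard; residues are upper case, so this is unambiguous.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# Illustrative C2H2 zinc-finger-style pattern (an assumption for this example).
C2H2_LIKE = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"

if __name__ == "__main__":
    regex = prosite_to_regex(C2H2_LIKE)
    print(regex)
    # Hypothetical sequence containing one match starting at index 2.
    sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
    match = re.search(regex, sequence)
    print(match.start() if match else None)
```

Patterns are simple and fast to match but are all-or-nothing: a single substitution outside the allowed residues causes a miss, which is one reason profile representations are also used.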

PROSITE: Features

• Comprehensive domain documentation.
• All profile matches checked by experts.
• Specificity/sensitivity:
• Specificity: true-pos / all-pos (all reported matches)
• Sensitivity: true-pos / (true-pos + false-neg)
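The two measures are simple ratios of match counts. In the sketch below, "all-pos" is interpreted as all reported matches, i.e. true positives plus false positives, which is an assumption about the slide's shorthand; the counts are hypothetical. Under that reading, the slide's "specificity" is the quantity often called precision elsewhere.

```python
def specificity(true_pos: int, false_pos: int) -> float:
    """Fraction of reported matches that are real: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real family members that are found: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


if __name__ == "__main__":
    # Hypothetical counts for one pattern scanned against a database.
    tp, fp, fn = 90, 10, 30
    print(specificity(tp, fp))  # 0.9
    print(sensitivity(tp, fn))  # 0.75
```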

PROSITE: Example

• Specificity of Zinc finger C2H2 type domain

SMART

• Simple Modular Architecture Research Tool

• Identification and annotation of genetically mobile domains and the analysis of domain architectures.

• SMART consists of a library of HMMs.

• Contains 665 HMMs to date.

SMART: Features

• Finding proteins that contain specific domains, i.e. members of the same family

• Function prediction
• Sub-cellular localization
• Binding partners
• Architecture
• Alternative splicing information
• Orthology information

SMART: Domain selection example

Tyrosine kinase (TyrKc) AND Transmembrane region (TRANS)
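Conceptually, a query such as "TyrKc AND TRANS" is a subset test on each protein's set of annotated domains. The sketch below illustrates the idea on made-up data; it is not SMART's interface, and the protein names and the `ANNOTATIONS` table are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical annotation table: protein name -> set of domain names.
ANNOTATIONS: Dict[str, Set[str]] = {
    "protein_a": {"TyrKc", "TRANS", "SH2"},
    "protein_b": {"TyrKc", "SH2"},
    "protein_c": {"TRANS"},
}


def select(annotations: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Return the proteins whose architecture contains every required domain."""
    return sorted(name for name, domains in annotations.items()
                  if required <= domains)


if __name__ == "__main__":
    print(select(ANNOTATIONS, {"TyrKc", "TRANS"}))  # ['protein_a']
```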

InterPro

• InterPro combines 9 other databases, including SMART, Pfam and ProDom.

• Queries can use many different methods (as the other databases use different methods).

• However, thresholds are predefined and cannot be changed for those methods.

InterPro

• Provides more results, but can sometimes be redundant.

• Coverage statistics:
• 93% of Swiss-Prot v42.5 – 128540 out of 138922 proteins
• 81% of TrEMBL v25.5 – 819966 out of 1013263 proteins

InterPro: Features

• Searching by Protein/DNA sequences

• Finding domains & homologs

• List of InterPro entries of type:
– Family
– Domain
– Repeat
– PTM (post-translational modification)
– Binding Site
– Active Site
– Keyword

InterPro: Example

• Kringle domain

Research Example: Introduction

• Goal: The systematic identification of novel protein domain families.

• Using computational methods.

Research Example: Method

Derive set of 107 nuclear domains

Extract proteins

Extract unannotated regions

Cluster sequences

Take longest member

PSI-BLAST

Investigate homologous regions

Manual confirmation
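Two of the steps above, extracting unannotated regions and taking the longest member of a cluster, are easy to sketch. The code below is an illustration under stated assumptions and not the study's actual implementation: domain annotations are represented as half-open (start, end) intervals, and the `min_length` cutoff is a hypothetical parameter.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) residue coordinates


def unannotated_regions(protein_length: int, domains: Sequence[Interval],
                        min_length: int = 1) -> List[Interval]:
    """Return the stretches of a protein not covered by any known domain."""
    regions = []
    cursor = 0  # first residue not yet covered by a processed domain
    for start, end in sorted(domains):
        if start - cursor >= min_length:
            regions.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping or nested domains
    if protein_length - cursor >= min_length:
        regions.append((cursor, protein_length))
    return regions


def longest_member(cluster: Sequence[str]) -> str:
    """Pick the longest sequence of a cluster as its representative."""
    return max(cluster, key=len)


if __name__ == "__main__":
    # Hypothetical protein of 100 residues with three annotated domains.
    print(unannotated_regions(100, [(10, 30), (25, 50), (80, 90)], min_length=5))
    print(longest_member(["MKV", "MKVLAA", "MK"]))
```

The representative sequence chosen this way would then serve as the query for the PSI-BLAST step.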

Research Example: Results

• 28 new domains identified:

• 15 domains in diverse contexts, in different species.
• 3 domains species specific.
• 7 domains with weak similarity to previously described domains.
• 3 extension domains.

Predictions of Function

• On the basis of reports in literature and/or occurrence with other identified domains, functional features can be predicted for our novel domain families.

• Examples:
– Chromatin binding
– Protein interaction
– Predicted sub-cellular localization

Predictions of Function: Chromatin-binding example

• The novel domain CSZ is contained in protein SPT6, which regulates transcription via chromatin structure modification.

• SPT6 has a histone-binding capability, experimentally confirmed.

• Other domains (S1, SH2) in SPT6 are unlikely to bind histones or chromatin.

• Conclusion: CSZ has a predicted histone binding function.

Predictions of Function: Localization example

• Some of the novel domains are only found within proteins from the initial set of nuclear domains.

• This predicts that these domains have a nuclear function.

• The other domains are likely to have roles in both nucleus and cytoplasm.

Conclusion

• Domains are the functional units of proteins.
• Identifying a domain within a new protein may teach us much about it.

• There are several types of models to represent domains.

• These models can also be used to identify the domain they represent.

• Many Internet databases are available to catalogue and identify families.

• New domains can be identified with a protocol that starts from known ones.

Resources

• Pfam: http://www.sanger.ac.uk/Software/Pfam/

• SMART: http://smart.embl-heidelberg.de/

• PROSITE: http://www.expasy.org/prosite/

• InterPro: http://www.ebi.ac.uk/interpro/

The End
