Proteomics: Analyzing proteins space



Page 1: Proteomics: Analyzing proteins space

Proteomics: Analyzing proteins space

Page 2: Proteomics: Analyzing proteins space

Protein families

Why proteins?

• Shift of interest from “Genomics” to “Proteomics”

Classification of proteins into groups/families - what is it good for?

• Explosion in biological sequence data => need to organize!

• Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.

• For applied research:

– Annotation of new proteins: predicting their function, structure, cellular localization etc.

– Looking for new folds

Page 3: Proteomics: Analyzing proteins space

Sequence-based classification

• By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.

• InterPro – synthesizes the data from Pfam, PROSITE, PRINTS, ProDom, and SMART. Considered the “best” domain-based classification available

Page 4: Proteomics: Analyzing proteins space

Other kinds of classification

• Global classification:

– SYSTERS, ProtoMap, CluSTr

– MetaFam synthesizes global classification data

• By structure similarity: SCOP etc.

• By function: Albumin, RetNet, TumorGenes etc.

Page 5: Proteomics: Analyzing proteins space

ProtoNet

• A long-term project at HUJI led by Michal & Nati Linial.

• Provides automatic global classification of the known proteins.

• Performs hierarchical clustering on a sequence-based metric space of proteins.

• Allows an external protein to be “placed” into the hierarchy.

http://www.protonet.cs.huji.ac.il

Page 6: Proteomics: Analyzing proteins space

Why clustering?

• We want to refine the “similarity” notion, compared to e.g. BLAST

• Exploit transitivity to improve grouping

• Can use a low threshold on similarity:

- uses vast information from low similarities

- allowable because clustering filters noise

Page 7: Proteomics: Analyzing proteins space

Why hierarchical?

[Figure panels: Vertical Perspective, Horizontal Perspective]

Page 8: Proteomics: Analyzing proteins space

ProtoNet: Pre-Computation

• All-against-all gapped BLAST using BLOSUM62

• SwissProt release 40.28 database (114,033 proteins)

• BLAST identified ~$2 \times 10^7$ relations between these proteins with relatively high sequence similarity (E-score of 100 or less); each relation gives a distance $d(p_1, p_2)$

• Don’t want to lose information => very permissive threshold!

• But still far fewer than all ~$6.5 \times 10^9$ possible pairs, which would be infeasible to handle
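The two counts on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the ~6.5×10^9 figure refers to unordered pairs of distinct proteins; the variable names are illustrative only.

```python
# Figures quoted on the slide.
n_proteins = 114_033
n_blast_relations = 2e7  # approximate number of pairs with E-score <= 100

# Unordered pairs of distinct proteins in an all-against-all comparison
# (assumption: this is what the ~6.5e9 figure counts).
n_pairs = n_proteins * (n_proteins - 1) // 2

# Share of all pairs for which BLAST recorded a relation.
fraction_kept = n_blast_relations / n_pairs

print(f"{n_pairs:,} possible pairs")
print(f"{fraction_kept:.2%} of pairs have a recorded relation")
```

Under that assumption only about 0.3% of all pairs carry a recorded distance, which is why the permissive threshold still leaves a sparse, manageable data set.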

Page 9: Proteomics: Analyzing proteins space

Clustering Method

• First, each protein is considered a singleton cluster

Page 10: Proteomics: Analyzing proteins space

Clustering Method

• Next, we iteratively merge pairs of clusters

• At each step we choose to merge the ‘most similar’ pair of clusters.

Page 13: Proteomics: Analyzing proteins space

Clustering Method

• As we progress, the number of singletons drops

Page 14: Proteomics: Analyzing proteins space

Clustering Method

• The clustering process gradually generates a tree of clusters

• Stop whenever we like
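The procedure on the "Clustering Method" slides can be sketched as a naive agglomerative loop. This is an illustration under stated assumptions, not ProtoNet's implementation: the function names, the `stop_at` parameter, the use of the arithmetic mean as the merging score, and the toy distances are all hypothetical.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Arithmetic mean of the distances between members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(items, dist, stop_at=1):
    """Merge the closest pair of clusters until `stop_at` clusters remain.

    Returns the merges in the order they happened, i.e. the cluster tree.
    """
    clusters = [frozenset([item]) for item in items]  # start from singletons
    merges = []
    while len(clusters) > stop_at:
        # The pair with the smallest score is the 'most similar' pair.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 | c2)
        merges.append((c1, c2))
    return merges


# Toy data: hypothetical distances between four made-up items.
toy_dist = {
    frozenset(("a", "b")): 1.0,
    frozenset(("c", "d")): 2.0,
    frozenset(("a", "c")): 8.0,
    frozenset(("a", "d")): 9.0,
    frozenset(("b", "c")): 7.0,
    frozenset(("b", "d")): 8.0,
}
merges = agglomerate(["a", "b", "c", "d"], toy_dist)
for left, right in merges:
    print(sorted(left), "+", sorted(right))
```

Raising `stop_at` corresponds to "stop whenever we like": the loop ends early and the remaining clusters form a forest rather than a single tree. The loop rescans all pairs at every step, which is fine for a toy example but would not scale to a full protein database.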

Page 15: Proteomics: Analyzing proteins space

How to merge?

• At each level, a potential merging score is calculated for each pair of clusters that is a candidate for merging

• At the bottom level (two singletons) it equals $d(p_1, p_2)$

• Higher up, it is designed to reflect the similarity of the clusters.

• It depends on the inter-cluster similarities of pairs of proteins, each from a different cluster.

[Figure: two clusters labelled m and n]

Page 16: Proteomics: Analyzing proteins space

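Potential Merging Score of $(C_1, C_2)$

• Arithmetic Mean

$$\frac{1}{|C_1|\,|C_2|} \sum_{p_1 \in C_1,\; p_2 \in C_2} d(p_1, p_2)$$

≥

• Geometric Mean

$$\left( \prod_{p_1 \in C_1,\; p_2 \in C_2} d(p_1, p_2) \right)^{\frac{1}{|C_1|\,|C_2|}}$$

≥

• Harmonic Mean

$$\frac{|C_1|\,|C_2|}{\sum_{p_1 \in C_1,\; p_2 \in C_2} \frac{1}{d(p_1, p_2)}}$$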
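The three merging scores can be written down directly from their definitions. The sketch below is illustrative only: the function names, the dictionary-based distance table, and the example values are assumptions, not ProtoNet code or data.

```python
import math


def pair_distances(c1, c2, dist):
    """All inter-cluster distances d(p1, p2) with p1 in C1 and p2 in C2."""
    return [dist[frozenset((p1, p2))] for p1 in c1 for p2 in c2]


def arithmetic_score(c1, c2, dist):
    values = pair_distances(c1, c2, dist)
    return sum(values) / len(values)


def geometric_score(c1, c2, dist):
    values = pair_distances(c1, c2, dist)
    # Averaging logarithms avoids underflow in a long product of tiny values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_score(c1, c2, dist):
    values = pair_distances(c1, c2, dist)
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical E-score-like distances between two made-up clusters.
example_dist = {
    frozenset(("p1", "q1")): 1e-30,
    frozenset(("p1", "q2")): 1e-10,
    frozenset(("p2", "q1")): 1e-20,
    frozenset(("p2", "q2")): 1.0,
}
c1, c2 = ["p1", "p2"], ["q1", "q2"]
am = arithmetic_score(c1, c2, example_dist)
gm = geometric_score(c1, c2, example_dist)
hm = harmonic_score(c1, c2, example_dist)
print(am, gm, hm)
```

On these values the arithmetic mean is dominated by the single weakest pair, the harmonic mean by the single strongest pair, and the geometric mean sits between them, which illustrates how the choice of mean changes which pairs drive a merge.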

Page 17: Proteomics: Analyzing proteins space

Missing Data Treatment

• For a very-low-similarity pair (one outside the ~$2 \times 10^7$ recorded relations), the length $d(p_1, p_2)$ is defined as a constant, $\mathrm{const}$, set relative to the largest recorded distance $\max_{p_1, p_2} d(p_1, p_2)$

• In practice, the merging process should stop once these “infinite” lengths carry very large weight in the score between new clusters (the signal is lost)
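One way to picture the missing-data treatment is a distance lookup that falls back to a constant, plus a measure of how many inter-cluster pairs rely on that fallback. Everything here is an assumption made for illustration: the helper names, the toy distances, and in particular the choice of ten times the largest recorded distance as the constant.

```python
def make_distance(known, const):
    """Return d(p1, p2) that falls back to `const` for unrecorded pairs."""
    def d(p1, p2):
        return known.get(frozenset((p1, p2)), const)
    return d


def missing_fraction(c1, c2, known):
    """Fraction of inter-cluster pairs that have no recorded distance."""
    pairs = [frozenset((p1, p2)) for p1 in c1 for p2 in c2]
    return sum(pair not in known for pair in pairs) / len(pairs)


# Hypothetical recorded distances; every other pair counts as missing.
known = {frozenset(("a", "b")): 0.5, frozenset(("a", "c")): 40.0}
FALLBACK = 10 * max(known.values())  # assumed choice of the constant
d = make_distance(known, FALLBACK)

print(d("a", "b"), d("b", "c"))
print(missing_fraction(["a", "b"], ["c"], known))
```

A stopping rule in the spirit of the slide could halt merging once `missing_fraction` for the best candidate pair exceeds some threshold, since by then the score mostly reflects the constant rather than measured similarity.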

Page 18: Proteomics: Analyzing proteins space

Results: ProtoNet top 20


20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level

Page 19: Proteomics: Analyzing proteins space

Problem of result assessment: what is a “good” cluster?

• Contains all proteins in the family, and no proteins from outside the family

• But what is a family? Does any keyword define a family?

• Stable as the merging events occur (long lifetime)?

Page 20: Proteomics: Analyzing proteins space

Problem of result assessment: what is a “good” tree?

• Should we trust the resulting forest?

– Which clustering technique is better? Combined?

– Bootstrap?

• Do the clusters correspond to meaningful families of proteins?

– Validation against InterPro, SCOP etc.

– But we do not want merely to reconstruct them automatically!

• What is the right level/cut to look at the forest?

Page 21: Proteomics: Analyzing proteins space

InterPro Validation

• InterPro annotation allows systematic validation of the generated clustering

• The ‘geometric’ method exhibits high cluster purity

– Corresponds to a low number of false positives (FP)
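The link between purity and false positives can be made concrete with a small calculation. The sketch assumes purity means the share of cluster members that carry a given annotation; the slide does not spell out the exact definition, and the protein identifiers below are made up.

```python
def cluster_purity(cluster, annotated):
    """Share of cluster members that carry the annotation.

    Returns (purity, false_positives), where false positives are
    cluster members that lack the annotation.
    """
    members = set(cluster)
    true_positives = len(members & set(annotated))
    false_positives = len(members) - true_positives
    return true_positives / len(members), false_positives


# Hypothetical cluster and hypothetical set of proteins with one annotation.
cluster = ["P1", "P2", "P3", "P4"]
family = {"P1", "P2", "P3", "P9"}
purity, fp = cluster_purity(cluster, family)
print(purity, fp)
```

Note that purity says nothing about annotated proteins left outside the cluster (here the made-up `P9`); a pure cluster can still miss family members.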

Page 22: Proteomics: Analyzing proteins space

The Domain Problem

• Many proteins are composed of several domains

• The sequence similarity tools used are therefore local in nature:

• The score for comparing two sequences is the edit distance between their most similar subsequences

• This creates a false similarity problem:

Page 23: Proteomics: Analyzing proteins space

The Modular Nature of Proteins

CSKP_HUMAN

DLG3_MOUSE

K6A1_MOUSE

MPP3_HUMAN

Domain legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase

Page 24: Proteomics: Analyzing proteins space

False Transitivity of Local Alignment

CSKP_HUMAN

DLG3_MOUSE

MPP3_HUMAN

K6A1_MOUSE

We ran BLAST using default parameters. E-scores shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42.

All these pairwise similarities have an E-score better than 1e-40.

If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN.
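The effect can be reproduced by treating each significant hit as an edge and grouping proteins through any chain of edges. The sketch below is hedged in one important respect: the transcript does not say which pair each E-score belongs to, so the edge list is an ASSUMED illustration, and the function name is hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes linked by any chain of edges (simple union-find)."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


proteins = ["CSKP_HUMAN", "DLG3_MOUSE", "MPP3_HUMAN", "K6A1_MOUSE"]
# ASSUMED pairing: the slide does not state which pairs the hits connect.
significant_hits = [
    ("K6A1_MOUSE", "CSKP_HUMAN"),
    ("CSKP_HUMAN", "DLG3_MOUSE"),
    ("DLG3_MOUSE", "MPP3_HUMAN"),
]
direct = {frozenset(hit) for hit in significant_hits}
components = connected_components(proteins, significant_hits)

print(components)
print(frozenset(("K6A1_MOUSE", "MPP3_HUMAN")) in direct)
```

With this assumed edge list all four proteins land in one group even though K6A1_MOUSE and MPP3_HUMAN have no direct hit between them, which is the false transitivity the slide warns about.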

Page 25: Proteomics: Analyzing proteins space

Alternative methods

• Different types of clustering

– Non-binary

– Goal-oriented => semi-guided

– Graph theory insights

• Non-clustering ways of exploring the space of proteins

• Why BLAST E-score???

• Enrichment of the metric using structure