Proteomics: Analyzing proteins space


Protein families

Why proteins?

• Shift of interest from “Genomics” to “Proteomics”

Classification of proteins into groups/families - what is it good for?

• Explosion in biological sequence data => need to organize!

• Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.

• For applied research:

– Annotation of new proteins: predicting their function, structure, cellular localization etc.

– Looking for new folds

Sequence-based classification

• By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.

• InterPro: synthesizes the data from Pfam, PROSITE, Prints, ProDom, and SMART. Considered the “best” domain-based classification available.

Other kinds of classification

• Global classification:

– Systers, Protomap, CLUSTr

– MetaFam synthesizes global classification data

• By structure similarity: SCOP etc.

• By function: Albumin, RetNet, TumorGenes etc.

ProtoNet

• A long-term project at HUJI led by Michal & Nati Linial.

• Provides automatic global classification of the known proteins.

• Performs hierarchical clustering on a sequence-based metric space of proteins.

• Allows one to “place” an external protein into the hierarchy.

http://www.protonet.cs.huji.ac.il

Why clustering?

• We want to refine the “similarity” notion, compared to e.g. BLAST

• Exploit transitivity to improve grouping

• Can use a low threshold on similarity:

- uses vast information from low similarities

- allowable because clustering filters noise

Why hierarchical?

Vertical Perspective

Horizontal Perspective

ProtoNet: Pre-Computation

• All-against-all gapped BLAST using BLOSUM62

• SwissProt release 40.28 database (114,033 proteins)

• BLAST identified ~$2 \times 10^7$ relations between these proteins with relatively high sequence similarity (E-score of 100 or less), each recorded as $d(p_1, p_2)$:

– Don’t want to lose information => very permissive!

– But still far fewer than all ~$6.5 \times 10^9$ protein pairs, which would be infeasible to handle
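The ~6.5×10^9 figure matches the number of unordered protein pairs in the database. A quick check, assuming that is how the number was obtained (the variable names are illustrative only):

```python
from math import comb

n_proteins = 114_033               # SwissProt release 40.28, as stated above
all_pairs = comb(n_proteins, 2)    # unordered pairs of distinct proteins
kept_relations = 2e7               # approximate number of BLAST relations kept

print(f"all pairs:     {all_pairs:.2e}")
print(f"fraction kept: {kept_relations / all_pairs:.2%}")
```

Under this assumption only a small fraction of all pairs carries a recorded relation, which is why the remaining pairs need separate treatment.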

Clustering Method

• First, each protein is considered a singleton cluster

• Next, we iteratively merge pairs of clusters

• We choose to merge the ‘most similar’ pair of clusters

• As we progress, the number of singletons drops

• The clustering process gradually generates a tree of clusters

• Stop whenever we like
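The loop described above can be sketched in a few lines. This is a naive illustration, not the ProtoNet implementation: it assumes a complete table of pairwise distances (lower means more similar), uses the arithmetic mean of cross-cluster distances as the merging score, and all names and values are hypothetical.

```python
from itertools import combinations, product

def arithmetic_score(c1, c2, dist):
    """Mean distance over all cross-cluster protein pairs."""
    pairs = list(product(c1, c2))
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

def agglomerate(items, dist, stop_at=1):
    """Merge the lowest-scoring pair of clusters until `stop_at` clusters remain."""
    clusters = [frozenset([x]) for x in items]  # start from singletons
    merges = []                                 # merge history, i.e. the tree
    while len(clusters) > stop_at:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: arithmetic_score(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return clusters, merges

# Toy distances (hypothetical values); every pair is listed.
dist = {frozenset(k): v for k, v in {
    ("p1", "p2"): 1.0, ("p3", "p4"): 2.0,
    ("p1", "p3"): 10.0, ("p1", "p4"): 10.0,
    ("p2", "p3"): 10.0, ("p2", "p4"): 10.0,
}.items()}

clusters, merges = agglomerate(["p1", "p2", "p3", "p4"], dist, stop_at=2)
print(clusters)
```

The `stop_at` parameter plays the role of “stop whenever we like”; re-scoring every pair at every step is far too slow for a real database and is done here only for clarity.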

How to merge?

• The potential merging score is calculated for each pair of clusters relevant for merging at each level

• At the bottom level (two singletons), it equals $d(p_1, p_2)$

• Higher up, it is designed to reflect the similarity of the clusters

• It depends on the inter-cluster similarities of pairs of proteins, each from a different cluster (m × n pairs for clusters of sizes m and n)

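Potential Merging Score of $(C_1, C_2)$

• Arithmetic Mean

$$\frac{1}{|C_1||C_2|} \sum_{(p_1,p_2) \in C_1 \times C_2} d(p_1, p_2)$$

• Geometric Mean

$$\left( \prod_{(p_1,p_2) \in C_1 \times C_2} d(p_1, p_2) \right)^{\frac{1}{|C_1||C_2|}}$$

• Harmonic Mean

$$\left( \frac{1}{|C_1||C_2|} \sum_{(p_1,p_2) \in C_1 \times C_2} d(p_1, p_2)^{-1} \right)^{-1}$$

For any pair of clusters: Arithmetic ≥ Geometric ≥ Harmonic.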
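The three scores can be written directly from their definitions. The sketch below assumes strictly positive distances stored in a dictionary keyed by unordered protein pairs; the function names and toy values are illustrative, not taken from ProtoNet. The geometric mean is computed through logarithms because multiplying many very small values directly would underflow.

```python
from itertools import product
from math import exp, log

def cross_distances(c1, c2, dist):
    """All distances d(p1, p2) with p1 in c1 and p2 in c2."""
    return [dist[frozenset((p1, p2))] for p1, p2 in product(c1, c2)]

def arithmetic_mean_score(c1, c2, dist):
    vals = cross_distances(c1, c2, dist)
    return sum(vals) / len(vals)

def geometric_mean_score(c1, c2, dist):
    vals = cross_distances(c1, c2, dist)
    return exp(sum(log(v) for v in vals) / len(vals))

def harmonic_mean_score(c1, c2, dist):
    vals = cross_distances(c1, c2, dist)
    return len(vals) / sum(1.0 / v for v in vals)

# Toy example (hypothetical values): clusters {a, b} and {x, y}.
dist = {frozenset(k): v for k, v in {
    ("a", "x"): 1.0, ("a", "y"): 4.0, ("b", "x"): 2.0, ("b", "y"): 8.0,
}.items()}
c1, c2 = {"a", "b"}, {"x", "y"}

am = arithmetic_mean_score(c1, c2, dist)
gm = geometric_mean_score(c1, c2, dist)
hm = harmonic_mean_score(c1, c2, dist)
print(am, gm, hm)
```

For two singletons all three functions return the single distance between the two proteins, which is the bottom-level case.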

Missing Data Treatment

• For a very low similarity pair (outside the ~$2 \times 10^7$ recorded relations), its length is defined as

$$d(p_1, p_2) = \mathrm{const} \cdot \max_{p_1', p_2'} d(p_1', p_2')$$

where the maximum is taken over the pairs with a recorded relation.

• In practice, the merging process should stop when the weight of the “infinite” lengths in the calculation of the score between new clusters becomes very large (the signal is lost)
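A minimal sketch of this rule, assuming the recorded relations are stored in a dictionary keyed by unordered protein pairs. The value of the constant is not given in the slides; `const=10.0` below is a placeholder, and the function name is hypothetical.

```python
def make_distance(known, const=10.0):
    """Return d(p, q): the recorded value if present, else const * max recorded value.

    `known` maps frozenset({p, q}) -> distance for pairs with a recorded relation.
    `const` is a placeholder; the slides do not state its value.
    """
    fallback = const * max(known.values())

    def d(p, q):
        return known.get(frozenset((p, q)), fallback)

    return d

# Toy data (hypothetical values): the pair (a, c) has no recorded relation.
known = {frozenset(("a", "b")): 1e-30, frozenset(("b", "c")): 50.0}
d = make_distance(known)
print(d("a", "b"), d("a", "c"))
```

Counting how many fallback values enter a merging score would give one way to detect the point at which the “infinite” lengths dominate.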

Results: ProtoNet top 20


20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level

Problem of result assessment: what is a “good” cluster?

• Contains all proteins in the family, does not contain proteins not in the family

• But what is a family? Does any keyword define a family?

• Stable as the merging events occur (long lifetime)?

Problem of result assessment: what is a “good” tree?

• Should we trust the resulting forest?

– Which clustering technique is better? Combined?

– Bootstrap?

• Do the clusters correspond to meaningful families of proteins?

– Validation against InterPro, SCOP etc.

– But we do not want merely to reconstruct those classifications automatically!

• What is the right level/cut to look at the forest?

InterPro Validation

• InterPro annotation allows systematic validation of the generated clustering

• The ‘geometric’ method exhibits high cluster purity

– Corresponds to few false positives (FP)
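The slides do not define purity. One common reading, assumed here, is the fraction of annotated cluster members that carry the cluster's most frequent annotation. The sketch simplifies to one annotation per protein, and the protein and family labels are made up for illustration.

```python
from collections import Counter

def cluster_purity(cluster, annotation):
    """Return (most common label, its fraction among annotated members).

    Members without an annotation are ignored; returns None if none is annotated.
    """
    labels = [annotation[p] for p in cluster if p in annotation]
    if not labels:
        return None
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / len(labels)

# Toy annotation table (hypothetical protein and family identifiers).
annotation = {"p1": "FAM_A", "p2": "FAM_A", "p3": "FAM_A", "p4": "FAM_B"}
cluster = ["p1", "p2", "p3", "p4", "p5"]   # p5 has no annotation

print(cluster_purity(cluster, annotation))
```

Under this definition, members that do not carry the dominant annotation count as false positives, so high purity corresponds to few false positives.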

The Domain Problem

• Many proteins are composed of several domains

• The sequence similarity tools used are therefore local in nature:

• The score of comparing two sequences is the edit distance between their most similar subsequences

• This creates a false similarity problem:

The Modular Nature of Proteins

[Figure: domain architectures of CSKP_HUMAN, DLG3_MOUSE, K6A1_MOUSE and MPP3_HUMAN. Domain legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase. E-values shown: 8e-78, 2e-47, 9e-41, 1e-42.]

False Transitivity of Local Alignment

[Figure: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE]

We ran BLAST using default parameters:

All these pairwise similarities have an E-score better than 1e-40

If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
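The effect can be reproduced on a toy example. The proteins A, B, C and their domain sets below are hypothetical and are not the four proteins from the slide. Two proteins are treated as “locally similar” if they share at least one domain, and clusters are formed as connected components of that relation, i.e. by assuming transitivity.

```python
def connected_components(nodes, linked):
    """Group nodes that are reachable from each other through `linked` pairs."""
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(m for m in nodes if m not in component and linked(node, m))
        seen |= component
        components.append(component)
    return components

# Hypothetical multi-domain proteins: B shares one domain with A and another with C.
domains = {"A": {"kinase"}, "B": {"kinase", "PDZ"}, "C": {"PDZ"}}

def shares_domain(p, q):
    return bool(domains[p] & domains[q])

components = connected_components(list(domains), shares_domain)
print(components)
print(shares_domain("A", "C"))
```

A and C end up in the same cluster although they have no domain in common, purely because the multi-domain protein B links them.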

Alternative methods

• Different types of clustering

– Non-binary

– Goal-oriented => semi-guided

– Graph theory insights

• Non-clustering ways of exploring the space of proteins

• Why BLAST E-score???

• Enrichment of the metric using structure
