Charting the Protein Space Structural and Functional Genomics


Page 1: Charting the Protein Space Structural and Functional Genomics

Belgium 10/03 Michal Linial 1

Charting the Protein Space Structural and Functional Genomics

Michal Linial The Hebrew University, Jerusalem

Page 2: Charting the Protein Space Structural and Functional Genomics

A link between sequence, structure and function

Sequence

Structure

Function

Extract structural information from sequence alone (the Holy Grail)

Structure is more conserved than sequence

Similar structures tend to have similar functions

Sequences are ‘easy’
Structures are ‘hard’
Functions are yet to be defined

Page 3: Charting the Protein Space Structural and Functional Genomics

The protein space: sequence, structure and function

Sequence space: dense (1,000,000)

Structure space: sparse (2000-20,000)

Function space: ill-defined (????? ; 20,000 by GO)

Page 4: Charting the Protein Space Structural and Functional Genomics

The protein space: static vs dynamic

Protein sequences: 1,000,000 proteins (static)

Protein variants: 10,000,000 proteins (dynamic)

Exon combinations, post-translational modification, protein-protein interaction…

Protein function: ?????

Enzymes, catalytic, signaling, structural proteins, sensors, channels

Intrinsic difficulty in defining function

Page 5: Charting the Protein Space Structural and Functional Genomics

The Challenges of the ‘Proteome’ in the Genomic Era

New genomes ---> Accurate annotation

From sequence ---> Predicting structure

From sequence ---> Inferring function

Proteins in a cellular context (health): Modification, Localization, Interactions, Pathways, Disease

Page 6: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 7: Charting the Protein Space Structural and Functional Genomics

Structural Genomics Initiatives

Motivation

Goal: Cover the entire protein structural space

Modeling methods allow extending structural assignments to an unsolved protein if a solved protein is within a ‘modeling distance’ (>30-35% sequence identity) of it.

Finding a new fold = Adding a new template to the ‘archive’ = Allowing (many!) ‘unsolved’ proteins to be modeled.

Page 8: Charting the Protein Space Structural and Functional Genomics

Structural Genomics Initiatives

Motivation

And as stated by SG policy (1999):

“Maximizing the impact on biology and on biomedical sciences by solving the ‘CORRECT’ pre-selected candidates”

What is the ‘CORRECT’ set of proteins?

How to select those from all possible unsolved proteins?

Page 9: Charting the Protein Space Structural and Functional Genomics

Structural database

Current State

Number of new structures added each year (from the PDB)

Page 10: Charting the Protein Space Structural and Functional Genomics

Structural database

Current State

The fraction of new folds is constantly decreasing

During the last 5 years, only 3-5% (by SCOP definition) of all newly solved structures were new folds (5-10% by CE).

Page 11: Charting the Protein Space Structural and Functional Genomics

The Structural Space: some numbers

Current State

Currently: ~18,000 protein structures, ~45,000 protein domains

Hierarchy in structure   SCOP 1.59 (3/02)   SCOP 1.61 (11/02)   Change
Folds                    690                700                 +10
SF                       1070               1110                +40
Fam                      1830               1940                +110
Domain                   39,900             44,300              +4,400

[Image: Myoglobin]

Page 12: Charting the Protein Space Structural and Functional Genomics

Some numbers: Fold, SF, Fam

Current State

Sequence-based: 130,000 SWP, 900K TrEMBL, total >1M

Structure-based (estimated numbers):
1,000-2,000 folds
3,000-8,000 superfamilies
10,000-20,000 families (25-35% sequence identity)
(But many more ‘unique’ folds/superfamilies?)

Reduction - how??

Page 13: Charting the Protein Space Structural and Functional Genomics

From sequence to structure

Challenge

Sequence-based / Structure-based

Reduction - how??

Problem:
Most structurally similar pairs share <20% aa identity.
Many structurally similar pairs share only a few key aa (5-8%, background).

Most (all) sequence search engines cannot find a ‘significant’ similarity below 35-40% aa identity.

So, can we cross the line from the ‘Twilight Zone’ (20-35% aa identity) to the ‘midnight zone’ (<20% aa identity)?
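The identity ranges quoted on this slide can be made concrete with a small sketch. It assumes two already-aligned sequences of equal length with `-` as the gap character; the function names and the choice to ignore gapped columns are illustrative assumptions, not something the talk specifies.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap in either sequence are ignored.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_zone(pct: float) -> str:
    """Classify a percent identity using the 20% and 35% cut-offs from the slide."""
    if pct < 20.0:
        return "below twilight zone (<20%)"
    if pct <= 35.0:
        return "twilight zone (20-35%)"
    return "above twilight zone (>35%)"
```

For example, `identity_zone(percent_identity("ACDE-G", "ACDF-G"))` evaluates four matches over five gap-free columns (80%) and lands above the twilight zone.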

Page 14: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 15: Charting the Protein Space Structural and Functional Genomics

ProtoClass - Set of automatic classifications of all proteins

Seeking statistically significant regularities (clusters)

Reconstruct the ‘geometry’ of the sequence space

Guiding principle
Homologous proteins evolved from a common ancestor protein

Homology is a transitive relation that can be deduced based on statistical similarities
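Transitivity means that if A is homologous to B and B to C, then A and C belong to the same group even when their direct similarity is weak. A minimal sketch of that idea is a union-find over pairs judged significantly similar. This is only an illustration of transitive grouping, not the ProtoClass algorithm; the function name and input format are assumptions.

```python
def homology_groups(proteins, similar_pairs):
    """Group proteins by the transitive closure of pairwise similarity."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the representative, shortening the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(group) for group in groups.values())
```

With proteins A, B, C, D and significant pairs (A, B) and (B, C), the result is one group of A, B, C and a singleton D, although A and C were never compared directly.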

Page 16: Charting the Protein Space Structural and Functional Genomics

ProtoClass

Global classifications of all proteins

ProtoClass systems generate graphs and maps that yield views at any level of granularity.

ProtoMap: release May 1997

ProtoNet - A (arithmetic): release July 2002
ProtoNet - G (geometric): release July 2002
ProtoNet - H (harmonic): release July 2002

ProtoNet - A50: July 2003
Proto3D - A50: July 2003

Proto3D + ProtoNet - T: October 2003

Page 17: Charting the Protein Space Structural and Functional Genomics

Pre-Computation

• SwissProt release 40.28 database (ProtoNet 2.4)
  – 114,000 SWP proteins
  – 133,000 + 850,000 TrEMBL sequences (ProtoNet 3.0)

• All-against-all similarity scores by gapped BLAST
  – Using BLOSUM62, eliminating low-complexity regions (also other matrices: BLOSUM 50, PAM 250...)

• BLAST identified >13M relations between 114K SWP proteins
  – Every relation with a sequence similarity E-score of 100 (!!!) or less is collected
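Collecting relations up to such a permissive E-score cut-off can be sketched as a filter over BLAST hits. The sketch assumes the standard 12-column tab-separated BLAST output (query, subject, ..., E-value in column 11, bit score in column 12); the talk does not say which output format its pipeline parsed, and the function name is hypothetical.

```python
import csv
import io


def collect_relations(tabular_text, max_evalue=100.0):
    """Keep the best E-value per unordered protein pair, up to max_evalue."""
    relations = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip comment lines and anything that is not a full 12-column hit.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if query == subject or evalue > max_evalue:
            continue  # drop self-hits and hits weaker than the cut-off
        pair = tuple(sorted((query, subject)))
        relations[pair] = min(evalue, relations.get(pair, float("inf")))
    return relations
```

Storing each pair once, with its best E-value, gives a symmetric relation that can serve as the edge weight in a similarity graph.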

Page 18: Charting the Protein Space Structural and Functional Genomics

ProtoClass

ProtoNet main features

Includes all SwissProt proteins (130K) and TrEMBL proteins (850K)

Hierarchical

Graph based

Pairwise distances (all-against-all BLAST search)

Unsupervised and automated

Bottom-up clustering

The clustering algorithm is based on a ‘merging score’
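A bottom-up scheme driven by a merging score can be sketched as follows. The sketch assumes the merging score of two clusters is an average of the pairwise scores between their members (lower is better) and that the averaging rule is selectable, which is one plausible reading of the arithmetic, geometric and harmonic ProtoNet variants. The exact ProtoNet scoring is not given in the slides, so all names and the data layout here are illustrative.

```python
import math
from itertools import combinations


def mean_score(values, kind):
    """Average a list of positive scores with the chosen rule."""
    if kind == "arithmetic":
        return sum(values) / len(values)
    if kind == "geometric":
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if kind == "harmonic":
        return len(values) / sum(1.0 / v for v in values)
    raise ValueError(f"unknown averaging rule: {kind}")


def bottom_up_clusters(items, distance, kind="arithmetic"):
    """Repeatedly merge the two clusters with the lowest merging score.

    distance maps frozenset({x, y}) to a positive score for every pair.
    Returns the merge history as (members_a, members_b, score) tuples.
    """
    clusters = [frozenset([item]) for item in items]
    history = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            scores = [distance[frozenset((x, y))] for x in a for y in b]
            score = mean_score(scores, kind)
            if best is None or score < best[0]:
                best = (score, a, b)
        score, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((sorted(a), sorted(b), score))
    return history
```

The merge history is the hierarchy: cutting it at a chosen score gives the clusters at that level of granularity.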

Page 19: Charting the Protein Space Structural and Functional Genomics

ProtoClass

ProtoNet top 20

20 largest clusters in ProtoNet at a pre-selected horizontal level (7K)

Added hypothetical proteins: 7-15%, 15-20%

Page 20: Charting the Protein Space Structural and Functional Genomics

Towards functional Map

Roadmap of Ig Superfamily

Yona G., Linial N., Linial M. Proteins 37:360-378 (1999)

Edges connect clusters that are neighbors but failed to merge at that LEVEL of the graph

Many pairs of proteins with <<20% aa identity

Page 21: Charting the Protein Space Structural and Functional Genomics

Seeking missing folds

Goal

What is missing:
A rational computational procedure for identifying ‘missing/hidden’ folds/SF

Our approach:
Constructing the protein sequence space as a guideline for the structural fold space
Crossing the twilight zone

Page 22: Charting the Protein Space Structural and Functional Genomics

Bridging Structure & Sequence

Seq-Str map

Hypothesis: Distances in the graph (road-map) are consistent with distances between protein features, including their structure.

Practically: Unsolved clusters that are ‘remote’ (in the road-map) from an already solved structure will have a higher chance of having new folds or new superfamilies.

Page 23: Charting the Protein Space Structural and Functional Genomics

Distance measure via Structural perspective

Seq-Str map

Create Proto3D (all SWP + all PDB domains) (114K + 36K = 150K)

[Figure: protein map with clusters marked ‘In PDB’ and candidate clusters marked ‘Good target?’; side label: Clock - Pair Time]

Page 24: Charting the Protein Space Structural and Functional Genomics

Globins (Example)

Short (~120-160 aa)
Oxygen transport in multicellular organisms
Single domain

Spread in evolution
Early evolutionary duplications
Sequence similarity <15%

SCOP identified 50 (!) different family members (neuronal, plant…)

Page 25: Charting the Protein Space Structural and Functional Genomics

Some biological Road Maps

Seq-Str map

SCOP

Fold: Globin-like

SF: A. Globin-like; B. Helical ferredoxin

Fam A: 1. Globin (50); 2.; 3. Neural globin (1); 4.

Fam B: 1.; 2.

All 850 proteins are globin related
All belong to one SF (SCOP)

Page 26: Charting the Protein Space Structural and Functional Genomics

Mapping SCOP structure on the Sequence-based clusters

ProtoNet: Sasson et al. (2003) Nucl. Acids Res. 31

SCOP: Murzin A. G. et al. (1995) J. Mol. Biol. 247, 536-540

Currently ~2000 fam

A very good correspondence between clusters and SCOP families

Page 27: Charting the Protein Space Structural and Functional Genomics

Seeking new folds

Our approach:

We developed a navigation procedure that measures ‘distances’ among protein clusters in the graph from the point of view of proteins that were already solved (X-ray, NMR)

Structural information is embedded in the roadmap of ProtoClass (e.g., globins)

Page 28: Charting the Protein Space Structural and Functional Genomics

Computational Approach for Target Selection

Adding Structures to the map

ProtoNet (at a selected level): ~10,000 clusters; 2000 clusters with >15 proteins each

SCOP 1.50 (2000): ~10,500 PDB structures, 24,000 domains (redundant)

Each structural domain is mapped to its protein (and its cluster). ‘Occupied’ clusters are those with at least one solved structural domain.

Page 29: Charting the Protein Space Structural and Functional Genomics

Mapping ‘Structures’ on the Protein Graph

Databases used
Structural: all PDB entries
Sequence: ProtoClass (i.e., ProtoMap, ProtoNet)

Only ~1800 clusters are ‘occupied’. They account for ~50% of all proteins in the protein map.

Page 30: Charting the Protein Space Structural and Functional Genomics

A distance measure in the graph: vacant surrounding volumes

A distance measure in the graph (VSV): the vacant-surrounding-volume of a cluster is the number of clusters before encountering an occupied cluster

[Figure: example around cluster A with occupied clusters marked; Steps = 3, VSV = 11]

Clusters are associated with VSV (if at least one structure is in the local map)

All clusters are sorted according to their VSV.
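The slide defines VSV only informally, so the following is one possible reading rather than the ProTarget implementation: walk outward from a cluster layer by layer in the cluster graph, count the vacant clusters in every complete layer, and stop at the first layer that contains an occupied cluster. The graph representation (adjacency lists) and all names are assumptions.

```python
def vacant_surrounding_volume(graph, occupied, start):
    """Return (steps, vsv) for `start`, or None if no occupied cluster is reachable.

    graph: dict mapping each cluster to a list of neighbouring clusters.
    occupied: set of clusters containing at least one solved structure.
    """
    if start in occupied:
        return 0, 0
    seen = {start}
    frontier = [start]
    steps = 0
    vsv = 0
    while frontier:
        steps += 1
        next_layer = []
        for cluster in frontier:
            for neighbour in graph.get(cluster, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_layer.append(neighbour)
        if any(cluster in occupied for cluster in next_layer):
            return steps, vsv
        vsv += len(next_layer)  # the whole layer is vacant
        frontier = next_layer
    return None
```

Sorting clusters by the second element of the returned pair, largest first, gives a ranking in the spirit of the prioritized list; clusters with no reachable structure return `None`, matching the condition that a VSV is assigned only when a structure is in the local map.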

Page 31: Charting the Protein Space Structural and Functional Genomics


Prioritized Target List

Higher VSV, higher chance for NEW SUPERFAMILY ?

Page 32: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 33: Charting the Protein Space Structural and Functional Genomics

Testing the predictive power of the VSV navigation method

VSV & NEW SUPERFAMILY? The membranous protein test

[Figure: VSV of all clusters compared with membranous clusters]

Most clusters with membranous proteins have much higher VSV. This is in accord with the fact that only a very small number of membranous proteins have been solved (50 out of 20,000).

Page 34: Charting the Protein Space Structural and Functional Genomics

Validation against new data

As base set: SCOP 1.37 (~12,000 records): ~800 families, ~570 superfamilies, ~410 folds

As test set: SCOP 1.50 (~23,800 records): ~1300 families, ~820 superfamilies, ~550 folds

Page 35: Charting the Protein Space Structural and Functional Genomics

Validation against new data

Test the prediction by the VSV method (BASE set) against the actual assignment of new SF in recent data (TEST set).

BASE SET: ~570 superfamilies

TEST SET: ~820 superfamilies

250 additional new SF

The Base Set and the Test Set have no overlap

Page 36: Charting the Protein Space Structural and Functional Genomics

Testing the predictive power of the VSV navigation method

VSV & NEW SUPERFAMILY? Statistical test

Prediction is based on 13,000 domains (1999)

Test is based on 11,000 newly added domains (2001)

[Bar chart: counts (axis 0-800) for VSV=all, VSV1 and VSV2, including a ‘New SF’ series; bar values shown: 673, 123, 128, 78, 46, 42]

Page 37: Charting the Protein Space Structural and Functional Genomics

VSV according to set of SCOP 1.37 to 1.50

[Stacked bar chart: percentage (0-100%) of NEW vs. Known superfamilies as a function of VSV]

Our hypothesis is confirmed: the higher the VSV, the higher the chance that a protein belongs to a new SF
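The comparison behind this kind of chart can be sketched as a per-VSV tally: for each VSV value predicted from the base set, compute the fraction of proteins that the test set later assigned to a new superfamily. The record format and function name are assumptions, and any numbers fed to it here would be toy values, not the results shown in the talk.

```python
from collections import defaultdict


def new_sf_fraction_by_vsv(records):
    """records: iterable of (vsv, is_new_superfamily) pairs.

    Returns {vsv: fraction of records at that VSV that were new}, sorted by VSV.
    """
    totals = defaultdict(int)
    new_counts = defaultdict(int)
    for vsv, is_new in records:
        totals[vsv] += 1
        new_counts[vsv] += int(is_new)
    return {vsv: new_counts[vsv] / totals[vsv] for vsv in sorted(totals)}
```

A fraction that rises with VSV is what the hypothesis predicts; a flat profile would argue against it.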

Page 38: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 39: Charting the Protein Space Structural and Functional Genomics

ProTarget - a web site that assigns a ‘SCORE’ to proteins according to their probability of belonging to a new superfamily (or fold)

Back from prediction to the experimentalists

We suggest a ranked list that is ‘BEST’ for SG projects. The user may select any subset.

Page 40: Charting the Protein Space Structural and Functional Genomics


ProTarget

Page 41: Charting the Protein Space Structural and Functional Genomics


Page 42: Charting the Protein Space Structural and Functional Genomics

Development in ProTarget

Dynamic view - Proteins that have been solved affect the map and, of course, the VSV ranking.

Can I find the group of proteins whose impact, once ‘solved’, is maximal (affecting the ranking of at least X proteins)?

Other features

Including domain composition in the VSV ranking method (coming)

Page 43: Charting the Protein Space Structural and Functional Genomics

When a new structure is solved (or about to be solved), a new map is created with new VSV values, and prioritization is done automatically

Using the dynamic option, redundancy in solving similar structures is reduced

Page 44: Charting the Protein Space Structural and Functional Genomics

Using ProTarget dynamically

Page 45: Charting the Protein Space Structural and Functional Genomics


To the experimentalist (SG center) Structural Genomics Projects

Targets

Cloning

Expression

Solubility

Crystallization

Page 46: Charting the Protein Space Structural and Functional Genomics

Considerations in solving structures

Practical
• Quantity - sources
• Folding properties
• Expression system
• Intrinsic stability
• Bad history, membranous
• Glory, money and fame
• …

Biological
• Novel biological activity?
• Selectivity? Specificity?
• Ligand / drug binding?
• Disease related?
• Drug design relevance?
• …

Computational

Page 47: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 48: Charting the Protein Space Structural and Functional Genomics


Disease

Evolution

Genes, regulation

The Subway, Tube, Underground, Metro, U-Bahn

Protein

Page 49: Charting the Protein Space Structural and Functional Genomics

Sequence and Function relationship

Taking one example: Enzymes
well characterized
functionality is defined
conserved
essential, testable
tree-like classification

Page 50: Charting the Protein Space Structural and Functional Genomics

Relatively easy ‘function’

ENZYMES

DB - Enzyme, WIT, KEGG, etc.

Page 51: Charting the Protein Space Structural and Functional Genomics

Structurally based alignments of structurally and functionally characterized sequences

[Figure: aligned sequence pairs at decreasing sequence identity (90%, 45%, 20%), drawn from Human, Chick, E. coli, B. ster. and Yeast, with the function of each sequence]

Function labels in the figure:
Same exact function: 5.3.1.1 (TP Isomerase), 5.3.1.1 (TP Isomerase)
Both class 5 (isom.): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isom.)
Different classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase)

Page 52: Charting the Protein Space Structural and Functional Genomics

Relationship of Similarity in Sequence to that in Function

[Plot: x-axis %ID, sequence similarity of pairs of proteins; y-axis % Same Function, percentage of pairs that have the same precise (enzymatic) function. Source: M. Gerstein, Yale]

Page 53: Charting the Protein Space Structural and Functional Genomics

Relationship of Similarity in Sequence to that in Function

[Plot: % Same Function against %ID. Source: M. Gerstein, Yale]

Page 54: Charting the Protein Space Structural and Functional Genomics

Relationship of Similarity in Sequence to that in Function

[Plot: % Same Function against %ID. Source: M. Gerstein, Yale]

Region marked on the plot:
Can transfer both Fold & Functional Annotation

Page 55: Charting the Protein Space Structural and Functional Genomics

Relationship of Similarity in Sequence to that in Function

[Plot: % Same Function against %ID. Source: M. Gerstein, Yale]

Regions marked on the plot:
Cannot transfer Fold or Functional Annotation ("Twilight Zone")
Can transfer annotation related to Fold but not Function
Can transfer both Fold & Functional Annotation

Page 56: Charting the Protein Space Structural and Functional Genomics

Limitations in functional transfer between sequences

Functional knowledge via annotations of protein sets

Page 57: Charting the Protein Space Structural and Functional Genomics

A few words on

Large Scale experiments

Protein Annotation

Page 58: Charting the Protein Space Structural and Functional Genomics

“omics”: genomics and proteomics

• Main idea: use high throughput as a means of tackling biological complexity.

DIGE 2D gel, DNA microarray, SELDI-TOF spectrum

Page 59: Charting the Protein Space Structural and Functional Genomics


“omic” research

• Experimental Stage: data collection

• Computational Stage: statistical analysis

• Result: “graveyards” of genes/proteins

CD44 HSP CAT ERP2 RPL1 ENO

SODa TRD PMS DUF ACT GLU

Page 60: Charting the Protein Space Structural and Functional Genomics

A protein graveyard

CRZ1 HMO1 POL1 SNU13 RPC10 SFL1 SNU13 RPC10
BEM2 NHP6A DPB3 PRP19 RPB8 GAL4 PRP19 RPB8
SYF1 EPL1 POL2 KEM1 RPB10 MIG1 KEM1 RPB10
CDC13 SCC4 RFA2 SEH1 RPO26 HSF1 SEH1 RPO26
SHE3 RSC9 RFA3 NPL6 STB4 MOT3 NPL6 STB4
NCE4 ISW1 RFA1 HOT1 TOA2 STE12 HOT1 TOA2
- ISW2 RFC3 DAL82 TOA1 NUT1 DAL82 TOA1
ECM5 TRA1 RFC2 ACE2 SUA7 BDF1 ACE2 SUA7
EAF3 - RFC4 BUR6/NCB1 TAF1 UME6 BUR6/NCB1 TAF1
HFI1 IOC3 TOP1 NCB2 TAF9 MMS4 NCB2 TAF9
MSI1 - TOF2 SSU72 TAF10 ABF2 SSU72 TAF10
CAC2 CST6 RNH35 KIN28 TAF11 GAT1 KIN28 TAF11
RSC1 or 2 HOF1 FOB1 MOT1 TAF3 RTG1 MOT1 TAF3
RSC6 ACT1 SSA3 RPB9 TAF6 SKN7 RPB9 TAF6
RSC8 ARP4 SSA2 RPB2 TAF12 TAF4/MPT1 RPB2 TAF12
STH1 ARP9 GLC7 RPB7 TAF7 SIR2 RPB7 TAF7
SFH1 ARP8 TDH1, 2, 3 RPO21 TAF5 MSN2 RPO21 TAF5
RSC2 ARP7 PDC2 RPB4 SPT15 MET31 RPB4 SPT15
CHD1 APN1 HPR5 RPB3 TAF2 HAC1 RPB3 TAF2
SMC3 PHR1 - MED8 TAF8 SSL2 MED8 TAF8
IRR1 NTG2 RIM1 SRB8 TFA2 RAD3 SRB8 TFA2
SWI3 MSH6 MGM101 MED2 TFA1 UBP8 MED2 TFA1
SNF12 MSH2 CBF5 SRB7 TFG1 CCT4 SRB7 TFG1
SNF2 RAD26 CBF2 SSN2 TAF14/TFG3 RPL10 SSN2 TAF14/TFG3
SWI1 RPH1 CHL1 SRB4 TFG2 RPP0 SRB4 TFG2
GCN5 MUS81 CDC14 FHL1 TFB4 RPL11A or B FHL1 TFB4
SPT7 MEC1 SMC1 SRB5 TFB3 RPL12A or B SRB5 TFB3
NGG1 RAD52 SMC2 SRB2 TFB2 RPL15B SRB2 TFB2
SPT3 RAD59 SGS1 MED6 TFB1 RPL19A or B MED6 TFB1
ADA2 MSH3 YCS4 RGR1 SSL1 RPL1A or B RGR1 SSL1
YNG2 RAD7 MCD1 MED11 CCL1 RPL25 MED11 CCL1
SPT8 RAD4 SCC2 SIN4 REB1 RPL3 SIN4 REB1
SPT20 RAD14 CFT1 CSE2/MED9 FHL1 RPL30 CSE2/MED9 FHL1
ESA1 RAD23 YSH1 GAL11 SUB1 RPL34A or B GAL11 SUB1
RPD3 DPB4 REF2 MED7 GIS1 RPL35A or B MED7 GIS1
HTB2 or 1 RFC5 NAM7 MED4 ARO80 RPL4A or B MED4 ARO80
RSC58 RFC1 PAP1 MED1 FKH2 RPL8A or B MED1 FKH2
IOC4 TOP2 PRP43 RPB11 - RPS0A or B RPB11 -
ITC1 TOP3 PRP9 ROX3 IXR1 RPS11A or B ROX3 IXR1
NHP10 MIP1 PRP46 SRB6 SGV1 RPS1A SRB6 SGV1

Page 61: Charting the Protein Space Structural and Functional Genomics

Biological analysis of protein sets

• Biological interpretation requires intimate knowledge of the proteins and is time-consuming.
• Usually only a few proteins are examined.
• How can we interpret the results efficiently?
• How can we understand the results of automatic classification (e.g., ProtoNet clusters)?
• Solution: analysis of protein annotations.

Page 62: Charting the Protein Space Structural and Functional Genomics


Protein annotations

• Annotation (keyword): a binary property of a protein, from a “library” of properties.

• Cover various biological aspects: function, structure, taxonomy, localization, biological pathway…

• Annotations come from different sources.

• Growth in annotation amount and variety.

• Libraries of annotations allow computational analysis.

Page 63: Charting the Protein Space Structural and Functional Genomics

Gene Ontology (GO)

GO provides controlled annotations of:

1. Molecular function

2. Biological process

3. Cellular component

The annotations are part of a hierarchical graph, in which each GO term has a parent or parents, and might have child terms.
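Because a term may have several parents, the hierarchy is a directed acyclic graph rather than a tree, and a protein annotated with a term implicitly carries all of that term's ancestors. A minimal sketch of collecting those ancestors follows; the child-to-parents dictionary and the toy term names in the usage note are illustrative assumptions, not real GO identifiers.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links.

    parents: dict mapping a term to a list of its direct parent terms.
    """
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

For a toy term with two parents, say `{"toy term": ["catalytic activity", "binding"]}` where both parents point to `"molecular function"`, the function returns all three ancestor terms exactly once.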

Page 64: Charting the Protein Space Structural and Functional Genomics

Annotation types

Some are non-informative words: complete genome, disease…

Some are partially annotated: EC x.y.

Growing very fast
Still, many terms are inconsistent

Page 65: Charting the Protein Space Structural and Functional Genomics

Protein annotations

9% of SWP proteins have more than 20 annotations per protein (not including Taxonomy)

Page 66: Charting the Protein Space Structural and Functional Genomics

Computational analysis – naïve

100 proteins:

annotation   amount
membrane     60
enzyme       40

Summation: a naïve method for protein set analysis.

Something is missing…

Page 67: Charting the Protein Space Structural and Functional Genomics

Intersection and inclusion

60 membrane, 40 enzyme

[Figure: diagrams showing different possible relations between the ‘enzyme’ and ‘membrane’ protein sets]
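The same two counts (60 membrane, 40 enzyme out of 100 proteins) are compatible with very different situations, which is what plain summation misses. A small sketch of telling them apart by comparing the protein sets behind two keywords is shown here; the function name and the four category labels are illustrative choices.

```python
def keyword_relation(a: set, b: set) -> str:
    """Describe how the protein sets of two keywords relate."""
    if not (a & b):
        return "disjoint"
    if a == b:
        return "identical"
    if a <= b or b <= a:
        return "inclusion"
    return "intersection"
```

With `membrane = set(range(60))`, an enzyme set of `set(range(60, 100))` is disjoint from it, `set(range(40))` is included in it, and `set(range(40, 80))` merely intersects it, although all three enzyme sets have exactly 40 members.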

Page 68: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 69: Charting the Protein Space Structural and Functional Genomics

• PANDORA: a web-based tool aimed at biological analysis of protein sets.

• Biological information is shown through intersection and inclusion.

• Goal: provide a “biological roadmap” of the protein set.

Page 70: Charting the Protein Space Structural and Functional Genomics

Method

Protein-keyword matrix (1 = the protein carries the keyword):

     enzyme  cytoplasm  hydrolase  transcription  nucleus  kinase
P1   1       0          0          1              1        0
P2   1       1          0          1              1        0
P3   1       1          1          0              0        0
P4   1       1          1          0              0        1
P5   1       1          1          0              0        0
P6   1       1          1          0              0        1

Resulting graph nodes (keywords: proteins):
enzyme: P1 P2 P3 P4 P5 P6 (=6)
cytoplasm: P2 P3 P4 P5 P6 (=5)
hydrolase: P3 P4 P5 P6 (=4)
kinase: P4 P6 (=2)
transcription, nucleus: P1 P2 (=2)
cytoplasm, transcription, nucleus: P2
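The step from the binary protein-keyword matrix to graph nodes can be sketched with the six proteins and six keywords shown on this slide. The sketch assumes the bit order is enzyme, cytoplasm, hydrolase, transcription, nucleus, kinase, that keywords with identical protein sets share one node, and that an edge means proper set inclusion. It illustrates the idea only and is not PANDORA's code; all function names are hypothetical.

```python
KEYWORDS = ["enzyme", "cytoplasm", "hydrolase", "transcription", "nucleus", "kinase"]
MATRIX = {
    "P1": "100110", "P2": "110110", "P3": "111000",
    "P4": "111001", "P5": "111000", "P6": "111001",
}


def keyword_sets(matrix, keywords):
    """Map each keyword to the set of proteins that carry it."""
    sets = {keyword: set() for keyword in keywords}
    for protein, bits in matrix.items():
        for keyword, bit in zip(keywords, bits):
            if bit == "1":
                sets[keyword].add(protein)
    return sets


def group_nodes(sets):
    """Keywords with identical protein sets are placed in the same node."""
    nodes = {}
    for keyword, proteins in sets.items():
        nodes.setdefault(frozenset(proteins), []).append(keyword)
    return nodes


def inclusion_edges(nodes):
    """(parent, child) keyword tuples where the child set is a proper subset."""
    edges = []
    for parent in nodes:
        for child in nodes:
            if child < parent:
                edges.append((tuple(nodes[parent]), tuple(nodes[child])))
    return edges
```

On this data, transcription and nucleus collapse into one node (P1, P2), and that node sits under enzyme but not under cytoplasm, since P1 is not cytoplasmic.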

Page 71: Charting the Protein Space Structural and Functional Genomics

[Figure: a large protein-keyword binary matrix, shown as a wall of 0s and 1s]

Page 72: Charting the Protein Space Structural and Functional Genomics

Graph complexity

• 20 keywords: >1,000,000 nodes

• This worst case doesn’t occur for large K values in the protein-keyword world.

• Still, highly complex graphs do occur.

Theoretical complexity: $\sum_{n=1}^{K} \binom{K}{n} = 2^{K} - 1$
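The figure of more than a million nodes for 20 keywords can be checked directly, assuming the worst case is one node for every non-empty combination of the K keywords. The function name is illustrative.

```python
from math import comb


def max_nodes(k: int) -> int:
    """Number of non-empty keyword combinations among k keywords."""
    return sum(comb(k, n) for n in range(1, k + 1))
```

`max_nodes(20)` gives 1,048,575, which equals 2**20 - 1 and is indeed above 1,000,000.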

Page 73: Charting the Protein Space Structural and Functional Genomics

Belgium, 10/0373

Re• A user-controlled threshold trading graph accuracy for simplicity.

• Represents the maximal level of error allowed, in proteins.

40

1022

8

40

22

10

35

15 15

14

35

16

Resolution = 2 proteins solution
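One way such a threshold could work is to merge nodes whose protein sets differ by no more than the allowed number of proteins. The slides do not describe the actual simplification rule, so the greedy, symmetric-difference criterion below is purely an assumption made for illustration, as are the names.

```python
def simplify_nodes(nodes, resolution):
    """Greedily merge nodes whose protein sets differ by <= resolution proteins.

    nodes: dict mapping a node label to its set of proteins.
    Returns a list of (labels, proteins) pairs, largest sets first.
    """
    merged = []
    for label, proteins in sorted(nodes.items(), key=lambda kv: -len(kv[1])):
        for labels, representative in merged:
            # Symmetric difference counts proteins present in only one set.
            if len(representative ^ proteins) <= resolution:
                labels.append(label)
                break
        else:
            merged.append(([label], set(proteins)))
    return merged
```

At resolution 0 only identical sets are merged; raising the value trades exactness for a smaller graph, which matches the accuracy-for-simplicity trade-off described above.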

Page 74: Charting the Protein Space Structural and Functional Genomics

Outline

Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget

Functional Genomics
• What for - the challenge
• How - integration & methodology
• Tests - examples
• In practice - PANDORA

Page 75: Charting the Protein Space Structural and Functional Genomics

Biological examples

• Take the set of all 576 proteins annotated by ‘GO molecular function’ as ‘anion channel’.

• View through InterPro keywords (sequence signatures that describe each ‘family/domain’).

Page 76: Charting the Protein Space Structural and Functional Genomics

[Figure: PANDORA graph of the BASIC SET viewed through InterPro keywords (zoom); each node shows its number of proteins]

Node labels:
GABA A receptor
Neurotransmitter-gated ion channel
Nicotinic acetylcholine receptor
Voltage-gated chloride channel
Intracellular chloride channel
H+ transporting ATPase
Eukaryotic porin

Sensitivity: TP/(TP+FN)

red = FN, white = TP
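The sensitivity formula on this slide is simple enough to state as code. The function name and the numbers in the usage note are illustrative, not values from the talk.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("sensitivity is undefined when TP + FN is zero")
    return true_positives / total
```

For instance, a keyword that captures 45 proteins of a family and misses 5 has a sensitivity of 0.9.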

Page 77: Charting the Protein Space Structural and Functional Genomics


InterPro

alpha subunit

beta subunit

gamma subunit

GABA A receptor

Page 78: Charting the Protein Space Structural and Functional Genomics

Taxonomy

Eukaryota

chordata

drosophila

C. elegans

human

chicken

mammalia

rodentia

Page 79: Charting the Protein Space Structural and Functional Genomics

Resolution: 0 proteins

Page 80: Charting the Protein Space Structural and Functional Genomics


Resolution: 2

Page 81: Charting the Protein Space Structural and Functional Genomics


Resolution: 8

Page 82: Charting the Protein Space Structural and Functional Genomics


Resolution: 30

Page 83: Charting the Protein Space Structural and Functional Genomics


Biological examples - experimental

• Comparative proteomic experiment: E. coli response to benzoic acid (Yan et al, 2002).

• A set of 51 proteins are down-regulated by a factor of 1.3 or more.

• Benzoic acid is known to inhibit E. coli growth (Lambert et al, 1997).

• Could we guess this without examining individual proteins?

Page 84: Charting the Protein Space Structural and Functional Genomics

Transport

Metabolism

Biosynthesis

Cell growth and/or maintenance

Amino acid biosynthesis

Vitamin biosynthesis

Protein biosynthesis

Nitrogen metabolism

Coenzyme biosynthesis

Lipid metabolism

Phosphate metabolism

Carbohydrate metabolism

GO biological process

Page 85: Charting the Protein Space Structural and Functional Genomics

Conclusion

• PANDORA offers:
  – Interactive comprehensible graph display.
  – Full protein-keyword intersection and inclusion relations.
  – User-controlled data simplification.
  – Integration of >6 annotation sources.
  – Detection of false annotations (new).
  – Quantitative view (new).

Page 86: Charting the Protein Space Structural and Functional Genomics


www.pandora.cs.huji.ac.il

We come in peace…

Page 87: Charting the Protein Space Structural and Functional Genomics

Summary

• ProtoNet - Validated clustering and functional maps: www.protonet.cs.huji.ac.il

• ProtoView - Navigation tools

• ProTarget - Ranked list of proteins for structural genomics: www.protarget.cs.huji.ac.il

• PANDORA - Interactive comprehensible graph of any protein set: www.pandora.cs.huji.ac.il

From Sequence to Structure to Function

Page 88: Charting the Protein Space Structural and Functional Genomics

From Sequence to Structure to Function

Sequence map

Structural map

Functional map

Page 89: Charting the Protein Space Structural and Functional Genomics

Thank you

Ilona Kifer: ProTarget & cross-validation
Ori Sasson: Proto3D
Elon Portugaly: Domain based classification - EVEREST
Ori Shachar: ProtoNet & FSSP
Avishy Vaaknin: InterPro Validation
Noam Kaplan: PANDORA and Validation
Nati Linial: ProtoNet Algorithm

The rest of ProtoNet Team (Hillel, Uri, Moriah, Alex, Yonatan, Hagit, Menachem)