Charting the Protein Space Structural and Functional Genomics
Belgium 10/03 Michal Linial 1
Charting the Protein Space Structural and Functional Genomics
Michal Linial The Hebrew University, Jerusalem
A link between sequence, structure and function
Sequence, Structure, Function
Extract structural information from sequence alone (The Holy Grail)
Structure is more conserved than sequence
Similar structures tend to have similar function
Sequences are ‘easy’
Structures are ‘hard’
Functions are to be defined
The protein space: sequence, structure and function
Sequence space: dense, 1,000,000
Structure space: sparse, 2000-20,000
Function space: ill-defined, ????? (20,000 by GO)
The protein space: static vs dynamic
Protein Sequences: 1,000,000 pr (static)
Protein Variants: 10,000,000 pr (dynamic)
Exon combinations, post-translational modification, p-p interaction…
Protein Function: ?????
Enzymes, catalytic, signaling, structural proteins, sensors, channels
Intrinsic difficulty in defining function
The Challenges of the ‘Proteome’ in the Genomic era
New genomes ---> Accurate annotation
From sequence ---> Predicting structure
From sequence ---> Infer function
Proteins in a cellular context (health):
Modification
Localization
Interactions
Pathways
Disease
Outline
Structural Genomics
• What for - the challenge
• How - classification & methodology
• Tests - validation scheme
• In practice - ProTarget
Functional Genomics
• What for - the challenge
• How - Integration & methodology
• Tests - examples
• In practice - PANDORA
Structural Genomics Initiatives
Motivation
Goal: Cover the entire protein structural space
Modeling methods allow extending structural assignments to an unsolved protein if a solved protein is within a ‘modeling distance’ (>30-35% sequence identity) of it.
Finding a new Fold = Adding a new template to the ‘archive’ = Allowing (many!) ‘unsolved’ proteins to be modeled.
Structural Genomics Initiatives
Motivation
And as stated by SG policy (1999):
“Maximizing the impact on biology and on biomedical sciences by solving the ‘CORRECT’ pre-selected candidates”
What is the ‘CORRECT’ set of proteins?
How to select those from all possible unsolved proteins?
Structural database
Current state
Number of new structures added each year (from the PDB)
Structural database
Current state
The fraction of new folds is constantly decreasing
During the last 5 years, only 3-5% (by SCOP definition) of all newly solved structures were new folds (5-10% by CE).
The Structural Space: some numbers
Current state
Myoglobin
Currently: ~18,000 protein structures, ~45,000 protein domains

Hierarchy in structure   SCOP 1.59, 3/02   SCOP 1.61, 11/02   Change
Folds                    690               700                +10
SF                       1070              1110               +40
Fam                      1830              1940               +110
Domain                   39,900            44,300             +4,400
Some numbers: Fold, SF, Fam
Current state
Sequence-based: 130,000 SWP, 900K TrEMBL, total: >1M
Structure-based, estimated numbers:
1,000-2,000 folds
3,000-8,000 superfamilies
10,000-20,000 families (25-35% sequence identity)
(But many more ‘unique’ folds/superfamilies?)
Reduction (sequence-based to structure-based) - how??
From sequence to structure
Challenge
Reduction (sequence-based to structure-based) - how??
Problem: Most structurally similar pairs share <20% aa identity.
Many structurally similar pairs share only a few key aa (5-8%, background).
Most (all) sequence search engines cannot find a ‘significant’ similarity below 35-40% aa identity.
So, can we cross the line from the ‘Twilight Zone’ (20-35% aa identity) into the ‘midline zone’ (<20% aa identity)?
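The identity ranges quoted on this slide can be made concrete with a small helper. This is only a sketch: the function names are hypothetical, the 20% and 35% cut-offs are taken from the ranges on the slide, and percent identity is computed here over aligned columns in which neither sequence has a gap (other conventions exist).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_zone(pid: float) -> str:
    """Assign a percent identity to one of the ranges named on the slide (assumed cut-offs)."""
    if pid > 35.0:
        return "above twilight zone"
    if pid >= 20.0:
        return "twilight zone"
    return "below twilight zone"


# Toy alignment: 5 gap-free columns, 4 of them identical.
pid = percent_identity("ACDE-F", "ACDQWF")
zone = identity_zone(pid)
```

With real data the aligned strings would come from an alignment program; the toy strings above only exercise the arithmetic.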
ProtoClass - Set of automatic classifications of all proteins
Seeking statistically significant regularities (clusters)
Reconstruct the ‘geometry’ of the sequence space
Guiding principle:
Homologous proteins evolved from a common ancestor protein.
Homology is a transitive relation that can be deduced based on statistical similarities.
ProtoClass
Global classifications of all proteins
ProtoClass systems generate graphs and maps that yield views at any level of granularity.
ProtoMap: release May 1997
ProtoNet - A (arithmetic): release July 2002
ProtoNet - G (geometric): release July 2002
ProtoNet - H (harmonic): release July 2002
ProtoNet - A50: July 2003
Proto3D - A50: July 2003
Proto3D + ProtoNet - T: October 2003
Pre-Computation
• SwissProt release 40.28 database (ProtoNet 2.4)
– 114,000 SWP proteins
– 133,000 + 850,000 TrEMBL sequences (ProtoNet 3.0)
• All-against-all similarity scores by gapped BLAST
– Using BLOSUM62, eliminating low-complexity regions (also other matrices: BLOSUM 50, PAM 250…)
• BLAST identified >13M relations between 114K SWP proteins
– every sequence similarity with an E-score of 100 (!!!) or less is collected
ProtoClass
ProtoNet main features
Includes all SwissProt proteins (130K) and TrEMBL proteins (850K)
Hierarchical
Graph based
Pairwise distances (all-against-all BLAST search)
Unsupervised and automated
Bottom-up clustering
The clustering algorithm is based on a ‘merging score’
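A minimal sketch of what bottom-up clustering with a ‘merging score’ could look like, assuming the score of two clusters is a mean (arithmetic, geometric or harmonic, echoing the ProtoNet A/G/H names) of the pairwise E-scores between their members, and that the pair with the lowest score merges first. The function names, the default E-score for unrecorded pairs and the naive search are illustrative assumptions, not the ProtoNet implementation.

```python
import math
from itertools import combinations


def mean(values, kind):
    """Arithmetic, geometric or harmonic mean of strictly positive values."""
    if kind == "arithmetic":
        return sum(values) / len(values)
    if kind == "geometric":
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if kind == "harmonic":
        return len(values) / sum(1.0 / v for v in values)
    raise ValueError(f"unknown mean: {kind}")


def merging_score(c1, c2, escore, kind, default=100.0):
    """Mean pairwise E-score between two clusters; unrecorded pairs get `default` (assumption)."""
    values = [escore.get(frozenset((a, b)), default) for a in c1 for b in c2]
    return mean(values, kind)


def bottom_up_clustering(proteins, escore, kind="arithmetic"):
    """Repeatedly merge the two clusters with the lowest merging score; return the merge history."""
    clusters = [frozenset([p]) for p in proteins]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: merging_score(pair[0], pair[1], escore, kind),
        )
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append((c1, c2, merged))
    return merges


# Toy E-scores (must be > 0 for the geometric and harmonic means).
toy_escore = {
    frozenset(("A", "B")): 1e-50,
    frozenset(("A", "C")): 1e-3,
    frozenset(("B", "C")): 1e-2,
}
history = bottom_up_clustering(["A", "B", "C"], toy_escore, kind="geometric")
```

The merge history is the hierarchy: cutting it at a given number of merges gives one ‘horizontal level’ of clusters. The exhaustive pair search is far too slow for a million proteins and is used here only for clarity.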
ProtoClass
ProtoNet top 20
The 20 largest clusters in ProtoNet at a pre-selected horizontal level (7K)
Added hypothetical proteins: 7-15%, 15-20%
Towards functional Map
Roadmap of Ig Superfamily
Yona G., Linial N., Linial M. Proteins 37:360-378 (1999)
Edges connect clusters that are neighbors but failed to merge at that LEVEL of the graph
Many pairs of proteins with <<20% aa identity
Goal
Seeking missing folds
What is missing:
A rational computational procedure for identifying ‘missing/hidden’ folds/SF
Our approach:
Constructing the protein sequence space as a guideline for the structural fold space
Crossing the twilight zone
Bridging Structure & Sequence
Seq-Str map
Hypothesis: Distances in the graph (road-map) are consistent with distances between protein features, including their structure.
Practically: Unsolved clusters that are ‘remote’ (in the road-map) from an already solved structure will have a higher chance of having new folds or new superfamilies.
Distance measure via structural perspective
Seq-Str map
[Figure: clusters on the map marked ‘In PDB’ or ‘Good target?’]
Create Proto3D (all SWP + all PDB domains) (114K + 36K = 150K)
Globins: example
Seq-Str map
Short (~120-160 aa)
Oxygen transport in multi-cell organisms
Single domain
Spread in evolution
Early evolutionary duplications
Sequence similarity <15%
SCOP identified 50 (!) different family members (neuronal, plant…)
Some biological road maps
SCOP
Fold: Globin-like
SF: A. Globin-like B. helical ferredoxin
Fam A: 1. Globin (50) 2. 3. Neural globin (1) 4.
Fam B: 1. 2.
All 850 proteins are globin related
All belong to one SF (SCOP)
Mapping SCOP structure on the sequence-based clusters
ProtoNet: Sasson et al. (2003) Nucl. Acids Res. 31
SCOP: Murzin A. G. et al. (1995) J. Mol. Biol. 247, 536-540
Currently ~2000 fam
A very good correspondence between clusters and SCOP families
Seeking new folds
Our approach:
Structural information is embedded in the roadmap of ProtoClass (e.g., globins)
We developed a navigating procedure that measures ‘distances’ among protein clusters in the graph from the point of view of proteins that were already solved (X-ray, NMR)
Computational Approach for Target Selection
Adding Structures to the map
ProtoNet (at a selected level): ~10,000 clusters; 2000 clusters with > 15 proteins each
SCOP 1.50 (2000): ~10,500 PDB structures, 24,000 domains (redundant)
Each structural domain is mapped to its protein (and its cluster). ‘Occupied’ clusters are those with at least one solved structural domain.
Mapping ‘Structures’ on the Protein Graph
Databases used
Structural: all PDB entries
Sequence: ProtoClass (i.e., ProtoMap, ProtoNet)
[Figure: protein graph with ‘occupied’ clusters marked]
Only ~1800 clusters are ‘occupied’. They account for ~50% of all proteins in the protein map.
A distance measure in the graph: vacant surrounding volumes
A distance measure in the graph (VSV): the vacant-surrounding-volume of a cluster is the number of clusters encountered before reaching an occupied cluster
[Figure: cluster A and nearby ‘occupied’ clusters; Steps = 3, VSV = 11]
Clusters are associated with a VSV (if at least one structure is in the local map)
All clusters are sorted according to their VSV.
Prioritized Target List
Higher VSV, higher chance for NEW SUPERFAMILY ?
Testing the predictive power of the VSV navigation method
VSV & NEW SUPERFAMILY?
The membranous protein test
[Figure: comparison of all clusters and clusters with membranous proteins]
Most clusters with membranous proteins have a much higher VSV. This is in accord with the fact that only a very small number of membranous proteins have been solved (50 out of 20,000).
Validation against new data
Base set: SCOP 1.37 (~12,000 records): ~800 families, ~570 superfamilies, ~410 folds
Test set: SCOP 1.50 (~23,800 records): ~1300 families, ~820 superfamilies, ~550 folds
Validation against new data
Test the prediction by the VSV method (BASE set) with the actual assignment of new SF in recent data (TEST set).
BASE SET ~570 superfamilies
TEST SET ~820 superfamilies
250 additional new SF
The Base Set and the Test Set have no overlap
Testing the predictive power of the VSV navigation method
VSV & NEW SUPERFAMILY?
Statistical test
Prediction is based on 13,000 domains (1999)
Test is based on 11,000 newly added domains (2001)
[Figure: bar chart over VSV=all, VSV1 and VSV2 with a ‘New SF’ series; y-axis 0-800; bar values 673, 123, 128, 78, 46, 42]
VSV according to set of SCOP 1.37 to 1.50
[Figure: stacked bars showing the fraction (0-100%) of NEW vs Known superfamilies for VSV values 3, 4, 5, 6, 7, 8, 10]
Our hypothesis is confirmed: the higher the VSV, the higher the chance that a protein belongs to a new SF
ProTarget - a web site that assigns a ‘SCORE’ to proteins according to their probability of belonging to a new superfamily (or fold)
Back from Prediction to the experimentalists
We suggest a ranked list that is ‘BEST’ for SG projects. The user may select any subset
ProTarget
Development in ProTarget
Dynamic view - Proteins that have been solved affect the map and, of course, the VSV ranking.
Can I find the group of proteins whose impact, once ‘solved’, is maximal (affecting the ranking of at least X proteins)?
Other features
Including domain composition in the VSV ranking method (coming)
When a new structure is solved (or about to be solved), a new map is created with new VSVs and prioritization is done automatically
Using the dynamic option, redundancy in solving similar structures is reduced
Using ProTarget dynamically
To the experimentalist (SG center) Structural Genomics Projects
Targets
Cloning
Expression
Solubility
Crystallization
Considerations in solving structures
Biological
• Novel biological activity?
• Selectivity? Specificity?
• Ligand / drug binding?
• Disease related?
• Drug design relevance?
• …
Practical
• Quantity - sources
• Folding properties
• Expression system
• Intrinsic stability
• Bad history, membranous
• Glory, money and fame
• …
Computational
Disease
Evolution
Genes, regulation
The Subway, Tube, Underground, Metro, U-Bahn
Protein
Sequence and Function relationship
Taking one example: Enzymes
well characterized
functionality is defined
conserved
essential, testable
tree-like classification
Relatively easy ‘function’
ENZYMES
DB -Enzyme, WIT, KEGG etc
Structurally based alignments of structurally and functionally characterized sequences
[Figure: example enzyme pairs at decreasing sequence identity (90%, 45%, 20%) from Human, Chick, E. coli, B. ster. and Yeast, grouped by how well their functions agree]
Same exact function: 5.3.1.1 (TP Isomerase)
Both class 5 (isomerases): 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.15 (Xylose Isomerase)
Different classes: 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase)
Relationship of Similarity in Sequence to that in Function
[Figure: percentage of pairs that have the same precise (enzymatic) function (% Same Function, 0-100) plotted against sequence similarity of pairs of proteins (%ID, 0-70). Source: M. Gerstein, Yale]
Relationship of Similarity in Sequence to that in Function
[Figure: % Same Function (0-100) plotted against %ID (0-70), divided into three zones. Source: M. Gerstein, Yale]
Cannot transfer fold or functional annotation ("Twilight Zone")
Can transfer annotation related to fold but not function
Can transfer both fold and functional annotation
Limitations in functional transfer between sequences
Functional knowledge via annotations of protein sets
A few words on
Large Scale experiments
Protein Annotation
“omics”: genomics and proteomics
• Main idea: use high throughput as a means of tackling biological complexity.
DIGE 2d gel DNA microarray SELDI-TOF spectrum
“omic” research
• Experimental Stage: data collection
• Computational Stage: statistical analysis
• Result: “graveyards” of genes/proteins
CD44 HSP CAT ERP2 RPL1 ENO
SODa TRD PMS DUF ACT GLU
A protein graveyardCRZ1 HMO1 POL1 SNU13 RPC10 SFL1 SNU13 RPC10BEM2 NHP6A DPB3 PRP19 RPB8 GAL4 PRP19 RPB8SYF1 EPL1 POL2 KEM1 RPB10 MIG1 KEM1 RPB10CDC13 SCC4 RFA2 SEH1 RPO26 HSF1 SEH1 RPO26SHE3 RSC9 RFA3 NPL6 STB4 MOT3 NPL6 STB4NCE4 ISW1 RFA1 HOT1 TOA2 STE12 HOT1 TOA2- ISW2 RFC3 DAL82 TOA1 NUT1 DAL82 TOA1ECM5 TRA1 RFC2 ACE2 SUA7 BDF1 ACE2 SUA7EAF3 - RFC4 BUR6/NCB1TAF1 UME6 BUR6/NCB1TAF1HFI1 IOC3 TOP1 NCB2 TAF9 MMS4 NCB2 TAF9MSI1 - TOF2 SSU72 TAF10 ABF2 SSU72 TAF10CAC2 CST6 RNH35 KIN28 TAF11 GAT1 KIN28 TAF11RSC1 or 2 HOF1 FOB1 MOT1 TAF3 RTG1 MOT1 TAF3RSC6 ACT1 SSA3 RPB9 TAF6 SKN7 RPB9 TAF6RSC8 ARP4 SSA2 RPB2 TAF12 TAF4/MPT1RPB2 TAF12STH1 ARP9 GLC7 RPB7 TAF7 SIR2 RPB7 TAF7SFH1 ARP8 TDH1, 2, 3RPO21 TAF5 MSN2 RPO21 TAF5RSC2 ARP7 PDC2 RPB4 SPT15 MET31 RPB4 SPT15CHD1 APN1 HPR5 RPB3 TAF2 HAC1 RPB3 TAF2SMC3 PHR1 - MED8 TAF8 SSL2 MED8 TAF8IRR1 NTG2 RIM1 SRB8 TFA2 RAD3 SRB8 TFA2SWI3 MSH6 MGM101 MED2 TFA1 UBP8 MED2 TFA1SNF12 MSH2 CBF5 SRB7 TFG1 CCT4 SRB7 TFG1SNF2 RAD26 CBF2 SSN2 TAF14/TFG3RPL10 SSN2 TAF14/TFG3SWI1 RPH1 CHL1 SRB4 TFG2 RPP0 SRB4 TFG2GCN5 MUS81 CDC14 FHL1 TFB4 RPL11A or BFHL1 TFB4SPT7 MEC1 SMC1 SRB5 TFB3 RPL12A or BSRB5 TFB3NGG1 RAD52 SMC2 SRB2 TFB2 RPL15B SRB2 TFB2SPT3 RAD59 SGS1 MED6 TFB1 RPL19A or BMED6 TFB1ADA2 MSH3 YCS4 RGR1 SSL1 RPL1A or BRGR1 SSL1YNG2 RAD7 MCD1 MED11 CCL1 RPL25 MED11 CCL1SPT8 RAD4 SCC2 SIN4 REB1 RPL3 SIN4 REB1SPT20 RAD14 CFT1 CSE2/MED9FHL1 RPL30 CSE2/MED9FHL1ESA1 RAD23 YSH1 GAL11 SUB1 RPL34A or BGAL11 SUB1RPD3 DPB4 REF2 MED7 GIS1 RPL35A or BMED7 GIS1HTB2 or 1 RFC5 NAM7 MED4 ARO80 RPL4A or BMED4 ARO80RSC58 RFC1 PAP1 MED1 FKH2 RPL8A or BMED1 FKH2IOC4 TOP2 PRP43 RPB11 - RPS0A or BRPB11 -ITC1 TOP3 PRP9 ROX3 IXR1 RPS11A or BROX3 IXR1NHP10 MIP1 PRP46 SRB6 SGV1 RPS1A SRB6 SGV1
Biological analysis of protein sets
• Biological interpretation requires intimate knowledge of the proteins and is time-consuming.
• Usually only a few proteins are examined.
• How can we interpret the results efficiently?
• How can we understand the results of automatic classification (e.g., ProtoNet clusters)?
• Solution: analysis of protein annotations.
Protein annotations
• Annotation (keyword): a binary property of a protein, from a “library” of properties.
• Cover various biological aspects: function, structure, taxonomy, localization, biological pathway…
• Annotations come from different sources.
• Growth in annotation amount and variety.
• Libraries of annotations allow computational analysis.
Gene Ontology (GO)
GO provides controlled annotations of:
1. Molecular function
2. Biological process
3. Cellular component
The annotations are part of a hierarchical graph, in which each GO term has a parent or parents, and might have child terms.
Annotation types
Some are non-informative words - ‘complete genome’, ‘Disease’…
Some are partially annotated: EC x.y.
Growing very fast
Still, many terms are inconsistent
Protein annotations
9% of SWP proteins have more than 20 annotations per protein (not including Taxonomy)
Computational analysis – naïve
Summation: a naïve method for protein set analysis.
100 proteins:

annotation   amount
membrane     60
enzyme       40

Something is missing…
Intersection and inclusion
60 membrane, 40 enzyme
[Figure: diagrams of the possible overlaps between the ‘enzyme’ and ‘membrane’ sets]
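A short sketch of why counts alone are not enough: the same totals (60 membrane, 40 enzyme out of 100 proteins) are compatible with very different set relations. The protein identifiers are just integers and the `relation` helper is a hypothetical name used for illustration.

```python
def relation(a: set, b: set) -> str:
    """Describe how two protein sets relate: disjoint, identical, inclusion or partial overlap."""
    if not a & b:
        return "disjoint"
    if a == b:
        return "identical"
    if a < b:
        return "a included in b"
    if b < a:
        return "b included in a"
    return "partial intersection"


# 100 proteins numbered 0..99; 60 of them carry the keyword 'membrane'.
membrane = set(range(60))

# Three hypothetical 'enzyme' sets, all of size 40, that summation cannot tell apart.
enzyme_disjoint = set(range(60, 100))   # no membrane enzymes at all
enzyme_included = set(range(40))        # every enzyme is a membrane protein
enzyme_partial = set(range(40, 80))     # 20 enzymes are membrane proteins

summary = {
    "disjoint case": relation(enzyme_disjoint, membrane),
    "inclusion case": relation(enzyme_included, membrane),
    "partial case": relation(enzyme_partial, membrane),
}
```

All three cases give the same table of counts, yet they describe biologically different protein sets.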
• A web-based tool aimed at biological analysis of protein sets.
• Biological information is shown through intersection and inclusion.
• Goal: provide a “biological roadmap” of the protein set.
Method

Protein   enzyme   cytoplasm   hydrolase   transcription   nucleus   kinase
P1        1        0           0           1               1         0
P2        1        1           0           1               1         0
P3        1        1           1           0               0         0
P4        1        1           1           0               0         1
P5        1        1           1           0               0         0
P6        1        1           1           0               0         1
Total     =6       =5          =4          =2              =2        =2

Graph nodes:
enzyme: P1 P2 P3 P4 P5 P6
cytoplasm: P2 P3 P4 P5 P6
hydrolase: P3 P4 P5 P6
kinase: P4 P6
nucleus, transcription: P1 P2
cytoplasm, transcription, nucleus: P2
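The six-protein example on this slide can be reproduced with a few lines of code. This is a sketch under assumptions, not PANDORA's actual algorithm: nodes are taken to be the keyword protein sets plus all their non-empty intersections, and an edge is drawn from a set to each largest set strictly contained in it. All function names are hypothetical.

```python
from itertools import combinations

KEYWORDS = ["enzyme", "cytoplasm", "hydrolase", "transcription", "nucleus", "kinase"]
MATRIX = {
    "P1": "100110", "P2": "110110", "P3": "111000",
    "P4": "111001", "P5": "111000", "P6": "111001",
}


def keyword_sets(matrix, keywords):
    """Map each keyword to the frozenset of proteins annotated with it."""
    result = {}
    for index, keyword in enumerate(keywords):
        members = frozenset(p for p, bits in matrix.items() if bits[index] == "1")
        if members:
            result[keyword] = members
    return result


def intersection_closure(protein_sets):
    """Return the given sets plus every non-empty intersection obtainable from them."""
    nodes = set(protein_sets)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(nodes), 2):
            common = a & b
            if common and common not in nodes:
                nodes.add(common)
                changed = True
    return nodes


def inclusion_edges(nodes):
    """Edges (parent, child) where child is strictly inside parent with no node in between."""
    edges = set()
    for child in nodes:
        parents = [n for n in nodes if child < n]
        for parent in parents:
            if not any(mid < parent for mid in parents):
                edges.add((parent, child))
    return edges


sets_by_keyword = keyword_sets(MATRIX, KEYWORDS)
nodes = intersection_closure(sets_by_keyword.values())
edges = inclusion_edges(nodes)
```

On this input the closure yields six nodes, the same protein groups listed on the slide; keywords with identical protein sets (transcription and nucleus) share one node.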
[Figure: a large binary protein-keyword matrix]
Graph complexity
Theoretical complexity: $\sum_{n=1}^{K} \binom{K}{n} = 2^{K} - 1$
• 20 keywords: >1,000,000 nodes
• This worst case doesn’t occur for large K values in the protein-keyword world.
• Still, highly complex graphs do occur.
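The ">1,000,000 nodes" figure for 20 keywords is consistent with counting one potential node per non-empty combination of K keywords. A quick check, assuming that reading of the worst case (the function name is hypothetical):

```python
import math


def max_nodes(k: int) -> int:
    """Number of non-empty keyword combinations for k keywords: sum over n of C(k, n)."""
    return sum(math.comb(k, n) for n in range(1, k + 1))


worst_case_20 = max_nodes(20)  # equals 2**20 - 1
```

Real protein sets stay far below this bound because most keyword combinations never occur together on any protein.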
Resolution
• A user-controlled threshold trading graph accuracy for simplicity.
• Represents the maximal level of error allowed, in proteins.
[Figure: example graph before and after simplification at Resolution = 2 proteins; node sizes shown include 40, 35, 22, 16, 15, 14, 10 and 8]
Biological examples
• Take the set of all 576 proteins annotated by ‘GO molecular function’ as ‘anion channel’.
• View through InterPro keywords (sequence signatures that describe each ‘family/domain’).
[Figure: graph of the anion channel set viewed through InterPro keywords, with a zoomed region; each node shows its number of proteins. Node labels: BASIC SET; Neurotransmitter-gated ion channel; GABA A receptor; Nicotinic acetylcholine receptor; Voltage-gated chloride channel; Intracellular chloride channel; H+ transporting ATPase; Eukaryotic porin]
Sensitivity: TP/(TP+FN)
red = FN, white = TP
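The sensitivity formula on this slide, written out for protein sets. The interpretation is an assumption: `reference` stands for the proteins that truly belong to a group and `retrieved` for the proteins carrying the keyword, so true positives are in both and false negatives are reference proteins the keyword misses. The names are hypothetical.

```python
def sensitivity(reference: set, retrieved: set) -> float:
    """Sensitivity = TP / (TP + FN) for a retrieved set measured against a reference set."""
    true_positives = len(reference & retrieved)
    false_negatives = len(reference - retrieved)
    if true_positives + false_negatives == 0:
        raise ValueError("reference set is empty")
    return true_positives / (true_positives + false_negatives)


# Toy example: the keyword finds 3 of the 4 reference proteins, plus one unrelated protein.
toy_reference = {"P1", "P2", "P3", "P4"}
toy_retrieved = {"P1", "P2", "P3", "P9"}
toy_sensitivity = sensitivity(toy_reference, toy_retrieved)
```

The unrelated protein `P9` does not affect sensitivity; it would only show up in a precision-style measure.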
InterPro
alpha subunit
beta subunit
gamma subunit
GABA A receptor
Taxonomy
Eukaryota
chordata
Drosophila
C. elegans
human
chicken
mammalia
rodentia
Resolution: 0 proteins
Resolution: 2
Resolution: 8
Resolution: 30
Biological examples - experimental
• Comparative proteomic experiment: E. coli response to benzoic acid (Yan et al, 2002).
• A set of 51 proteins are down-regulated by a factor of 1.3 or more.
• Benzoic acid is known to inhibit E. coli growth (Lambert et al, 1997).
• Could we guess this without examining individual proteins?
GO biological process
Transport
Metabolism
Biosynthesis
Cell growth and/or maintenance
Amino acid biosynthesis
Vitamin biosynthesis
Protein biosynthesis
Nitrogen metabolism
Coenzyme biosynthesis
Lipid metabolism
Phosphate metabolism
Carbohydrate metabolism
Conclusion
• PANDORA offers:
– Interactive comprehensible graph display.
– Full protein-keyword intersection and inclusion relations.
– User-controlled data simplification.
– Integration of >6 annotation sources.
– Detection of false annotations (new).
– Quantitative view (new)
www.pandora.cs.huji.ac.il
We come in peace…
Summary
• ProtoNet - Validated clustering and functional maps www.protonet.cs.huji.ac.il
• ProtoView - Navigation tools
• ProTarget - Ranked list of proteins for structural genomics www.protarget.cs.huji.ac.il
• PANDORA - Interactive comprehensible graph of any protein set www.pandora.cs.huji.ac.il
From Sequence to Structure To Function
Sequence map
Structural map
Functional map
Thank you
Ilona Kifer: ProTarget & cross-validation
Ori Sasson: Proto3D
Elon Portugaly: Domain based classification - EVEREST
Ori Shachar: ProtoNet & FSSP
Avishy Vaaknin: InterPro Validation
Noam Kaplan: PANDORA and Validation
Nati Linial: ProtoNet Algorithm
The rest of ProtoNet Team (Hillel, Uri, Moriah, Alex, Yonatan, Hagit, Menachem)