Protein structure – introduction
“Bioinformatics: genes, proteins and computers”, Orengo, Jones and Thornton (2003).
Secondary structure elements
α-helix
β-strand, β-sheet
Tertiary structure = protein fold
complete 3-dimensional structure
why is it interesting? isn’t the sequence enough?
a key to understanding protein function
structure-based drug design
detection of distant evolutionary relationships
the structure is more conserved!
Fold classification
classification: clustering proteins into structural families
motivation?
profound analysis of evolutionary mechanisms
constraints on secondary structure packing?
classification at domain level
CATH – Protein Structure Classification
hierarchical classification of protein domain structures in the
Brookhaven Protein Data Bank (PDB).
domains are clustered at four major levels:
Class
Architecture
Topology
Homologous superfamily
Sequence family
CATH – hierarchical classification
Class: secondary structure content (mainly α, mainly β, α–β, low 2nd structure content).
Architecture: gross orientation of secondary structures, independent of connectivity.
Topology (= fold): clusters structures according to their topological connections.
CATH – architectures
CATH – architectures (cont.)
CATH – hierarchical classification
Homologous superfamily: homologous domains identified by sequence similarity and
structure similarity
Sequence family: domains clustered in the same sequence families, with
sequence identity > 35%
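CATH writes a domain's position in the hierarchy as a dotted numeric code (class.architecture.topology.superfamily). The sketch below is only an illustration of how such codes can be compared level by level; the function names and the example codes are hypothetical choices made for this note, not part of the slides.

```python
# Level names in the order they appear in a dotted CATH code.
LEVELS = ("class", "architecture", "topology", "homologous superfamily")


def cath_levels(code):
    """Split a dotted code such as '3.40.50.300' into a tuple of integers."""
    return tuple(int(part) for part in code.split("."))


def deepest_shared_level(code_a, code_b):
    """Name of the deepest level at which both codes still agree, or None."""
    shared = None
    for name, a, b in zip(LEVELS, cath_levels(code_a), cath_levels(code_b)):
        if a != b:
            break
        shared = name
    return shared


def same_topology(code_a, code_b):
    """True when two domains share class, architecture and topology (= fold)."""
    return cath_levels(code_a)[:3] == cath_levels(code_b)[:3]


# Illustrative codes only: same fold, different superfamily.
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
print(same_topology("3.40.50.300", "3.40.50.720"))
```

Comparing only the first three fields is what "same fold" means in the evaluation slides that follow, where a pair counts as correct when both domains carry the same CATH topology.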
other classification schemes: SCOP, FSSP
partial disagreement between them.
Growing demand for protein structures!
PDB contains 20,868 structures
X-Ray and NMR have limitations.
WE NEED
FASTER METHODS!
GenBank contains 24,027,936 sequences!
Protein Structure Prediction
I) Ab initio = ‘from the beginning’
- Simulation (physics)
- search for conformation with lowest energy
- Knowledge-based (i.e. statistics)
protein sequence: RGYSLGNWVC KVFGRCELAA
AMKRHGLDNY AAKFESNFNT
QATNRNTDGS TDYGILQINS
RWWCNDGRTP GSRNLCNIPC
SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV
Limited to very short peptides!
Can known structures assist prediction?
the number of possible folds seems to be limited!
CATH inspection: more than 36,000 domains, but...
only ~800 topology groups
Total of "new folds" (light blue) and "old folds" (orange) for a given year
PDB inspection:
a ‘new’ protein has
a good chance to be
of a known structure!
Template-based prediction (fold recognition)
II) Comparative modeling (homology modeling)
- alignment with homologous sequence of known structure.
- high sequence identity areas: similar structure
- variable areas: must be built
- can’t be used if no sequence similarity is found!
III) Threading
- alignment with structure sequences in fold library
- sophisticated scoring function finds most similar fold
- ‘Threading’ aligns target sequence onto template structure
“What are the baselines for protein fold recognition?”
McGuffin, Bryson and Jones (2001)
Goals:
1. what constitutes a baseline level of success for protein
fold recognition methods, above random guesswork?
2. can simple methods that make use of 2nd structure
information assign folds more reliably?
3. how valuable might these methods be in the rapid
construction of a useful hierarchical classification?
The methods evaluated (ordered by complexity and runtime)
1. Absolute difference in length
2. Absolute difference in number of secondary structure elements
3. Simple alignment of secondary structure elements
4. Alignment of secondary structure elements (Przytycka et al., 1999)
5. Alignment of secondary structure elements without additional scoring
6. Alignment of secondary structure elements using DSSP as secondary structure assignment
7. Alignment of secondary structure elements with gap penalty
8. Alignment of secondary structure elements with gap penalty for long elements
9. Alignment of secondary structure elements with absolute difference in length as scoring scheme
10. Alignment of full length secondary structure strings
11. Alignment of primary sequence
shorten 2nd structure strings: CCCHHHHCCCEEECCHHHCCC → HCECH
pairwise alignment
scoring function also considers length of elements
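To make the element-based methods concrete, here is a minimal sketch of shortening a per-residue secondary structure string into elements and aligning two element lists globally. Dropping the terminal coil is an assumption made so that the slide's example (CCCHHHHCCCEEECCHHHCCC giving HCECH) is reproduced. The length-ratio score and the gap penalty are also assumptions chosen for illustration; the slides do not give the actual scoring scheme.

```python
from itertools import groupby


def ss_elements(ss, strip_terminal_coil=True):
    """Collapse a per-residue string into a list of (state, length) elements."""
    elements = [(state, len(list(run))) for state, run in groupby(ss)]
    if strip_terminal_coil:
        # Assumption: coil at either end is dropped, as in the slide's example.
        if elements and elements[0][0] == "C":
            elements = elements[1:]
        if elements and elements[-1][0] == "C":
            elements = elements[:-1]
    return elements


def shorten(ss):
    """One letter per secondary structure element."""
    return "".join(state for state, _ in ss_elements(ss))


def element_score(a, b):
    """Assumed score: length ratio for matching states, penalty otherwise."""
    if a[0] != b[0]:
        return -1.0
    return min(a[1], b[1]) / max(a[1], b[1])


def align_score(elems_a, elems_b, gap=-0.5):
    """Needleman-Wunsch global alignment score over two element lists."""
    m = len(elems_b)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(elems_a) + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + element_score(elems_a[i - 1], elems_b[j - 1]),
                prev[j] + gap,      # gap in elems_b
                curr[j - 1] + gap,  # gap in elems_a
            )
        prev = curr
    return prev[m]


example = "CCCHHHHCCCEEECCHHHCCC"
print(shorten(example))
print(align_score(ss_elements(example), ss_elements("HHHHCCEEEE")))
```

Working on elements rather than residues keeps the alignment matrix small, which is why these methods sit low in the complexity and runtime ordering.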
A representative set of protein domains
a set of 1087 domains representing different
“Sequence Families” was selected from CATH.
generate an informative file for each domain:
1. >1atx00
2. GAAaLbKSDGPNTRGNSMSGTIWVFGcPSGWNNbEGRAIIGYacKQ
3. EEE TTS S TTSSEEEEEESS TT EEE SSSSSEEEE
4. CEEEEEHHECEEEECCCECEEEECCCEECCEECEEECCEECEEEEC
First evaluation: true positive percentage
compare the true positive percentage of each method at a fixed 3% false positive rate.
run each method on all possible pairs from the 1087 set
(a,b) (a,c) (a,d) ... (g,d) (g,e) ... (k,f) ... (r,s) .... ~590,000 pairs
sort each score list by descending similarity score.
for each list: go from the top downward, and compare each assignment to CATH.
example: let's assume there are 100 structurally similar pairs and 100 dissimilar pairs.
pair    score   CATH comparison         true counter   false counter
(a,d)   0.99    CATH (a) = CATH (d)     1              0
(g,e)   0.98    CATH (g) != CATH (e)    1              1
(r,s)   0.87    CATH (r) = CATH (s)     2              1
|
(a,b)   0.63    CATH (a) != CATH (b)    2              2
(k,f)   0.45    CATH (k) != CATH (f)    2              3
(g,d)   0.37    (not reached)
STOP! 3% false positives reached (3 of the 100 dissimilar pairs).
true positive for this method = 2% (2 of the 100 similar pairs).
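The counting procedure in this example can be sketched in a few lines. The function name and its parameters are hypothetical. It takes the pairs already sorted by descending score, each reduced to a flag saying whether CATH puts the two domains in the same fold, plus the total numbers of similar and dissimilar pairs.

```python
def true_positive_percentage(ranked_same_fold, total_similar, total_dissimilar,
                             false_positive_pct=3.0):
    """Walk a ranked list top-down and stop once the allowed share of
    dissimilar pairs has been seen; return the share of similar pairs found.

    ranked_same_fold: booleans, best-scoring pair first; True means the
    two domains of the pair share a CATH fold.
    """
    false_limit = false_positive_pct * total_dissimilar / 100.0
    true_count = 0
    false_count = 0
    for same_fold in ranked_same_fold:
        if same_fold:
            true_count += 1
        else:
            false_count += 1
        if false_count >= false_limit:
            break  # fixed false positive level reached
    return 100.0 * true_count / total_similar


# The slide's toy list: (a,d) (g,e) (r,s) (a,b) (k,f) (g,d)
ranked = [True, False, True, False, False, False]
print(true_positive_percentage(ranked, total_similar=100, total_dissimilar=100))
```

With 100 dissimilar pairs the walk stops at the third false pair, (k,f), after two true pairs have been counted, which gives the 2% of the example.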
We need lower and upper controls to compare with
lower control: intelligent guesswork
1. randomly assign CATH topology codes according to frequency
2. calculate true positive, false positive percentage
upper control: automated recognition (given the 3D structure)
1. FSSP, SCOP and CATH databases were screened for all
dissimilar domains that exist in all three of them.
2. FSSP gave similarity scores to all possible pairs.
3. FSSP assignments compared against CATH, and against SCOP.
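The lower control can be pictured as drawing a topology code for each domain at random, weighted by how often each code occurs in the set. The sketch below assumes that reading; the function names, the seed, and the example codes are illustrative only. For such a guess the chance of hitting the right code is the sum of the squared code frequencies, which the second function computes.

```python
import random
from collections import Counter


def random_proportional_assignment(true_topologies, seed=0):
    """Guess one topology per domain, drawn in proportion to code frequency."""
    rng = random.Random(seed)
    counts = Counter(true_topologies)
    codes = sorted(counts)
    weights = [counts[code] for code in codes]
    return rng.choices(codes, weights=weights, k=len(true_topologies))


def expected_accuracy(true_topologies):
    """Probability that a frequency-weighted guess is right: sum of f_i squared."""
    n = len(true_topologies)
    return sum((count / n) ** 2 for count in Counter(true_topologies).values())


# Illustrative topology codes: three domains of one fold, one of another.
topologies = ["3.40.50", "3.40.50", "3.40.50", "1.10.490"]
guesses = random_proportional_assignment(topologies)
print(guesses)
print(expected_accuracy(topologies))
```

Because common folds are guessed more often, this baseline does better than uniform guessing, which is why the slide calls it intelligent guesswork.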
Optimisation of similarity scoring methods: “Class pre-filter”
each domain was assigned a class according to 2nd structure:
percentage of residues constituting α-helices / β-strands
domain “1cgt03”
80% of AA in β-strand
10% of AA in α-helix
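A class assignment of this kind can be sketched from the helix and strand percentages alone. The cut-offs below (`minor_max`, `low_total`) are hypothetical values picked for illustration; the slides do not state the thresholds that were actually used.

```python
def assign_class(ss, minor_max=0.15, low_total=0.20):
    """Assign a class from a per-residue string of H (helix), E (strand), C (coil).

    minor_max and low_total are assumed cut-offs, not values from the slides.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    helix = ss.count("H") / len(ss)
    strand = ss.count("E") / len(ss)
    if helix + strand < low_total:
        return "low secondary structure content"
    if strand <= minor_max and helix > strand:
        return "mainly alpha"
    if helix <= minor_max and strand > helix:
        return "mainly beta"
    return "alpha-beta"


# Same proportions as the 1cgt03 example: 80% strand, 10% helix.
print(assign_class("E" * 8 + "H" + "C"))
```

Used as a pre-filter, only pairs whose two domains receive the same class would go on to be scored.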
most accurate is method number 5: “Alignment of secondary structure elements without additional scoring”, with 27.18% true positives.
partial agreement between classification schemes: FSSP compared with SCOP: 61.1%, FSSP compared with CATH: 46.7%
methods that use 2nd structure alignments are in better agreement with CATH
accuracy ordering of methods doesn’t correspond to their relative complexity
methods that use 2nd structure usually don’t benefit notably from class pre-filter.
Second evaluation: CASP-like sensitivity
similarly to CASP – we measure the sensitivity of each method:
what is the probability of a method correctly assigning a fold?
lower control: a random proportional fold assignment
upper control: FSSP was used as a scoring method
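One way to read this measure is sketched below, under the assumption that a fold counts as correctly assigned when a domain's single best-scoring partner has the same CATH fold. The slides do not spell out the exact definition, so treat the function and its inputs as hypothetical.

```python
def sensitivity(scores, fold_of):
    """Percentage of domains whose top-scoring partner shares their fold.

    scores: dict mapping an unordered pair (a, b) to a similarity score.
    fold_of: dict mapping each domain name to its fold label.
    """
    best = {}  # domain -> (best score seen, partner with that score)
    for (a, b), score in scores.items():
        for query, hit in ((a, b), (b, a)):
            if query not in best or score > best[query][0]:
                best[query] = (score, hit)
    correct = sum(1 for query, (_, hit) in best.items()
                  if fold_of[query] == fold_of[hit])
    return 100.0 * correct / len(best)


# Toy scores: a and d share a fold, b does not.
scores = {("a", "d"): 0.99, ("a", "b"): 0.63, ("b", "d"): 0.40}
fold_of = {"a": "X", "d": "X", "b": "Y"}
print(sensitivity(scores, fold_of))
```

Unlike the first evaluation, this looks at one decision per domain rather than at the whole ranked list of pairs.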
Sensitivity results:
method 5 wins again: 31.8% sensitivity.
other 2nd structure based methods follow with a small gap.
sensitivity order of the methods ~ true positive percentage order.
Similarity trees - can we construct a classification?
Best method’s similarity scores for all pairs were
clustered into a tree.
a. globin-like <> casein kinase
b. immunoglobulin-like <> thrombin subunit H
whole tree: generally disordered
[Figure: subtrees (a) and (b) of the similarity tree. Leaf labels:
(a) 1ckjA2, 1irk02, 1phk02, 1ampE2, 1hcl02, 1ckjA2, 1gdj00, 1kobA2, 1hbg00, 1babA0, 1lhs00, 1mba00, 1eca00, 1ithA0, 1ash00, 1flp00, 1sctA0, 1cpcA0, 1ddt02, 1colA0
(b) 1bec01, 1tcrA2, 1edhA2, 1nfkA1, 1itbB1, 1cgt03, 1svpA2, 1jxpA2, 1try02, 1sgt02, 1sgpE1, 1sgpE2, 1dar02]
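Turning pairwise similarity scores into a tree can be sketched with plain average-linkage agglomerative clustering. The slides do not say which clustering method was used, so this choice is an assumption. The domain names and scores in the example are made up.

```python
def cluster_tree(names, similarity):
    """Average-linkage clustering of a symmetric similarity matrix.

    Returns a nested tuple tree; names must be non-empty and
    similarity[i][j] is the score between names[i] and names[j].
    """
    # cluster id -> (subtree, indices of the member domains)
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)

    def average_similarity(a, b):
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(similarity[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        candidate_pairs = [(a, b) for k, a in enumerate(ids) for b in ids[k + 1:]]
        # Merge the two clusters that are most similar on average.
        a, b = max(candidate_pairs, key=lambda pair: average_similarity(*pair))
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Made-up scores: g1 and g2 resemble each other, k1 resembles neither.
names = ["g1", "g2", "k1"]
similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(cluster_tree(names, similarity))
```

A tree built this way is only as good as the scores behind it, so unrelated domains with similar secondary structure element patterns end up on neighbouring branches, as in the subtrees above.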
Conclusions
1. Baseline level to be exceeded by fold recognition methods:
27% true positive assignments allowing 3% false positive;
sensitivity level of 32%.
2. methods which make use of 2nd structure information
seem more accurate and sensitive than those that don’t.
3. simple 2nd structure alignments alone cannot construct
a reliable classification hierarchy.
4. the agreement between FSSP, SCOP and CATH
classification schemes is surprisingly low.