Lesson 5
Transcript of Lesson 5
![Page 1: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/1.jpg)
Lesson 5
Protein Prediction and Classification
![Page 2: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/2.jpg)
Learning about a protein
What does a protein do?
• Post-translational modifications – phosphorylation, glycosylation, etc.
• Identifying patterns, motifs
• Secondary structure
• Tertiary/quaternary structure
• Protein-protein interactions
![Page 3: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/3.jpg)
Domains & Motifs
![Page 4: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/4.jpg)
Domains
An analysis of known 3-D protein structures reveals that, rather than being monolithic, many of them contain multiple folding units.
Each such folding unit is a domain (>50 aa, <500 aa).
![Page 5: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/5.jpg)
calcium/calmodulin-dependent protein kinase
SH2 domains interact with phosphorylated tyrosines and are thus part of intracellular signal-transducing proteins. They are characterized by specific sequences and tertiary structure.
![Page 6: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/6.jpg)
What is a motif?
A sequence motif = a certain sequence that is widespread and conjectured to have biological significance.
Examples:
• KDEL – ER-lumen retention signal
• PKKKRKV – an NLS (nuclear localization signal)
![Page 7: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/7.jpg)
More loosely defined motifs
KDEL (usually) + HDEL (rarely) = [HK]-D-E-L: H or K at the first position
This is called a pattern (in biology), or a regular expression (in computer science).
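As a rough illustration of the "pattern = regular expression" idea, the [HK]-D-E-L pattern can be written with Python's `re` module. The helper name and the test sequences below are made up for this sketch; they do not come from the slides.

```python
import re

# [HK]-D-E-L as a regular expression: H or K, followed by D, E, L.
KDEL_LIKE = re.compile(r"[HK]DEL")

def has_kdel_like_motif(sequence: str) -> bool:
    """Return True if the sequence contains KDEL or HDEL anywhere."""
    return KDEL_LIKE.search(sequence) is not None

# Hypothetical sequences, for illustration only.
print(has_kdel_like_motif("MSTAAKDEL"))  # True
print(has_kdel_like_motif("MSTAAHDEL"))  # True
print(has_kdel_like_motif("MSTAARDEL"))  # False
```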
![Page 8: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/8.jpg)
Syntax of a pattern
Example: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].
![Page 9: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/9.jpg)
Patterns
W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].
• x(9,11): any amino acid, between 9 and 11 times
• [FYV]: F or Y or V
Example sequences:
WOPLASDFGYVWPPPLAWS
ROPLASDFGYVWPPPLAWS
WOPLASDFGYVWPPPLSQQQ
![Page 10: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/10.jpg)
Patterns - syntax
• The standard IUPAC one-letter codes are used.
• 'x': any amino acid.
• '[]': residues allowed at the position.
• '{}': residues forbidden at the position.
• '()': repetition of a pattern element is indicated in parentheses; x(n) or x(n,m) gives the number or range of repetitions.
• '-': separates each pattern element.
• '<': indicates an N-terminal restriction of the pattern.
• '>': indicates a C-terminal restriction of the pattern.
• '.': the period ends the pattern.
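The syntax rules above map almost one-to-one onto regular-expression syntax. The following is a minimal sketch of such a translation, not an official PROSITE tool: the function name `prosite_to_regex` is hypothetical, and splitting the example residue string from the Patterns slide into three separate sequences is an assumption.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal restriction
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core.lower() == "x":              # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # '[...]' and single letters are already valid regex fragments.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].")
print(regex)  # W.{9,11}[FYV][FYW].{6,7}[GSTNE]

# Assumed split of the example residue string into three sequences.
for seq in ["WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"]:
    print(seq, bool(re.search(regex, seq)))  # True, False, False
```

Under that assumed split, only the first sequence matches: the second lacks the leading W, and the third does not end the match with one of G, S, T, N, E.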
![Page 11: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/11.jpg)
Pattern ~ motif ~ signature
A pattern (similarly to a consensus and a profile) is a way to represent a conserved sequence.
Whereas a profile and a consensus usually relate to the entire sequence, a pattern usually relates to a few tens of amino acids.
![Page 12: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/12.jpg)
Profile-pattern-consensus
Multiple alignment:
AACTTG
AAGTCG
CACTTC
Consensus: AACTTG, or NANTNN with N marking the variable positions
Pattern: [AC]-A-[GC]-T-[TC]-[GC]
Profile:

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | 0.66 | 1 | 0 | 0 | . |
| T | 0 | 0 | 0 | 1 | . |
| C | 0.33 | 0 | 0.66 | 0 | . |
| G | 0 | 0 | 0.33 | 0 | . |

• Information: consensus < pattern < profile
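The three representations can all be derived from the same alignment columns. The sketch below uses the three aligned sequences from the slide; the function names are hypothetical, and the pattern lists residues alphabetically, so it prints [CG] where the slide writes [GC].

```python
from collections import Counter

ALIGNMENT = ["AACTTG", "AAGTCG", "CACTTC"]

def column_counts(alignment):
    """Count residues in each alignment column."""
    return [Counter(column) for column in zip(*alignment)]

def profile(alignment):
    """Per-column residue frequencies (the most informative representation)."""
    n = len(alignment)
    return [{res: count / n for res, count in counts.items()}
            for counts in column_counts(alignment)]

def consensus(alignment):
    """Most frequent residue in each column (the least informative)."""
    return "".join(counts.most_common(1)[0][0]
                   for counts in column_counts(alignment))

def pattern(alignment):
    """All residues observed in each column, without their frequencies."""
    elements = []
    for counts in column_counts(alignment):
        residues = "".join(sorted(counts))
        elements.append(residues if len(residues) == 1 else f"[{residues}]")
    return "-".join(elements)

print(consensus(ALIGNMENT))   # AACTTG
print(pattern(ALIGNMENT))     # [AC]-A-[CG]-T-[CT]-[CG]
print(profile(ALIGNMENT)[0])  # column 1: A about 0.67, C about 0.33
```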
![Page 13: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/13.jpg)
InterPro
InterPro: a collection of many protein signature databases (Prosite, Pfam, Prints…) integrated into a hierarchical classification system
![Page 14: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/14.jpg)
InterPro example
![Page 15: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/15.jpg)
PTM – Post-Translational Modification
![Page 16: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/16.jpg)
PTM – Post-Translational Modification
• Phosphorylation: Tyr, Ser, Thr
• Glycosylation (addition of sugars): Asn, Ser, Thr
• Addition of fatty acids (e.g. N-myristoylation, S-palmitoylation)
![Page 17: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/17.jpg)
So how to predict?
Take into account:
1. Context (motif):
PKC (a kinase) recognizes X S/T X R/K
N-myristoylation at M G X X X S/T
In many cases we don't know the exact motif!
2. Conservation:
Is the motif found (for instance, in human) also conserved in related organisms (for instance, in chimp)?
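The two criteria above (context, then conservation) can be sketched in a few lines. This is only an illustration under strong assumptions: the sequences are invented, the two sequences are assumed to be already aligned without gaps, and the function names are hypothetical.

```python
import re

# PKC context from the slide: X S/T X R/K.
# The lookahead lets overlapping sites be reported as well.
PKC_SITE = re.compile(r"(?=.([ST]).[RK])")

def candidate_pkc_sites(sequence):
    """Return 1-based positions of S/T residues in an X-[ST]-X-[RK] context."""
    return [m.start() + 2 for m in PKC_SITE.finditer(sequence)]

def conserved_sites(seq_a, seq_b):
    """Sites predicted at the same position in both (gap-free, aligned) sequences."""
    return sorted(set(candidate_pkc_sites(seq_a)) & set(candidate_pkc_sites(seq_b)))

# Hypothetical orthologous fragments, for illustration only.
human = "MASARKTLK"
chimp = "MASARKALK"
print(candidate_pkc_sites(human))     # [3, 7]
print(candidate_pkc_sites(chimp))     # [3]
print(conserved_sites(human, chimp))  # [3]
```

A site that matches the context in one species but not in its relative (position 7 here) would be treated with more suspicion than a conserved one.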
![Page 18: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/18.jpg)
Prediction problems
• Signal for detection is very short
• Not enough biological knowledge for characterizing the signal
• Tertiary structure
![Page 19: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/19.jpg)
Prediction will be more efficient if more information is available
![Page 20: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/20.jpg)
Secondary Structure
![Page 21: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/21.jpg)
Secondary Structure
Reminder: secondary structure is usually divided into three categories:
• Alpha helix
• Beta strand (sheet)
• Anything else – turn/loop
![Page 22: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/22.jpg)
Secondary Structure
An easier question – what is the secondary structure when the 3D structure is known?
![Page 23: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/23.jpg)
DSSP
DSSP (Dictionary of Protein Secondary Structure) – assigns secondary structure to proteins which have a crystal structure
H = alpha helix
B = beta bridge (isolated residue)
E = extended beta strand
G = 3-turn helix
I = 5-turn helix
T = hydrogen bonded turn
S = bend
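The DSSP codes listed above are often collapsed into the three categories from the reminder slide (helix, strand, anything else). The mapping below is one common convention, stated here as an assumption; other reductions exist, and the function name is hypothetical.

```python
# Assumed reduction: H, G, I -> helix; E, B -> strand; everything else -> coil.
HELIX = set("HGI")
STRAND = set("EB")

def reduce_to_three_states(dssp_string: str) -> str:
    """Collapse a DSSP string into H (helix), E (strand), C (other)."""
    return "".join(
        "H" if code in HELIX else "E" if code in STRAND else "C"
        for code in dssp_string
    )

# Hypothetical DSSP assignment; a blank stands for 'no assignment'.
print(reduce_to_three_states("HHHGGTTEEB S"))  # HHHHHCCEEECC
```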
![Page 24: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/24.jpg)
Predicting secondary structure from primary sequence
![Page 25: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/25.jpg)
Chou and Fasman (1974)

| Name | P(a) | P(b) | P(turn) |
|---|---|---|---|
| Alanine | 142 | 83 | 66 |
| Arginine | 98 | 93 | 95 |
| Aspartic Acid | 101 | 54 | 146 |
| Asparagine | 67 | 89 | 156 |
| Cysteine | 70 | 119 | 119 |
| Glutamic Acid | 151 | 37 | 74 |
| Glutamine | 111 | 110 | 98 |
| Glycine | 57 | 75 | 156 |
| Histidine | 100 | 87 | 95 |
| Isoleucine | 108 | 160 | 47 |
| Leucine | 121 | 130 | 59 |
| Lysine | 114 | 74 | 101 |
| Methionine | 145 | 105 | 60 |
| Phenylalanine | 113 | 138 | 60 |
| Proline | 57 | 55 | 152 |
| Serine | 77 | 75 | 143 |
| Threonine | 83 | 119 | 96 |
| Tryptophan | 108 | 137 | 96 |
| Tyrosine | 69 | 147 | 114 |
| Valine | 106 | 170 | 50 |

Each value is the propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity for being in an alpha helix or a beta sheet; it is a "breaker").
![Page 26: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/26.jpg)
Chou-Fasman prediction
• Look for a series of >4 amino acids which all have (for instance) alpha helix values >100
• Extend (…)
• Accept as alpha helix if average alpha score > average beta score

| | Ala | Pro | Tyr | Phe | Phe | Lys | Lys | His | Val | Ala | Thr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| α | 142 | 57 | 69 | 113 | 113 | 114 | 114 | 100 | 106 | 142 | 83 |
| β | 83 | 55 | 147 | 138 | 138 | 74 | 74 | 87 | 170 | 83 | 119 |
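The first and last steps can be sketched on the example peptide. This is a simplified illustration, not the full Chou-Fasman procedure: the extension step, which the slide elides, is left out, and the function name is hypothetical. One assumption matters for the result: the threshold is treated as inclusive (at least 100), because His scores exactly 100; with a strict "> 100" this peptide contains no run longer than four residues.

```python
# (P(alpha), P(beta)) from the propensity table, only for the residues used here.
PROPENSITY = {
    "Ala": (142, 83), "Pro": (57, 55), "Tyr": (69, 147), "Phe": (113, 138),
    "Lys": (114, 74), "His": (100, 87), "Val": (106, 170), "Thr": (83, 119),
}

def helix_candidates(residues, min_score=100, min_run=5):
    """Return (start, end) index pairs (0-based, inclusive) of accepted helices.

    A run is accepted if it has at least `min_run` consecutive residues with
    P(alpha) >= min_score and its mean P(alpha) exceeds its mean P(beta).
    """
    segments = []
    start = None
    for i, res in enumerate(list(residues) + [None]):  # None closes the last run
        favourable = res is not None and PROPENSITY[res][0] >= min_score
        if favourable and start is None:
            start = i
        elif not favourable and start is not None:
            run = residues[start:i]
            if len(run) >= min_run:
                mean_alpha = sum(PROPENSITY[r][0] for r in run) / len(run)
                mean_beta = sum(PROPENSITY[r][1] for r in run) / len(run)
                if mean_alpha > mean_beta:
                    segments.append((start, i - 1))
            start = None
    return segments

peptide = ["Ala", "Pro", "Tyr", "Phe", "Phe", "Lys", "Lys", "His", "Val", "Ala", "Thr"]
print(helix_candidates(peptide))  # [(3, 9)]: Phe Phe Lys Lys His Val Ala
```

For that seven-residue run the mean alpha score is about 114.6 against a mean beta score of about 109.1, so it is accepted as a helix under these assumptions.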
![Page 27: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/27.jpg)
Chou and Fasman (1974)
Success rate of 50%
![Page 28: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/28.jpg)
Improvements in the 1980's
• Conservation in MSA
• Smarter algorithms (e.g. HMM, neural networks).
![Page 29: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/29.jpg)
Accuracy
Accuracy of prediction seems to hit a ceiling of 70-80%.

| Method | Accuracy |
|---|---|
| Chou & Fasman | 50% |
| Adding the MSA | 69% |
| MSA + sophisticated computations | 70-80% |
![Page 30: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/30.jpg)
Gene Ontology
![Page 31: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/31.jpg)
GO
Gene Ontology – a project for consistent description of gene products in different databases.
Consistent description - common key definitions.
Example: 'protein synthesis' or 'translation'
![Page 32: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/32.jpg)
GO
GO describes proteins in terms of:
• biological process
• cellular component
• molecular function
GO is not:
– A sequence database.
– A portal for sequence information
![Page 33: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/33.jpg)
GO – structure
[Figure: hierarchy of terms: cellular component → cell → nucleus → nuclear chromosome]
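The hierarchy on this slide can be represented as a child-to-parent mapping. The sketch below is a toy with one parent per term, using only the four terms named on the slide in their assumed order; the real Gene Ontology is a graph in which a term may have several parents. The names `PARENT` and `ancestors` are hypothetical.

```python
# Toy child -> parent mapping built from the four terms on the slide.
PARENT = {
    "nuclear chromosome": "nucleus",
    "nucleus": "cell",
    "cell": "cellular component",
}

def ancestors(term: str) -> list:
    """Walk up the hierarchy and return all broader terms, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

print(ancestors("nuclear chromosome"))
# ['nucleus', 'cell', 'cellular component']
```

A protein annotated with a specific term is implicitly annotated with every broader term above it, which is what makes category-level comparisons possible.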
![Page 34: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/34.jpg)
GO example
Links from the Swiss-Prot entry of human protein kinase C alpha
![Page 35: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/35.jpg)
Examples for use of GO
Enrichment for a GO category:
1. Do all up-regulated genes in a microarray you built belong to the same GO "molecular function" category?
2. You have predicted a new transcription factor binding site. Do all genes with this site belong to the same GO biological process?
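The slides do not say how enrichment is tested. One common choice, used here as an assumption, is a one-sided hypergeometric test: given how many genes in the whole set carry a GO category, how surprising is the number found in the selected gene list? The function name and all numbers below are hypothetical.

```python
from math import comb

def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(at least `hits` annotated genes in a random sample), hypergeometric.

    population: all genes considered (e.g. all genes on the microarray)
    annotated:  genes in the population that carry the GO category
    sample:     size of the selected list (e.g. up-regulated genes)
    hits:       genes in the list that carry the GO category
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)

# Tiny worked case: 10 genes, 3 annotated, and all 3 selected genes are annotated.
print(enrichment_p_value(10, 3, 3, 3))  # 1/120, about 0.0083
```

A small p-value suggests the category is over-represented in the list; with many categories tested at once, a multiple-testing correction would also be needed.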
![Page 36: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/36.jpg)
Evaluation of prediction methods
![Page 37: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/37.jpg)
Evaluation of prediction methods
Comparing our results to experimentally verified sites

| Is the prediction correct? | Our prediction gives: Positive (hit) | Our prediction gives: Negative |
|---|---|---|
| True | True-positive | True-negative |
| False | False-positive (false alarm) | False-negative (miss) |
![Page 38: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/38.jpg)
Method evaluation

| Is the prediction correct? | Our prediction gives: Positive (hit) | Our prediction gives: Negative |
|---|---|---|
| True | True-positive | True-negative |
| False | False-positive (false alarm) | False-negative (miss) |

A good method will be one with a high level of true-positives and true-negatives, and a low level of false-positives and false-negatives
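The four outcomes can be counted by comparing a set of predicted sites with a set of experimentally verified sites. This is a minimal sketch with invented site positions; the function name is hypothetical.

```python
def confusion_counts(candidates, predicted, verified):
    """Count the four outcomes over all candidate sites.

    candidates: every site that could have been predicted
    predicted:  sites the method called positive
    verified:   sites that are experimentally confirmed
    """
    candidates, predicted, verified = set(candidates), set(predicted), set(verified)
    not_predicted = candidates - predicted
    return {
        "TP": len(predicted & verified),      # hit on a real site
        "FP": len(predicted - verified),      # false alarm
        "FN": len(not_predicted & verified),  # miss
        "TN": len(not_predicted - verified),  # correctly left alone
    }

# Hypothetical example: 10 candidate sites, 4 predicted, 3 verified.
counts = confusion_counts(range(1, 11), {2, 3, 5, 8}, {3, 5, 9})
print(counts)  # {'TP': 2, 'FP': 2, 'FN': 1, 'TN': 5}
```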
![Page 39: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/39.jpg)
Calibrating the method
All methods have a parameter (or a score) that can be calibrated to improve the accuracy of the method.
For example: the E-value cutoff in BLAST
![Page 40: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/40.jpg)
Calibrating E-value cutoff
Reminder: the lower the E-value, the more 'significant' the alignment between the query and the hit.
![Page 41: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/41.jpg)
Calibrating the E-value
What will happen if we raise the E-value cutoff (for instance, work with all hits with an E-value < 10)?

| Is the prediction correct? | Our prediction gives: Positive (hit) | Our prediction gives: Negative |
|---|---|---|
| True | True-positive | True-negative |
| False | False-positive (false alarm) | False-negative (miss) |
![Page 42: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/42.jpg)
Calibrating the E-value
On the other hand, what if we lower the E-value cutoff (look only at hits with an E-value < 10^-8)?

| Is the prediction correct? | Our prediction gives: Positive (hit) | Our prediction gives: Negative |
|---|---|---|
| True | True-positive | True-negative |
| False | False-positive (false alarm) | False-negative (miss) |
![Page 43: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/43.jpg)
Improving prediction
Trade-off between specificity and sensitivity
![Page 44: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/44.jpg)
Sensitivity vs. specificity

$$\text{Sensitivity} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

The denominator represents all the proteins which are really phosphorylated. Sensitivity measures how well we hit real phosphorylations.

$$\text{Specificity} = \frac{\text{True negative}}{\text{True negative} + \text{False positive}}$$

The denominator represents all the proteins which are really NOT phosphorylated. Specificity measures how well we avoid real non-phosphorylations.
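The two ratios are simple to compute once the four counts are known. The counts in this sketch are invented purely to give the formulas something to work on.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of really phosphorylated sites that were predicted."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of really non-phosphorylated sites that were left unpredicted."""
    return tn / (tn + fp)

# Hypothetical counts: 50 real sites (40 found), 950 non-sites (50 false alarms).
print(sensitivity(tp=40, fn=10))            # 0.8
print(round(specificity(tn=900, fp=50), 3)) # 0.947
```

Both functions divide by zero if the corresponding class is empty (no real positives, or no real negatives); a production version would need to handle that case.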
![Page 45: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/45.jpg)
Raising the E-value cutoff to 10: sensitivity ↑, specificity ↓
Lowering the E-value cutoff to 10^-8: sensitivity ↓, specificity ↑
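The effect of moving the cutoff can be seen on a small labelled list of hits. Everything in this sketch is hypothetical: the E-values, the true/false labels, and the function name. It also assumes that every candidate hit is available with its E-value, including those a real BLAST run would not report.

```python
# (E-value, is it a real homolog?) pairs, invented for illustration.
HITS = [
    (1e-30, True), (1e-12, True), (1e-9, False), (1e-5, True),
    (0.01, False), (0.5, True), (3.0, False), (8.0, False),
]

def rates_at_cutoff(hits, cutoff):
    """Return (sensitivity, specificity) when hits with E-value < cutoff are accepted."""
    tp = sum(1 for e, real in hits if e < cutoff and real)
    fp = sum(1 for e, real in hits if e < cutoff and not real)
    fn = sum(1 for e, real in hits if e >= cutoff and real)
    tn = sum(1 for e, real in hits if e >= cutoff and not real)
    return tp / (tp + fn), tn / (tn + fp)

print(rates_at_cutoff(HITS, 10))    # (1.0, 0.0): nothing missed, no false alarm avoided
print(rates_at_cutoff(HITS, 1e-8))  # (0.5, 0.75): fewer false alarms, more misses
```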
![Page 46: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/46.jpg)
Over-predictions: example
Many PTM predictors tend to over-predict → high level of false positives → low specificity
WHY?
1. Tertiary structure! (buried/exposed, tertiary motifs)
2. The phosphorylation recognition mechanism is not completely clear!
![Page 47: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/47.jpg)
Next time on:
Biological Sequences Analysis
![Page 48: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/48.jpg)
The Human Genome
![Page 49: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/49.jpg)
Horizontal (Lateral) Gene Transfer
![Page 50: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/50.jpg)
Alternative splicing
![Page 51: Lesson 5](https://reader036.fdocuments.us/reader036/viewer/2022062500/56815567550346895dc33204/html5/thumbnails/51.jpg)
Repetitive Elements