Cédric Notredame (08/10/2015) Comparing Two Protein Sequences Cédric Notredame.
Cédric Notredame (19/04/23)
Our Scope
Pairwise Alignment methods are POWERFUL
Pairwise Alignment methods are LIMITED
If You Understand the LIMITS they Become VERY POWERFUL
Look once Under the Hood
Outline
-WHY Does It Make Sense To Compare Sequences
-HOW Can we Align Two Sequences ?
-HOW can I Search a Database ?
-HOW Can we Compare Two Sequences ?
Why Do We Want To Compare Sequences
wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
         | | |||||||| || | |||    ||| | ||||| ||||
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
EXTRAPOLATE
??????
Homology?
SwissProt
Why Does It Make Sense To Align Sequences ?
-Evolution is our Real Tool.
-Nature is LAZY and Keeps re-using Stuff.
-Evolution is mostly DIVERGENT
Same Sequence Same Ancestor
Why Does It Make Sense To Align Sequences ?
Same Sequence
Same Function
Same 3D Fold
Same Origin
An Alignment is a STORY
ADKPKRPLSAYMLWLN
ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN
ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN
Mutations + Selection
An Alignment is a STORY
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation
Insertion
Deletion
ADKPKRPLSAYMLWLN
ADKPKRPKPRLSAYMLWLN    ADKPRRPLS-YMLWLN
ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN
Mutations + Selection
Evolution is NOT Always Divergent…
AFGP with (ThrAlaAla)n, similar to trypsinogen
AFGP with (ThrAlaAla)n, NOT similar to trypsinogen
(Figure labels: N, S)
Chen et al., 1997, PNAS, 94, 3811-16
Evolution is NOT Always Divergent
AFGP with (ThrAlaAla)n, similar to trypsinogen
AFGP with (ThrAlaAla)n, NOT similar to trypsinogen
(Figure labels: N, S)
SIMILAR Sequences BUT
DIFFERENT origin
Evolution is NOT always Divergent…
But in MOST cases, you may assume it is…
Same Sequence
Same Function
Same 3D Fold
Same Origin
Similar Function DOES NOT REQUIRE Similar Sequence
Similar Sequence
Historical Legacy
How Do Sequences Evolve ?
CONSTRAINED Genome Positions Evolve SLOWLY
EVERY Protein Family Has its Own Level Of Constraint
Family            Ks    Ka
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
Globin            5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8
Rates in Substitutions/site/Billion Years, as measured on Mouse vs Human (80 Million years).
Ks: Synonymous Mutations; Ka: Non-Synonymous Mutations.
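How Do Sequences Evolve ? The Amino Acids Venn Diagram
To Make Things Worse, Every Residue has its Own Personality
[Figure: Venn diagram of amino acid properties. Sets: Hydrophobic, Aliphatic (L, I, V), Aromatic, Polar, Small. Residues shown: G, C, L, I, V, A, F, S, T, W, Y, Q, H, K, R, E, D, N, P]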
How Do Sequences Evolve ?
In a structure, each Amino Acid plays a Special Role
[Figure: OmpR, C-terminal domain, with surface charges marked - and +]
In the core, SIZE MATTERS
On the surface, CHARGE MATTERS
How Do Sequences Evolve ?
Accepted Mutations Depend on the Structure
In the core: Big -> Big, Small -> Small, NO DELETION
On the surface: Charged -> Charged, Small <-> Big or Small, DELETIONS
How Can We Compare Sequences ?
To Compare Two Sequences, We need:
Their Function
Their Structure
We Do Not Have Them !!!
How Can We Compare Sequences ?
We will Need To Replace Structural Information With Sequence Information.
Same Sequence
Same Function
Same 3D Fold
Same Origin
It CANNOT Work ALL THE TIME !!!
How Can We Compare Sequences ?
To Compare Sequences, We Need to Compare Residues.
We Need to Know How Much it COSTS to SUBSTITUTE
-an Alanine into an Isoleucine
-a Tryptophan into a Glycine
-…
The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX.
How to derive that matrix?
How Can We Compare Sequences ?
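[Figure: Venn diagram of amino acid properties. Sets: Hydrophobic, Aliphatic (L, I, V), Aromatic, Polar, Small. Residues shown: G, C, L, I, V, A, F, S, T, W, Y, Q, H, K, R, E, D, N, P]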
Using Knowledge Could Work
But we do not know enough about Evolution and Structure.
Using Data works better.
How Can We Compare Sequences ? Making a Substitution Matrix
-Take 100 nice pairs of Protein Sequences, easy to align (80% identical).
-Align them…
-Count each mutation in the alignments
  -25 Tryptophans into Phenylalanine
  -30 Isoleucines into Leucine
  -…
-For each mutation, set the substitution score to the log odds ratio:
Score = log (Observed / Expected by chance)
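The log odds step can be sketched in a few lines of Python. The two observed counts are the examples from the slide; the expected-by-chance counts are invented placeholders, and the base-10 logarithm is an assumption, since the slide specifies neither the base nor any scaling.

```python
import math

def log_odds(observed, expected):
    """Log odds score: positive when a substitution is seen more often than chance predicts."""
    return math.log10(observed / expected)

# Observed counts are the two examples on the slide.
observed = {("W", "F"): 25, ("I", "L"): 30}
# Expected-by-chance counts are hypothetical values, used only for illustration.
expected = {("W", "F"): 5.0, ("I", "L"): 10.0}

scores = {pair: log_odds(observed[pair], expected[pair]) for pair in observed}
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A substitution observed exactly as often as expected scores 0, and one observed less often than expected gets a negative score.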
How Can We Compare Sequences ? Making a Substitution Matrix
The Diagonal Indicates How Conserved a residue tends to be. W is VERY Conserved.
Some Residues are Easier To mutate into other, similar residues.
Cysteines that make disulfide bridges and those that do not get averaged.
How Can We Compare Sequences ? Using Substitution Matrix
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation
Insertion
Deletion
Given two Sequences and a substitution Matrix, We must Compute the CHEAPEST Alignment
Scoring an Alignment
Most popular Substitution Matrices
• PAM250
• Blosum62 (Most widely used)
Raw Score
TPEA
¦| |
APGA
Score = 1 + 6 + 0 + 2 = 9
• Question: Is it possible to get such a good alignment by chance only?
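The raw score on this slide is just the sum of the substitution scores over the aligned columns. A minimal sketch, using only the four pair scores that appear in the slide's example (a real matrix has an entry for every residue pair; `SLIDE_SCORES`, `pair_score` and `raw_score` are names made up for this illustration):

```python
# Only the four residue pairs used in the slide's example.
SLIDE_SCORES = {("T", "A"): 1, ("P", "P"): 6, ("E", "G"): 0, ("A", "A"): 2}

def pair_score(a, b, matrix):
    """Look up a substitution score; the matrix is treated as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def raw_score(seq1, seq2, matrix):
    """Sum the substitution scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))

print(raw_score("TPEA", "APGA", SLIDE_SCORES))  # 1 + 6 + 0 + 2 = 9
```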
Insertions and Deletions
Gap Penalties
• Opening a gap is more expensive than extending it
Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT
Gap Opening Penalty, Gap Extension Penalty
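A small sketch of how an opening penalty and an extension penalty combine. It assumes one common convention (the first gap position pays the opening penalty, each further position pays the extension penalty); programs differ on this detail, and the values `GOP = 10` and `GEP = 1` are hypothetical.

```python
import re

def affine_gap_cost(length, gop, gep):
    """Cost of one gap: opening penalty once, extension penalty for each extra position."""
    if length <= 0:
        return 0
    return gop + gep * (length - 1)

def total_gap_cost(aligned, gop, gep):
    """Sum the affine cost over every run of '-' in one aligned sequence."""
    return sum(affine_gap_cost(len(run.group()), gop, gep)
               for run in re.finditer(r"-+", aligned))

GOP, GEP = 10, 1  # hypothetical penalties, for illustration only

print(total_gap_cost("GARFIELDTHE----CAT", GOP, GEP))  # one gap of 4: 10 + 3*1 = 13
print(total_gap_cost("GARFIELD--THE--CAT", GOP, GEP))  # two gaps of 2: 2 * (10 + 1) = 22
```

With these numbers, one long gap is cheaper than two short gaps covering the same number of positions, which is the behaviour the slide describes.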
How Can We Compare Sequences ? Limits of the substitution Matrices
They ignore non-local interactions and Assume that identical residues are equal
They assume evolution rate to be constant
ADKPKRPLSAYMLWLN
ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN
ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN
Mutations + Selection
How Can We Compare Sequences ? Limits of the substitution Matrices
Substitution Matrices Cannot Work !!!
How Can We Compare Sequences ? Limits of the substitution Matrices
I know… But at least, could I get some idea of when they are likely to do all right?
How Can We Compare Sequences ? The Twilight Zone
[Figure: % Sequence Identity plotted against Length. Axis marks: 100, 30%, 30. Regions: Similar Sequence, Similar Structure (Same 3D Fold); Twilight Zone; Different Sequence, Structure ????]
How Can We Compare Sequences ? The Twilight Zone
Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
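The rule of thumb above (more than 30% identity over more than 100 residues) can be checked on an existing alignment. This sketch assumes identity is measured over the columns where both sequences have a residue; other definitions of percent identity exist, and the function names are made up for this example.

```python
def percent_identity(aligned1, aligned2):
    """Return (% identity, number of residue-to-residue columns) for two aligned sequences."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns containing a gap are ignored.
    columns = [(a, b) for a, b in zip(aligned1, aligned2) if a != "-" and b != "-"]
    if not columns:
        return 0.0, 0
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns), len(columns)

def outside_twilight_zone(aligned1, aligned2, min_identity=30.0, min_length=100):
    """True when the alignment has more than 30% identity over more than 100 residues."""
    identity, length = percent_identity(aligned1, aligned2)
    return identity > min_identity and length > min_length

# The GARFIELD example is 100% identical but far too short for the rule to apply.
print(percent_identity("GARFIELDTHE----CAT", "GARFIELDTHELASTCAT"))       # (100.0, 14)
print(outside_twilight_zone("GARFIELDTHE----CAT", "GARFIELDTHELASTCAT"))  # False
```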
How Can We Compare Sequences ? Which Matrix Shall I use
The Initial PAM matrix was computed on 80% similar Proteins
It has been extrapolated to more distantly related sequences:
Pam 250
Pam 350
Other Matrices Exist:
BLOSUM 42
BLOSUM 62
How Can We Compare Sequences ? Which Matrix Shall I use
PAM: Distant Proteins -> High Index (PAM 350)
BLOSUM: Distant Proteins -> Low Index (Blosum30)
•GONNET 250 > BLOSUM62 > PAM 250.
•But This will depend on:
  •The Family.
  •The Program Used and Its Tuning.
Choosing The Right Matrix may be Tricky…
•Insertions, Deletions?
HOW Can we Align Two Sequences ?
Dot Matrices
Global Alignments
Local Alignment
Dot Matrices
>Seq1
THEFATCAT
>Seq2
THELASTCAT
[Dot matrix axes: T H E F A T C A T against T H E F A S T C A T]
Window
Stringency
Dot Matrices: Stringency
Window=1, Stringency=1
Window=11, Stringency=7
Window=25, Stringency=15
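A minimal sketch of a dot matrix with a window and a stringency. It assumes a dot is placed at the start of a window when at least `stringency` of its `window` positions are identical; real dot plot programs may centre the window or score with a substitution matrix instead. The function names are made up for this example.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1):
    """Return the set of (i, j) window starts with at least `stringency` identities."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if identical >= stringency:
                dots.add((i, j))
    return dots

def render(seq1, seq2, dots):
    """Draw the matrix as text: seq1 across the top, seq2 down the side."""
    lines = ["  " + " ".join(seq1)]
    for j, residue in enumerate(seq2):
        row = " ".join("*" if (i, j) in dots else "." for i in range(len(seq1)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

seq1, seq2 = "THEFATCAT", "THEFASTCAT"
print(render(seq1, seq2, dot_matrix(seq1, seq2, window=1, stringency=1)))
print()
# A longer window with a strict threshold removes most of the isolated dots.
print(render(seq1, seq2, dot_matrix(seq1, seq2, window=3, stringency=3)))
```

With Window=1 and Stringency=1 every identical residue pair gives a dot, so the plot is noisy; raising both keeps only the diagonals.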
Dot Matrices: Limits
-Visual aid
-Best Way to EXPLORE the Sequence Organisation
-Does NOT provide us with an ALIGNMENT
wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
         | | |||||||| || | |||    ||| | ||||| ||||
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)
[Figure: Affine Gap Penalty. Cost plotted against gap length L, with GOP and GEP marked]
Parsimony: Evolution takes the simplest path (So We Think…)
Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)
>Seq1
THEFATCAT
>Seq2
THEFASTCAT
-DYNAMIC PROGRAMMING
THEFA-TCAT
THEFASTCAT
Global Alignments
Brute Force Enumeration
F A S T
F A T

----FAT
FAST---

---FAT-
FAST---

--F-AT-
FAST---

Number of alignments to enumerate: (L1+L2)! / ((L1)! * (L2)!)
DYNAMIC PROGRAMMING
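To see why brute force enumeration is hopeless, the formula on the slide can be evaluated directly. The sketch below uses `math.comb`, since (L1+L2)! / (L1! * L2!) is a binomial coefficient; the exact count depends on which alignments are treated as distinct, so the numbers are an order of magnitude, not a precise census.

```python
import math

def alignment_count(l1, l2):
    """Evaluate (L1+L2)! / (L1! * L2!), the slide's count of alignments to enumerate."""
    return math.comb(l1 + l2, l1)

print(alignment_count(4, 3))   # FAST against FAT: 35
# Two sequences of 100 residues already give a number with dozens of digits.
print(len(str(alignment_count(100, 100))), "digits")
```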
Global Alignments: DYNAMIC PROGRAMMING
Dynamic Programming (Needleman and Wunsch)
Match=1, MisMatch=-1, Gap=-1
[Figure: dynamic programming matrix for FAT against FAST, filled in step by step and traced back to the alignment below]
F A S T
F A - T
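The matrix fill and traceback can be sketched as follows, with the slide's scores (Match=1, MisMatch=-1, Gap=-1). This is a minimal version with a single linear gap penalty rather than separate GOP and GEP, and the tie-breaking order in the traceback (diagonal first) is an arbitrary choice of this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Each cell keeps the best of: substitution, gap in b, gap in a.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to the top-left corner.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("FAST", "FAT"))  # (2, 'FAST', 'FA-T')
```

The work grows with the product of the two lengths, instead of the factorial growth of brute force enumeration.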
Global Alignments: DYNAMIC PROGRAMMING
Global Alignments are very sensitive to gap Penalties
GOP
GEP
Global Alignments: DYNAMIC PROGRAMMING
Global Alignments are very sensitive to gap Penalties
Global Alignments do not take into account the MODULAR nature of Proteins
C: Vitamin K dependent Ca Binding
K: Kringle Domain
G: Growth Factor module
F: Finger Module
Local Alignments
GLOBAL Alignment
LOCAL Alignment
Smith And Waterman (SW)=LOCAL Alignment
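A minimal sketch of the local variant, assuming simple scores (match=1, mismatch=-1, gap=-1) chosen for illustration and a linear gap penalty. Compared with a global alignment, cell scores are never allowed to drop below 0, and the traceback starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until the score falls back to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Only the shared module is reported; the unrelated flanks are ignored.
print(smith_waterman("XXFATXX", "YYFATYY"))  # (3, 'FAT', 'FAT')
```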
Local Alignments
We now have a PairWise Comparison Algorithm,
We are ready to search Databases
Database Search
[Figure: the QUERY goes through the Comparison Engine (SW) against the Database; each hit is reported with a score and an E-value]
E-values: How many times do we expect such an Alignment by chance?
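The question behind the E-value can be illustrated with a toy experiment: shuffle the database sequence many times and count how often a shuffled version scores at least as well as the real one. This is only an illustration under strong assumptions; the shuffling approach, the position-by-position identity score standing in for the comparison engine, and all function names are choices made for this sketch, not the way database search programs actually compute E-values.

```python
import random

def identity_score(a, b):
    """Toy comparison engine: identical residues at the same position, no gaps."""
    return sum(x == y for x, y in zip(a, b))

def chance_fraction(query, subject, n_shuffles=1000, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real subject."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = identity_score(query, subject)
    residues = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if identity_score(query, "".join(residues)) >= observed:
            hits += 1
    return hits / n_shuffles

def expected_by_chance(query, subject, database_size, n_shuffles=1000, seed=0):
    """Rough expected number of chance hits this good in a database of that size."""
    return chance_fraction(query, subject, n_shuffles, seed) * database_size

print(expected_by_chance("THEFATCAT", "THEFATCAT", database_size=10000))
```

A value far below 1 suggests the match is unlikely to be a chance event; a value near or above 1 means hits this good are expected even among unrelated sequences.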
Sequence Comparison
-Thanks to evolution, We CAN compare Sequences
-There is a relation between Sequence and Structure.
-Substitution matrices only work well with similar Sequences (More than 30% id).
-The Easiest way to Compare Two Sequences is a dotplot.