Multiple alignments, PATTERNS, PSI-BLAST
Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique
LF-2002.10
Overview
Multiple alignments: how-to, goal, problems, use
Patterns: PROSITE database, syntax, use
PSI-BLAST: BLAST, matrices, use
[ Profiles/HMMs ] …
What is a multiple sequence alignment?
What can it do for me? How can I produce one of these? How can I use it?
chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse   -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
              ***. ::: .: ..  .    :  . .      *  .  *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
mouse   AKDDRIRYDNEMKSWEEQMAE
        *   : .* . :
How can I use a multiple alignment?
chite     ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat     --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr     KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown   -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
                ***. ::: .: ..  .    :  . .      *  .  *: *

chite     AATAKQNYIRALQEYERNGG-
wheat     ANKLKGEYNKAIAAYNKGESA
trybr     AEKDKERYKREM---------
unknown   AKDDRIRYDNEMKSWEEQMAE
          *   : .* . :
Extrapolation
SwissProt
Unknown Sequence
Homology?
How can I use a multiple alignment?
SwissProt
Unknown Sequence
Match?
Extrapolation
Prosite Patterns
How can I use a multiple alignment?
Extrapolation
Prosite Patterns
L?K>R
Prosite Profiles
-More Sensitive
-More Specific
AFDEFGHQIVLW
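A Prosite pattern is in effect a compact regular expression over residues. The sketch below is an illustration only, not the code used by PROSITE or by any scanning tool: it translates the common elements of the pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) into a Python regular expression. The function name prosite_to_regex is hypothetical, and the N-glycosylation pattern N-{P}-[ST]-{P} is used only as a well-known example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: anchors inside brackets (e.g. [<M]) are not handled.
    """
    pattern = pattern.strip().rstrip(".")      # patterns usually end with a period
    regex_parts = []
    for element in pattern.split("-"):         # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):            # '<' anchors at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # '>' anchors at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional repeat: x(2) or x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element in %r" % pattern)
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        # [ABC] and single letters are already valid regex fragments
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        regex_parts.append(prefix + core + suffix)
    return "".join(regex_parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}")
    print(regex)
    print(bool(re.search(regex, "AANGSAA")))
```

A pattern either matches or it does not, which is one reason profiles, which score every position, can be both more sensitive and more specific.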
How can I use a multiple alignment?
Phylogeny
[Figure: tree with leaves chite, wheat, trybr, mouse]
-Evolution
-Paralogy/Orthology
How can I use a multiple alignment?
Phylogeny
Structure Prediction
PHD for secondary structure prediction: 75% accurate.
Threading: improving, but not yet as good.
How can I use a multiple alignment?
Phylogeny
Structure Prediction
Caution!
Automatic multiple sequence alignment methods are not always perfect…
The problem
Why is it difficult to compute a multiple sequence alignment?
Computation
What is the good alignment?
Biology
What is a good alignment?
The problem
Why is it difficult to compute a multiple sequence alignment?
CIRCULAR PROBLEM...
Good Sequences
Good Alignment
What do I need to know to make a good multiple alignment?
How do sequences evolve?
How does the computer align the sequences?
How can I choose my sequences?
What is the best program?
How can I use my alignment?
An alignment is a story
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN    ADKPKRPLSAYMLWLN
Mutations + Selection
ADKPRRPLS-YMLWLN    ADKPKRPKPRLSAYMLWLN
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Insertion, Deletion, Mutation
Homology
Same sequences -> same origin? -> same function? -> same 3D fold?
[Plot: % sequence identity against alignment length, with marks at 30% identity and length 100; regions labelled "Same 3D Fold" and "Twilight Zone"]
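Percent identity is the quantity on one axis of this plot. A minimal sketch of how it can be computed from two rows of an alignment follows. The function name percent_identity is hypothetical, and the choice of denominator (columns where neither row has a gap) is an assumption: other conventions divide by the alignment length or by the shorter sequence, and give different numbers.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity between two aligned rows of equal length.

    Assumption: only columns where neither row has a gap ('-') are counted.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep the columns in which both sequences have a residue
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


if __name__ == "__main__":
    chite = "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD"
    wheat = "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE"
    print(round(percent_identity(chite, wheat), 1))
```

The value depends on the alignment it is computed from, so it should always be read together with the length over which it was measured.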
Convergent evolution
AFGP with (ThrAlaAla)n: similar to trypsinogen
AFGP with (ThrAlaAla)n: NOT similar to trypsinogen
N
S
Chen et al, 97, PNAS, 94, 3811-16
Residues and mutations
All residues are equal, but some more than others…
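[Figure: Venn diagram grouping the residues by property. Classes: Aliphatic, Aromatic, Hydrophobic, Polar, Small. Residue labels: P G, S C, L I, T, V A, W Y F, Q H, K, R, E D N, M, G C]
Accurate matrices are data driven rather than knowledge driven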
Substitution matrices
Different Flavors:
• PAM: 250, 350
• BLOSUM: 45, 62
• …
What is the best substitution matrix?
Mutation rates depend on families
Choosing the right matrix may be tricky
Gonnet250 > BLOSUM62 > PAM250
Depends on the family, the program used and its tuning
Family             S     N
Histone3           6.4   0
Insulin            4.0   0.1
Interleukin I      4.6   1.4
Globin             5.1   0.6
Apolipoprot. AI    4.5   1.6
Interferon G       8.6   2.8
Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)
Insertions and deletions?
[Plots: indel cost as a function of gap length L]
Affine gap penalty: Cost = GOP + GEP * L
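The affine formula above can be written out directly. In this sketch GOP is the gap opening penalty, GEP the gap extension penalty and L the gap length, as on the slide; the function names and the numeric penalties in the usage example are hypothetical and are not the defaults of any program. Some programs charge GEP only for the residues after the first one, which would change the formula slightly.

```python
from itertools import groupby


def affine_gap_cost(length: int, gop: float, gep: float) -> float:
    """Cost of one gap of the given length: GOP + GEP * L (0 if no gap)."""
    if length <= 0:
        return 0.0
    return gop + gep * length


def row_gap_cost(row: str, gop: float, gep: float) -> float:
    """Total affine cost of all gap runs ('-' stretches) in one aligned row."""
    return sum(
        affine_gap_cost(len(list(run)), gop, gep)
        for symbol, run in groupby(row)   # consecutive identical symbols
        if symbol == "-"
    )


if __name__ == "__main__":
    # Hypothetical penalties, chosen only for the illustration
    print(affine_gap_cost(3, gop=10.0, gep=0.5))
    print(row_gap_cost("AC---GT-A", gop=10.0, gep=0.5))
```

With an affine penalty, one long gap is cheaper than several short gaps of the same total length, which favours a single insertion or deletion event.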
How to align many sequences?
Exact algorithms are computationally expensive: Needleman & Wunsch, Smith & Waterman
-> heuristic required!
2 Globins => 1 sec
3 Globins => 2 min
4 Globins => 5 hours
5 Globins => 3 weeks
6 Globins => 9 years
7 Globins => 1000 years
8 Globins => 150 000 years
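For two sequences the exact method is a dynamic programming table with one cell per pair of positions. The sketch below computes the Needleman & Wunsch global alignment score under simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, and a linear gap penalty instead of an affine one. The function name and the default scores are hypothetical. For N sequences the same idea needs a table with N dimensions, which is why the running time grows so quickly with each added globin.

```python
def needleman_wunsch_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Optimal global alignment score of two sequences (score only)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned against gaps only
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap        # gap in seq_b
            left = score[i][j - 1] + gap      # gap in seq_a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN"))
```

The table has (len(seq_a) + 1) * (len(seq_b) + 1) cells; recovering the alignment itself would additionally require a traceback through the table.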
Existing methods
1-Carrillo and Lipman:
-MSA, DCA.
-Few, small, closely related sequences.
2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues.
-Do Well When They Can Run.
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate.
-Good Profile Generators.
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive.
Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
Dynamic Programming Using A Substitution Matrix
Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
-Depends on the ORDER of the sequences (Tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
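The dependence on the order of the sequences can be made concrete with a small guide-tree sketch. It takes a matrix of pairwise distances and returns the order in which sequences and groups would be joined, closest first. UPGMA (average linkage) is used here only as an assumption because it is short to write; the programs listed earlier may build their trees differently. The function name upgma_join_order and the distances in the usage example are hypothetical.

```python
def upgma_join_order(names, dist):
    """Return the list of merges (pairs of clusters) in UPGMA order.

    names: sequence names; dist: symmetric matrix of pairwise distances.
    Each cluster is represented as a tuple of sequence names.
    """
    clusters = {i: (name,) for i, name in enumerate(names)}
    # Distances between current clusters, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)              # closest pair of clusters
        merges.append((clusters[i], clusters[j]))
        size_i, size_j = len(clusters[i]), len(clusters[j])
        new_distances = {}
        for k in clusters:
            if k in (i, j):
                continue
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Average linkage: size-weighted mean of the two old distances
            new_distances[k] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        merged = clusters[i] + clusters[j]
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        del clusters[i], clusters[j]
        for k, value in new_distances.items():
            d[(k, next_id)] = value           # k is always smaller than next_id
        clusters[next_id] = merged
        next_id += 1
    return merges


if __name__ == "__main__":
    # Hypothetical distances, for illustration only
    names = ["A", "B", "C", "D"]
    dist = [[0, 2, 6, 10],
            [2, 0, 6, 10],
            [6, 6, 0, 10],
            [10, 10, 10, 0]]
    for left, right in upgma_join_order(names, dist):
        print(left, "+", right)
```

A progressive aligner would align each pair of groups in this order and never revisit an earlier decision, so an error made in an early join is carried through to the final alignment.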
Selecting sequences from a BLAST output
A common mistake
Sequences too closely related
Identical sequences bring no information
Multiple sequence alignments thrive on diversity
PRVA_MACFU  SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN  SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP  SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE  SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT    SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT  AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
            :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU  DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN  DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP  DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE  DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT    DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT  EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
            :*** ******.******.**** *:************.:******:**
Respect information!
PRVA_MACFU  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP  ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE  ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT    ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT  ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
                                                       :  :*.    .*::::

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN  VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP  IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE  IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT    IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT  IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
            :. .    *  .*..:*: *:        * *. :::..:*:::**: .*:*: :**  :

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_HUMAN  LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_GERSP  LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
PRVA_MOUSE  LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RAT    LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RABIT  LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
            *:   . .. :: .: : *:  ***:.**:*. :**  ::
-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
-A better spread of the sequences is needed.
Selecting diverse sequences
PRVB_CYPCA  -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO  -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA  MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH  -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES  -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU  -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU  --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
                 :    *:  .: . .* .:*. * **   *:   *  :    *  :* * **:**

PRVB_CYPCA  EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO  EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA  VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH  DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES  QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU  EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU  EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
             :**  .*:.*   .* *:  **  ::  .* **** **::**  **
-A REASONABLE model now exists.
-Going further: remote homologues.
Aligning remote homologues
PRVA_MACFU  ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU  -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA  ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO  ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA  -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH  ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES  ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT  -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG    -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
                                                          :        ::

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU  LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA  LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH  LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES  LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
            :   .  .:  .. . *:              *  :    *  :* : .*:*: :**  .

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU  LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA  LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH  LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES  LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
            ::     .. ::  :   ::  .* :.** *. :**  ::
Going further…
PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE   SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI
            . : .. . :: . : * :* : .* *. : * .

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE   LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA
            : . :: : :: * :..* :. :** ::
What makes a good alignment…
The more divergent the sequences, the better
The fewer indels, the better
Nice ungapped blocks separated by indels
Different classes of residues within a block:
-Completely conserved
-Size and hydropathy conserved
-Size or hydropathy conserved
The ultimate evaluation is a matter of personal judgment and knowledge
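The first of these residue classes, completely conserved columns, is easy to check by program. The sketch below lists the columns of an alignment where every sequence carries the same residue and no sequence has a gap, which corresponds to the '*' marks under the alignments shown earlier. The function name fully_conserved_columns is hypothetical; the weaker classes (size and/or hydropathy conserved) would need a table of residue groups that is not given here.

```python
def fully_conserved_columns(rows):
    """Return the 0-based indices of columns with one identical residue
    in every row and no gap ('-')."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must be non-empty in number and equal in length")
    return [
        index
        for index, column in enumerate(zip(*rows))   # iterate column by column
        if "-" not in column and len(set(column)) == 1
    ]


if __name__ == "__main__":
    # First 12 columns of the chite/wheat/trybr/mouse alignment
    block = [
        "---ADKPKRPLS",
        "--DPNKPKRAPS",
        "KKDSNAPKRAMT",
        "-----KPKRPRS",
    ]
    print(fully_conserved_columns(block))
```

A count of such columns is only a first indication; as the slide says, the final judgment on an alignment remains a matter of knowledge of the family.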
Avoiding pitfalls
Keep a biological perspective
chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse   -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
              ***. ::: .: ..  .    :  . .      *  .  *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
mouse   AKDDRIRYDNEMKSWEEQMAE
        *   : .* . :

DIFFERENT PARAMETERS

chite   AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat   -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr   -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse   ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
        * *** .:: ::... : * . . . : * . *: *

chite   KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat   KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr   RKVYEEMAEKDKERY----K--RE-M-------
mouse   KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----
        : : * : .* :
Do not overtune!!!
DO NOT PLAY WITH PARAMETERS!
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!
chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse   -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
              ***. ::: .: ..  .    :  . .      *  .  *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
mouse   AKDDRIRYDNEMKSWEEQMAE
        *   : .* . :

chite   ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse   -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
        ***. * .: .. . : . . * . *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
mouse   AKDDRIRYDNEMKSWEEQMAE
        *   : .* . :
Choosing the right method
PROBLEM
PROGRAM
ClustalW
ClustalW
MSA
DIALIGN II
DIALIGN II
METHOD
Source: BAliBASE
Thompson et al., NAR, 1999
Conclusion
The best alignment method:
-Your brain
-The right data
The best evaluation method:
-Your eyes
-Experimental information (SwissProt)
What can I conclude?
-Homology -> information extrapolation
How can I go further?
-Patterns
-Profiles
-HMMs
-…