Multiple Sequence Alignment

Pairwise sequence alignment is the most fundamental operation of bioinformatics

• It is used to decide if two proteins (or genes) are related structurally or functionally

• It is used to identify domains or motifs that are shared between proteins

• It is the basis of BLAST searching

• It is used in the analysis of genomes

Pairwise alignment: protein sequences can be more informative than DNA

• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties

• codons are degenerate: changes in the third position often do not alter the amino acid that is specified

• protein sequences offer a longer “look-back” time

• DNA sequences can be translated into protein, and then used in pairwise alignments

• DNA can be translated into six potential proteins

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Top strand, three reading frames: 5’ CAT CAA, 5’ ATC AAC, 5’ TCA ACT

Bottom strand, three reading frames: 5’ GTG GGT, 5’ TGG GTA, 5’ GGG TAG
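The six reading frames of the sequence on this slide can be reproduced with a short sketch. It assumes the standard genetic code and an uppercase A/C/G/T input; the function names are illustrative and do not come from the slides.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Three frames of the given strand, then three of its reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand, offset) for strand in strands for offset in range(3)]


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for number, protein in enumerate(six_frames(seq), start=1):
    print(number, protein)
```

Each translated frame can then be used as a protein query in a pairwise alignment.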

Definitions

Homology: Similarity attributed to descent from a common ancestor.

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K ++ + + + GTW++ MA + L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Definitions: two types of homology

Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralogs: Homologous sequences within a single species that arose by gene duplication.

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.

[Figure: tree of RBP orthologs. Labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes.]

Paralogs: members of a gene (protein) family within a species

[Figure: tree of lipocalin paralogs. Labels: apolipoprotein D, retinol-binding protein 4, complement component 8, prostaglandin D2 synthase, neutrophil gelatinase-associated lipocalin, lipocalin 1, odorant-binding protein 2A, progestagen-associated endometrial protein, alpha-1 microglobulin/bikunin. Scale bar: 10 changes.]

Pairwise alignment of retinol-binding protein and β-lactoglobulin

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
    . ||| | . |. . . | : .||||.:| :
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
    : | | | | :: | .| . || |: || |.
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
    || ||. | :.|||| | . .|
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
    . | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

retinol-binding protein (NP_006735)

β-lactoglobulin (P02754)

Definitions

Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

Identity: The extent to which two sequences are invariant.

Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
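As a rough illustration of these definitions, identity and similarity can be counted column by column over two aligned rows. This is only a sketch: the residue groups used to decide what counts as a conservative change are a simplified assumption, not taken from the slides, and real tools use a substitution matrix instead.

```python
# Hypothetical grouping of amino acids with related physico-chemical properties.
CONSERVATIVE_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def identity_and_similarity(row_a, row_b, gap="-"):
    """Return (identity, similarity) as fractions of the gap-free columns.

    Similarity is identity plus conservation, following the definitions above.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = conserved = compared = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # columns containing a gap are not compared
        compared += 1
        if x == y:
            identical += 1
        elif any(x in group and y in group for group in CONSERVATIVE_GROUPS):
            conserved += 1
    return identical / compared, (identical + conserved) / compared


identity, similarity = identity_and_similarity("KD-EGTW", "RDAEGTW")
print(f"identity={identity:.2f} similarity={similarity:.2f}")
```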

Reading the pairwise alignment of retinol-binding protein and β-lactoglobulin:

• Identity: bar (|)

• Very similar: two dots (:)

• Somewhat similar: one dot (.)

• Internal gap: run of dots inside a sequence

• Terminal gap: run of dots at the end of a sequence


Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenase ‘orthologs’

fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Multiple sequence alignment of human lipocalin ‘paralogs’

~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD... complement component 8

Calculation of an alignment score
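One common way to score an existing pairwise alignment is to add a substitution score for every aligned pair and a penalty for every gap. The sketch below assumes an affine gap model in which a gap of length n costs gap_open + (n - 1) * gap_extend; the function name, the toy match/mismatch scores, and the penalty values are illustrative assumptions, not values from the slides.

```python
def alignment_score(row_a, row_b, substitution, gap_open=-2, gap_extend=-1):
    """Score two aligned rows of equal length.

    `substitution(x, y)` returns the score for aligning residues x and y,
    e.g. a lookup in a PAM or BLOSUM matrix.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap = None  # which row carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no information
        gap_row = "a" if x == "-" else "b" if y == "-" else None
        if gap_row is None:
            total += substitution(x, y)
        elif gap_row == previous_gap:
            total += gap_extend  # the gap continues
        else:
            total += gap_open  # a new gap starts
        previous_gap = gap_row
    return total


def toy_scores(x, y):
    """Toy substitution scores: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


print(alignment_score("GATTACA", "GA--ACA", toy_scores))
```

Replacing `toy_scores` with a lookup in a real substitution matrix changes the numbers but not the procedure.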

PAM matrices: Point-accepted mutations

PAM matrices are based on global alignments of closely related proteins.

The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.

Other PAM matrices are extrapolated from PAM1.

All the PAM data come from closely related proteins (>85% amino acid identity).
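The extrapolation step can be pictured as repeated matrix multiplication: a PAM-n mutation probability matrix is PAM1 multiplied by itself n times. The sketch below uses a made-up 3-letter alphabet and invented probabilities purely for illustration; a real PAM1 matrix is 20 x 20, and the scoring matrices used for alignment are log-odds values derived from these probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, exponent):
    """Raise a square matrix to a non-negative integer power by squaring."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = matrix
    while exponent:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


# Hypothetical "PAM1" over three residues: 99% chance of no change per step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

After 250 steps the probability of seeing the original residue has dropped from 0.99 to roughly one third in this toy model, which is why a PAM250-style matrix is far more tolerant of mismatches.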

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!

Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

A PAM250 matrix is very tolerant of mismatches.

hsrbp, 136 CRLLNLDGTC
btlact,  3 CLLLALALTC
           * ** *  **

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp,  26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                 *     ****  *        * *   *   *               **  *

hsrbp,  86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI
             **        *  ** **

BLOSUM Matrices

BLOSUM matrices are based on local alignments.

BLOSUM stands for blocks substitution matrix.

BLOSUM62 is a matrix calculated from blocks in which sequences sharing at least 62% identity are clustered together, so it reflects comparisons of sequences with less than 62% identity.

Rat versus mouse RBP

Rat versus bacterial lipocalin

PAM matrices reflect different degrees of divergence

PAM250

Ancestral sequence: ACCCTAC

Ancestral   Sequence 1       Type of change               Sequence 2
A           A                no change                    A
C           C                single substitution          C --> A
C           C                multiple substitutions       C --> A --> T
C           C --> G          coincidental substitutions   C --> A
T           T --> A          parallel substitutions       T --> A
A           A --> C --> T    convergent substitutions     A --> T
C           C                back substitution            C --> T --> C

Li (1997) p.70

                            Sequences reported as related   Sequences reported as unrelated
homologous sequences        True positives                  False negatives
non-homologous sequences    False positives                 True negatives

Sensitivity: ability to find true positives

Specificity: ability to minimize false positives
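In numbers, these two notions are usually computed from the four cells of the table using the standard definitions. The counts in the example are invented for illustration only.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of truly homologous sequences that were reported as related."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-homologous sequences that were reported as unrelated."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical search: 100 homologs and 1000 non-homologs in the database.
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=950, false_positives=50))
```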

Outline

-Why Do We Need Multiple Sequence Alignment ?

-The progressive Alignment Algorithm

-A possible Strategy…

-Potential Difficulties

Pre-requisite

-How Do Sequences Evolve?

-How can We COMPARE Sequences ?

-How can We ALIGN Sequences ?

What is A Multiple Sequence Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * : .* . :

Structural Criteria: Residues are arranged so that those playing a similar role end up in the same column.

Evolution Criteria: Residues are arranged so that those having the same ancestor end up in the same column.
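Because every column is meant to hold equivalent residues, the fully conserved columns (the ones marked `*` under the alignment) can be recomputed directly from the rows. A minimal sketch using the first block of the alignment on this slide; the function name is illustrative.

```python
def conserved_columns(rows):
    """Return (1-based column, residue) for gap-free columns where all rows agree."""
    hits = []
    for index, column in enumerate(zip(*rows), start=1):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            hits.append((index, column[0]))
    return hits


alignment = {
    "chite": "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",
    "wheat": "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",
    "mouse": "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",
}

for column, residue in conserved_columns(list(alignment.values())):
    print(column, residue)
```

For this block the invariant columns are the P, K, R run at columns 7 to 9 plus a K, a W, and an L further along, six columns in total.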

Phylogenetic Relation

Functional Relation

How Can I Use A Multiple Sequence Alignment?

Extrapolation Beyond The Twilight Zone

The same alignment, with the last row relabeled "unknown": SwissProt sequences (chite, wheat, trybr) aligned with an unknown sequence.

Homology?

Less than 30% identity, BUT conserved where it MATTERS

How Can I Use A Multiple Sequence Alignment?

Extrapolation

Prosite Patterns: P-K-R-[PA]-x(1)-[ST]…
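A pattern written in this style can be turned into a regular expression and searched against sequences. The sketch below handles only the basic PROSITE elements (single residues, `x`, `[..]` alternatives, `{..}` exclusions, and `(n)` or `(n,m)` repeats) and uses only the visible part of the pattern, since the slide truncates it; the helper name is hypothetical.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = re.compile(prosite_to_regex("P-K-R-[PA]-x(1)-[ST]"))
sequences = {
    "chite": "ADKPKRPLSAYMLWLNSARESIKRENPDFKVTEVAKKGGELWRGLKD",
    "wheat": "DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRSKHSDLSIVEMSKAAGAAWKELGP",
    "mouse": "KPKRPRSAYNIYVSESFQEAKDDSAQGKLKLVNEAWKNLSP",
}
for name, sequence in sequences.items():
    hit = regex.search(sequence)
    print(name, hit.group() if hit else None)
```

The sequences here are the first alignment block with gap characters removed, because a pattern is searched against unaligned sequences.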

How Can I Use A Multiple Sequence Alignment?

Extrapolation

Prosite Patterns

Profiles And HMMs

-More Sensitive

-More Specific

[Figure: profile column over amino acids (A F D E F G H Q I V L W), annotated "L? K>R"]
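The simplest kind of profile is a table of residue frequencies per alignment column. The sketch below is an assumption-laden simplification: real profiles and HMMs also use sequence weighting, pseudocounts, and background frequencies, none of which are modeled here. The input rows are the six-residue motif region from the alignment used on these slides.

```python
from collections import Counter


def column_profile(rows):
    """Return one {residue: frequency} dict per column, ignoring gaps."""
    profile = []
    for column in zip(*rows):
        residues = [residue for residue in column if residue != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


motif_rows = ["PKRPLS", "PKRAPS", "PKRAMT", "PKRPRS"]  # chite, wheat, trybr, mouse
for position, frequencies in enumerate(column_profile(motif_rows), start=1):
    print(position, frequencies)
```

Where a pattern only records that position 4 is P or A, the profile also records how often each one occurs, which is one reason profiles can be more sensitive and more specific.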

How Can I Use A Multiple Sequence Alignment?

Extrapolation

Motifs/Patterns

Profiles

Phylogeny

-Evolution

-Paralogy/Orthology

[Figure: tree relating chite, wheat, trybr and mouse]

How Can I Use A Multiple Sequence Alignment?

Extrapolation

Motifs/Patterns

Phylogeny

Profiles

Struc. Prediction

Column Constraint

Evolution Constraint

Structure Constraint

How Can I Use A Multiple Sequence Alignment?

Struc. Prediction

PsiPred or PhD for secondary structure prediction: 75% accurate.

Threading: is improving but is not yet as good.

How Can I Use A Multiple Sequence Alignment?

Automatic Multiple Sequence Alignment methods are not always perfect…

You know better… With your big BRAIN

Why Is It Difficult To Compute A multiple Sequence Alignment?

A CROSSROAD PROBLEM

BIOLOGY: What is A Good Alignment

COMPUTATION: What is THE Good Alignment

The Biological Problem. How to Evaluate an Alignment

-Substitution Matrix (Blosum)

-An Evaluation Function

-Gap Penalties.

-A nice set of Sequences

Sums of Pairs: a column of five residues (A, A, A, C, C) compared pairwise, Cost=6

-Over-estimation of the Substitutions

-Easy to compute
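The sums-of-pairs figure can be checked with a few lines. The sketch assumes a cost of 1 for every mismatched pair and 0 for a matched pair, which is the simplest scheme that reproduces the Cost=6 shown for the column A, A, A, C, C; the function name is illustrative.

```python
from itertools import combinations


def sum_of_pairs_cost(column, mismatch_cost=1):
    """Sum the mismatch cost over every pair of residues in one column."""
    return sum(mismatch_cost for x, y in combinations(column, 2) if x != y)


print(sum_of_pairs_cost("AAACC"))
```

Three A residues and two C residues give 3 x 2 = 6 mismatched pairs, although a single A to C substitution on a tree could explain the whole column. That is the over-estimation of substitutions noted above.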

The COMPUTATIONAL Problem. Producing the Alignment

-Substitution Matrix (Blosum)

-An Evaluation Function

-Gap Penalties.

-A nice set of Sequences

-An Alignment Algorithm

GLOBAL Alignment

Will It Work?

HOW CAN I ALIGN MANY SEQUENCES

2 Globins => 1 min

3 Globins => 2 hours

4 Globins => 10 days

5 Globins => 3 years

6 Globins => 300 years

7 Globins => 30,000 years (solidified fossil, old stuff)

8 Globins => 3 million years
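The explosion comes from extending pairwise dynamic programming to N sequences: the table gains one dimension per sequence. As a rough sketch, for N sequences of length L the exact method fills (L + 1)^N cells, and each cell looks back at 2^N - 1 neighbouring cells. The length of 150 residues is an assumed round figure for a globin, not a value from the slides, and the timings above are illustrative rather than derived from this count.

```python
def dp_cells(n_sequences, length):
    """Number of cells in the N-dimensional dynamic programming table."""
    return (length + 1) ** n_sequences


def moves_per_cell(n_sequences):
    """Number of predecessor cells examined for each cell."""
    return 2 ** n_sequences - 1


GLOBIN_LENGTH = 150  # assumption for illustration
for n in range(2, 9):
    work = dp_cells(n, GLOBIN_LENGTH) * moves_per_cell(n)
    print(f"{n} sequences: about {work:.1e} cell updates")
```

Each added sequence multiplies the work by a factor of a few hundred, which is why exact methods stop being practical after a handful of sequences.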

The Progressive Multiple Alignment Algorithm (Clustal W)

Making An Alignment

Any Exact Method would be TOO SLOW

We will use a Heuristic Algorithm.

Progressive Alignment Algorithm is the most Popular

-Fast

-ClustalW

-Greedy Heuristic (No Guarantee).

Progressive Alignment

Feng and Doolittle, 1988

Clustering

Dynamic Programming Using A Substitution Matrix

Progressive Alignment
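The pairwise building block of progressive alignment is global dynamic programming (Needleman-Wunsch). The sketch below is a minimal version with a linear gap penalty and a caller-supplied scoring function standing in for the substitution matrix; progressive methods such as ClustalW apply the same idea to whole alignments (profiles) and use more elaborate gap penalties, which are not modeled here.

```python
def needleman_wunsch(a, b, score, gap=-1):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align a[i], b[j]
                table[i - 1][j] + gap,                            # gap in b
                table[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return table[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def simple_score(x, y):
    """Toy substitution scores: +1 match, -1 mismatch (stand-in for a matrix)."""
    return 1 if x == y else -1


print(needleman_wunsch("ACGT", "AGT", simple_score))
```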

Progressive Alignment

-Depends on the ORDER of the sequences (Tree).

-Depends on the CHOICE of the sequences.

-Depends on the PARAMETERS:

•Substitution Matrix.

•Penalties (Gop, Gep).

•Sequence Weight.

•Tree making Algorithm.

Progressive Alignment: When Does It Work

Works Well When Phylogeny is Dense

No outlier Sequence.

Image: River Crossing

Progressive Alignment: When Doesn’t It Work

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST ---- CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CORRECT (Score=24)

Sequences:

GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT
GARFIELD THE VERY FAST CAT
THE FAT CAT

Pairwise alignments:

GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT ---

GARFIELD THE VERY FAT CAT
-------- THE ---- FAT CAT

Final alignment:

GARFIELD THE LAST FA-T CAT
GARFIELD THE FAST CA-T ---
GARFIELD THE VERY FAST CAT
-------- THE ---- FA-T CAT

Building the Right Multiple Sequence Alignment.

Recognizing The Right Sequences When you Meet Them…

Gathering Sequences: BLAST

Common Mistake: Sequences Too Closely Related

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

-IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE SEQUENCE ALIGNMENT

-MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…
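One simple way to act on this advice is to measure pairwise identity and drop sequences that are nearly identical to one already kept. The sketch below works on rows that are already aligned; the 90% cutoff and the function names are assumptions chosen for illustration, not recommendations from the slides.

```python
def percent_identity(row_a, row_b):
    """Percent identity over columns where neither aligned row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def drop_redundant(aligned, max_identity=90.0):
    """Greedily keep rows that are at most `max_identity` percent identical
    to every row kept so far. `aligned` maps sequence name to aligned row."""
    kept = {}
    for name, row in aligned.items():
        if all(percent_identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


example = {
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSA",
    "PRVB_CYPCA": "AFAGVLNDADIAAALEACKA",
}
print(list(drop_redundant(example)))
```

The example rows are short excerpts used only to exercise the function; the result depends on input order because the filter is greedy.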

Selecting Diverse Sequences (Opus II)

Respect Information!

This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.

PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : :*. .*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

-A better Spread of the Sequences is needed

Selecting Diverse Sequences (Opus II)

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
           : *: .: . .* .:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
           :** .*:.* .* *: ** :: .* **** **::** **

-A REASONABLE Model Now Exists.

-Going Further: Remote Homologues.

Aligning Remote Homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
           : . .: .. . *: * : * :* : .*:*: :** .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
           :: .. :: : :: .* :.** *. :** ::

Some Guidelines…

Do Not Use Too Many Sequences…

Reading Your Alignment

WHAT MAKES A GOOD ALIGNMENT…

-THE MORE DIVERGENT THE SEQUENCES, THE BETTER

-THE FEWER INDELS, THE BETTER

-NICE UNGAPPED BLOCKS SEPARATED WITH INDELS

-DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:

•Completely Conserved
•Conserved For Size and Hydropathy
•Conserved For Size or Hydropathy

-THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT AND KNOWLEDGE.
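
Some of the criteria above can be counted mechanically before applying judgement. The following is a minimal sketch, assuming the alignment is given as equal-length strings with "-" for gaps; the function name and the returned keys are hypothetical. It only detects completely conserved columns, since classifying residues by size or hydropathy would need a residue grouping that the slides do not specify.

```python
def summarize_alignment(rows, gap="-"):
    """Count conserved columns, gapped columns and ungapped blocks."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    conserved = 0
    gapped = 0
    blocks = 0
    in_block = False
    for column in zip(*rows):
        if gap in column:
            gapped += 1
            in_block = False  # a gapped column ends the current block
            continue
        if not in_block:
            blocks += 1  # first ungapped column of a new block
            in_block = True
        if len(set(column)) == 1:
            conserved += 1  # every sequence has the same residue
    return {
        "columns": len(rows[0]),
        "conserved": conserved,
        "gapped": gapped,
        "ungapped_blocks": blocks,
    }


if __name__ == "__main__":
    toy = ["ACD-EF", "ACE-EF", "AC-GEF"]
    print(summarize_alignment(toy))
```

A small number of ungapped blocks with few gapped columns between them matches the "nice ungapped blocks separated with indels" pattern described above.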

Potential Difficulties

DO NOT OVERTUNE!!!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :

DO NOT PLAY WITH PARAMETERS IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. :*: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :

TUNING or NOT TUNING!!!

-MOST METHODS ARE TUNED FOR WORKING WELL ON AVERAGE

-PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE THEORY (e.g. SUBSTITUTION MATRICES).

-A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. IT CHANGES LITTLE WHEN THE PARAMETERS CHANGE).

-TUNE IF YOU WANT TO CONVINCE YOURSELF.

-PARAMETERS TO TUNE USUALLY INCLUDE:
•GOP / GEP (GAP OPENING PENALTY / GAP EXTENSION PENALTY)
•MATRIX
•SENSITIVITY VS SPEED

GOP

GEP
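
To make the roles of GOP and GEP concrete, here is a small sketch of an affine gap penalty. It assumes the convention that a gap of length n costs GOP + n × GEP (some programs use GOP + (n − 1) × GEP instead), and the default values 10.0 and 0.5 are purely illustrative. Terminal gaps are penalized like internal ones in this sketch, which many aligners do not do.

```python
def affine_gap_penalty(length, gop=10.0, gep=0.5):
    """Cost of one gap of the given length: opening plus per-position extension."""
    if length <= 0:
        return 0.0
    return gop + gep * length


def total_gap_penalty(row, gop=10.0, gep=0.5, gap="-"):
    """Sum the affine penalty over every run of gap characters in one row."""
    total = 0.0
    run = 0
    for ch in row + "X":  # sentinel residue flushes a trailing gap run
        if ch == gap:
            run += 1
        else:
            total += affine_gap_penalty(run, gop, gep)
            run = 0
    return total


if __name__ == "__main__":
    print(total_gap_penalty("AC----GT"))  # one indel of length 4
    print(total_gap_penalty("A-C-G-T-"))  # four indels of length 1
```

With these values, one long indel is much cheaper than the same number of gap positions scattered along the row. Raising GOP pushes the aligner toward fewer, longer indels; raising GEP shortens them.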

Substitution Matrices (Etzold et al. 1993)

Gonnet 61.7 %
Blosum50 59.7 %
Pam250 59.2 %

KEEP A BIOLOGICAL PERSPECTIVE

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
* *** .:: ::... : * . . . : * . *: *

DIFFERENT PARAMETERS

WRONG ALIGNMENT !!!
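
How much two alignments of the same sequences differ can be quantified by the fraction of aligned residue pairs they share. The sketch below is one simple way to do that, in the spirit of a sum-of-pairs comparison; the function names are hypothetical, and it assumes both alignments contain the same sequences under the same names, with rows of equal length inside each alignment.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs ((name, index), (name, index)) sharing a column."""
    names = sorted(alignment)
    counters = {name: 0 for name in names}  # next residue index per sequence
    pairs = set()
    length = len(alignment[names[0]])
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, counters[name]))
                counters[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def shared_pair_fraction(reference, other):
    """Fraction of the reference's aligned pairs that also occur in `other`."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(other)) / len(ref_pairs)


if __name__ == "__main__":
    run_1 = {"s1": "AC-GT", "s2": "ACAGT"}
    run_2 = {"s1": "ACG-T", "s2": "ACAGT"}
    print(shared_pair_fraction(run_1, run_2))
```

A value close to 1 across parameter settings suggests a robust alignment; a value that drops sharply, as it would between the two blocks above, is a warning that the result depends on the parameters more than on the data.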

REPEATS

THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE SAME NUMBER OF REPEATS

IT IS THEN BETTER TO MANUALLY EXTRACT THE REPEATS AND TO ALIGN THEM. INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER
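
The idea behind a dot plot can be sketched in a few lines. This is not Dotter itself, only a minimal exact-match illustration with a hypothetical window size; real tools score windows with a substitution matrix rather than requiring identical words. Comparing a sequence against itself, repeats show up as populated diagonals away from the main one.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Positions (i, j) where a window of seq_a exactly equals a window of seq_b."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        word = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            if seq_b[j:j + window] == word:
                hits.append((i, j))
    return hits


def diagonals(hits):
    """Group hits by diagonal offset (j - i)."""
    by_offset = {}
    for i, j in hits:
        by_offset.setdefault(j - i, []).append((i, j))
    return by_offset


if __name__ == "__main__":
    seq = "ABCDEABCDE"  # toy sequence containing two copies of a repeat
    for offset, cells in sorted(diagonals(dot_plot(seq, seq)).items()):
        print(offset, len(cells))
```

The offset of an off-diagonal line gives the distance between repeat copies, which is the information needed to cut the repeats out and align them separately.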

Naming Your Sequences The Right Way

Choosing the right Method

Simultaneous Alignments : MSA

1) Set Bounds on each pair of sequences (Carrillo and Lipman)

2) Compute the multiple alignment within the bounded Hyperspace

-Few Small Closely Related Sequences.

-Do Well When They Can Run.

-Memory and CPU hungry
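
The memory and CPU cost comes from the size of the dynamic-programming lattice, which has one axis per sequence. A quick sketch of how fast the unbounded lattice grows (the function name is hypothetical; the Carrillo and Lipman bounds exist precisely to avoid visiting most of these cells):

```python
from math import prod


def lattice_cells(lengths):
    """Number of cells in the full N-dimensional alignment lattice."""
    return prod(n + 1 for n in lengths)


if __name__ == "__main__":
    for count in (2, 3, 4, 5):
        print(count, "sequences of length 100:", lattice_cells([100] * count))
```

Each added sequence multiplies the lattice by roughly its own length, which is why simultaneous alignment is limited to a few short, closely related sequences.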

Dialign II

1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each Segment Pair.

2) Re-evaluate each segment pair according to its consistency with the others.

3) Assemble the alignment according to the segment pairs.

Dialign II

-May Align Too Few Residues

-No Gap Penalty

-Does Well with ESTs

Iterative Methods

-HMMs, HMMER, SAM.

-Slow, Sometimes Inaccurate

-Good Profile Generators

Mixing Local and Global Alignments

Local Alignment Global Alignment

Extension

Multiple Sequence Alignment

What Is BaliBase

PROBLEM / Description

-Even Phylogenetic Spread

-One Outlier Sequence

-Two Distantly Related Groups

-Long Internal Indel

-Long Terminal Indel

Source: BaliBase, Thompson et al., NAR, 1999

BaliBase: Which Method?

PROBLEM

Source: BaliBase, Thompson et al., NAR, 1999

Strategy

Strategy

ClustalW, T-Coffee, MSA, DCA

PrrP, T-Coffee

Dialign

T-Coffee

T-Coffee

Dialign

T-Coffee

Methods / Situations

1-Carrillo and Lipman:
-MSA, DCA.
-Few Small Closely Related Sequences.
-Do Well When They Can Run.

2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues
-Good For Long Indels

3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators

4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive

Conclusion

Multiple Alignment

-The BEST Alignment Method: Your Brain and The Right Data

-The Best Evaluation Procedure: Experimental Data (SwissProt)

-Choosing The Sequences Well is Important

-Beware of Repeated Elements

Editing Multiple Alignments

There are a variety of tools that can be used to modify a multiple alignment.

These programs can be very useful in formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

BioEdit

Editors on the Web

Check out CINEMA (Colour INteractive Editor for Multiple Alignments)

It is an editor created completely in JAVA (old browsers beware)

It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module

http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.1

Addresses

Some URLs

EMBL-EBI

http://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple Alignment

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)

http://www.ibc.wustl.edu/service/msa/