Multiple sequence alignment

Why?

• It is the most important means to assess relatedness of a set of sequences

• Gain information about the structure/function of a query sequence (conservation patterns)

• Construct a phylogenetic tree

• Putting together a set of sequenced fragments (fragment assembly)

• Recognise alternative splice sites

• Many bioinformatics methods depend on it (secondary/tertiary structure)

Multiple sequence alignment (MSA) of 12 * Flavodoxin + cheY

Pairwise alignment

Now we know how to do it: how do we get a multiple alignment (three or more sequences)?

Multiple alignment: much greater combinatorial explosion than with pairwise alignment.

Multi-dimensional dynamic programming (Murata et al. 1985)

Simultaneous multiple alignment: multi-dimensional dynamic programming

• MSA (Lipman et al., 1989, PNAS 86, 4412): extremely slow and memory intensive; up to 8-9 sequences of ~250 residues

• DCA (Stoye et al., 1997, CABIOS 13, 625): still very slow
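To make the combinatorial explosion concrete, here is a minimal Python sketch. It assumes the plain, unpruned N-dimensional dynamic programming formulation, in which N sequences of length L need a lattice of (L+1)^N cells and every cell has 2^N - 1 predecessor cells. The function name `dp_cost` is illustrative and is not taken from MSA or DCA.

```python
def dp_cost(n_seqs: int, length: int) -> tuple[int, int]:
    """Return (cells, predecessors_per_cell) for full N-dimensional DP."""
    cells = (length + 1) ** n_seqs   # one lattice axis per sequence
    preds = 2 ** n_seqs - 1          # each sequence advances or gaps, not all gaps
    return cells, preds


for n in (2, 3, 8):
    cells, preds = dp_cost(n, 250)
    print(f"{n} sequences: {cells:.3e} cells, {preds} predecessors per cell")
```

Going from two to eight sequences of ~250 residues multiplies the lattice size by many orders of magnitude, which is why the simultaneous methods above stop at a handful of sequences.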

Alternative multiple alignment methods

• Biopat (Hogeweg & Hesper 1984, first method ever)

• MULTAL (Taylor 1987)

• DIALIGN (Morgenstern 1996)

• PRRP (Gotoh 1996)

• Clustal (Thompson, Higgins & Gibson 1994)

• Praline (Heringa 1999)

• T-Coffee (Notredame, Higgins & Heringa 2000)

• HMMER (Eddy 1998) [Hidden Markov Model]

• SAGA (Notredame & Higgins 1996) [Genetic algorithm]

Progressive multiple alignment general principles

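[Figure: pairwise scores for sequences 1-5 (Score 1-2, Score 1-3, ..., Score 4-5) fill a 5×5 similarity matrix; the scores are converted to distances, from which a guide tree is built; the multiple alignment follows the guide tree. Iteration possibilities are indicated.]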

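General progressive multiple alignment technique (follow generated tree)

[Figure: guide tree with root and branch length d; sequences 1 and 3 are aligned first, then 2 and 5, and the resulting blocks and sequence 4 are merged while following the tree towards the root.]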
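The scores-to-distances-to-tree step can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: it assumes similarities in [0, 1], converts them with d = 1 - s, and uses average-linkage (UPGMA-style) clustering as a simple stand-in for the tree method. The function name and the example scores are made up.

```python
from itertools import combinations


def guide_tree(names, sim):
    """Build a guide tree by average-linkage clustering.

    names: list of sequence labels
    sim:   dict mapping frozenset({a, b}) -> similarity in [0, 1]
    Returns a nested tuple; inner tuples are aligned before outer ones.
    """
    # Scores to distances (assumed conversion: d = 1 - s).
    dist = {pair: 1.0 - s for pair, s in sim.items()}
    # Map each current subtree to the labels it contains.
    clusters = {name: [name] for name in names}

    def avg_dist(t1, t2):
        pairs = [(a, b) for a in clusters[t1] for b in clusters[t2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest subtrees.
        t1, t2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))


names = ["1", "2", "3", "4", "5"]
sim = {frozenset(p): 0.3 for p in combinations(names, 2)}
sim.update({frozenset("12"): 0.9, frozenset("13"): 0.8,
            frozenset("23"): 0.75, frozenset("45"): 0.85})
print(guide_tree(names, sim))
```

A progressive aligner would then walk this nested tuple from the innermost pairs outwards, aligning sequences or blocks at each join.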

Progressive multiple alignment

Problem: accuracy is very important. Errors are propagated into the progressive steps.

“Once a gap, always a gap” (Feng & Doolittle, 1987)

Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)

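Multiple alignment profiles (Gribskov et al. 1987)

[Figure: a profile with one row per amino acid (A, C, D, ..., W, Y) and additional rows for gap penalties. Alignment position i shows the values 0.3, 0.1, 0, 0.3, 0.3 for the rows shown, with gap penalties 0.5 and 1.0.]

Position dependent gap penalties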

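Profile-sequence alignment

[Figure: a sequence aligned against a profile (rows A, C, D, ..., V, W, Y)]

Profile-profile alignment

[Figure: a profile (rows A, C, D, ..., Y) aligned against a second profile (rows A, C, D, ..., V, W, Y)]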
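A minimal sketch of how a profile can be derived from an alignment block and used to score a residue. The assumptions here: gaps are ignored when counting frequencies, a column score is the frequency-weighted sum of exchange values, and the `identity` exchange function is a toy stand-in for a real substitution matrix. All function names are illustrative.

```python
from collections import Counter


def build_profile(alignment):
    """Column-wise residue frequencies of a gapped alignment.

    alignment: list of equal-length strings, '-' marks a gap.
    Returns one {residue: frequency} dict per column.
    """
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: c / total for r, c in counts.items()} if total else {})
    return profile


def profile_column_score(profile_col, residue, exchange):
    """Frequency-weighted sum of exchange values for one residue."""
    return sum(freq * exchange(a, residue) for a, freq in profile_col.items())


def identity(a, b):
    """Toy exchange function: 1 for a match, 0 otherwise."""
    return 1.0 if a == b else 0.0


profile = build_profile(["ACD", "ACE", "A-D"])
print(profile)
print(profile_column_score(profile[2], "D", identity))
```

In profile-sequence alignment these column scores take the place of the residue-residue exchange values in the dynamic programming matrix; position-dependent gap penalties would be stored per column in the same way.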

Clustal, ClustalW, ClustalX

• CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.

• Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.

• Further carefully crafted heuristics include:

– (i) local gap penalties

– (ii) automatic selection of the amino acid substitution matrix

– (iii) automatic gap penalty adjustment

– (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered

• CLUSTAL (W/X) does not allow iteration (cf. Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002)

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

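Pre-profile generation

[Figure: pairwise scores for sequences 1-5 (Score 1-2, Score 1-3, ..., Score 4-5) are used, subject to a cut-off, to build a pre-alignment for each sequence; each pre-alignment is converted into a pre-profile (rows A, C, D, ..., Y).]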

Pre-profile alignment

[Figure: the pre-profiles of sequences 1-5 (each with rows A, C, D, ..., Y, and each carrying information from the other sequences in its pre-alignment) are aligned to give the final alignment of sequences 1-5.]
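One way to read the pre-profile generation step is sketched below: for each sequence, the other sequences whose pairwise score with it reaches the cut-off are collected into its pre-alignment. This is a sketch under that assumption only; the scores, the cut-off value and the function name are invented for illustration.

```python
def pre_alignment_members(names, score, cutoff):
    """For each master sequence, list the other sequences whose pairwise
    score with it is >= cutoff, best-scoring first.

    score: dict mapping frozenset({a, b}) -> pairwise alignment score
    """
    members = {}
    for master in names:
        others = [n for n in names
                  if n != master and score[frozenset((master, n))] >= cutoff]
        others.sort(key=lambda n: score[frozenset((master, n))], reverse=True)
        members[master] = others
    return members


names = ["1", "2", "3"]
score = {frozenset("12"): 50, frozenset("13"): 20, frozenset("23"): 40}
print(pre_alignment_members(names, score, cutoff=30))
```

Each master sequence together with its selected members would then be turned into a pre-profile, and the pre-profiles rather than the raw sequences are aligned progressively.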

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

QUATERNARY STRUCTURE (oligomers)

Protein structure hierarchical levels

One of the dogmas of molecular biology:

“Structure is more conserved than sequence”

Secondary structure-induced alignment

Using secondary structure for alignment

[Figure: dynamic programming search matrix for two sequences annotated with secondary structure (H, E, C):

MDAGSTVILCFV
HHHCCCEEEEEE

MDAASTILCGS
HHHHCCEEECC

Amino acid exchange weights matrices: H, E, C and Default.]
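A minimal sketch of filling the search matrix with secondary-structure-dependent exchange weights. The selection rule is an assumption read from the figure labels: when both residues are in the same state (H, E or C) the matrix for that state is used, otherwise the default matrix. The toy matrices and function names are illustrative only; real matrices would be 20×20 exchange tables.

```python
def toy_matrix(match_value):
    """Toy exchange 'matrix': match_value on identity, 0 otherwise."""
    return lambda a, b: match_value if a == b else 0


def pick_matrix(ss_a, ss_b, matrices):
    """Choose the exchange matrix for a pair of secondary structure states."""
    if ss_a == ss_b and ss_a in matrices:
        return matrices[ss_a]
    return matrices["default"]


def search_matrix(seq_a, ss_a, seq_b, ss_b, matrices):
    """Fill the DP search matrix cell by cell with state-dependent weights."""
    return [[pick_matrix(sa, sb, matrices)(ra, rb)
             for rb, sb in zip(seq_b, ss_b)]
            for ra, sa in zip(seq_a, ss_a)]


matrices = {"H": toy_matrix(3), "E": toy_matrix(2),
            "C": toy_matrix(4), "default": toy_matrix(1)}
m = search_matrix("MDAGSTVILCFV", "HHHCCCEEEEEE",
                  "MDAASTILCGS", "HHHHCCEEECC", matrices)
for row in m:
    print(row)
```

The filled matrix is then passed to the usual dynamic programming routine, so only the cell values change, not the alignment algorithm itself.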

Flavodoxin-cheY: using predicted secondary structure

1fx1       -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
           e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee
FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
           e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee
FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
           e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee
FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
           eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee
FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
           eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee
2fcr       --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
           eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee
FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
           eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee
FLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
           eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee
FLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
           eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee
FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
           eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee
4fxn       ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
           eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee
FLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
           hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee
FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
           eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee
3chy       ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
           tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1       GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
           eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh
FLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
           eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh
FLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
           eee hhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
           hhhhhhhhhhhh eeeee e eee
FLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
           e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh
2fcr       GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
           eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
FLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
           hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh
FLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
           hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh
FLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
           e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
           hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh
4fxn       G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
           e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht
FLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
           hhhhhhhhhhh eeeee eeee h hhhhhhhh
FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
           hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h
3chy       -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
           ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

Globalised local alignment

Double dynamic programming

1. Local (SW) alignment (M + Po,e)

2. Global (NW) alignment (no M or Po,e)

(M = exchange matrix; Po, Pe = gap opening and extension penalties)

[Figure example: M = BLOSUM62, Po = 0, Pe = 0]

Matrix extension

T-Coffee

Tree-based Consistency Objective Function For alignmEnt Evaluation

Cedric Notredame, Des Higgins, Jaap Heringa

J. Mol. Biol., 302, 205-217; 2000

Matrix extension – T-Coffee

[Figure: all pairwise alignments of four sequences: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4]

Integrating alignment methods and alignment information with T-Coffee

• Integrating different pair-wise alignment techniques (NW, SW, ..)

• Combining different multiple alignment methods (consensus multiple alignment)

• Combining sequence alignment methods with structural alignment techniques

• Plug in user knowledge

Using different sources of alignment information

[Figure: alignment information from Clustal, Dialign, Lalign, structure alignments and manual input is fed into T-Coffee]

Search matrix extension

T-Coffee

• Combine different alignment techniques by adding scores:

$W(A(x), B(y)) = \sum S(A(x), B(y))$

– A(x) is residue x in sequence A

– the summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))

– S is the sequence identity percentage of the associated alignment

• Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:

$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)), W(I(z), B(y))\big)$

– the summation is over all third sequences I other than A or B
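The extension formula above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the T-Coffee implementation: the library is assumed to be a dictionary of residue pairs, a residue is a (sequence name, position) tuple, and pairs missing from the library count as weight 0. The weights in the example are invented.

```python
from collections import defaultdict


def extend_library(W):
    """Return the extended weights W' computed from the primary weights W.

    W maps frozenset({(seq, pos), (seq2, pos2)}) -> weight.
    """
    # For each residue, index its weighted partners in other sequences.
    partners = defaultdict(dict)
    for pair, w in W.items():
        r1, r2 = tuple(pair)
        partners[r1][r2] = w
        partners[r2][r1] = w

    extended = dict(W)
    residues = list(partners)
    for i, a in enumerate(residues):
        for b in residues[i + 1:]:
            if a[0] == b[0]:
                continue  # only residue pairs from different sequences
            # Sum min(W(A(x), I(z)), W(I(z), B(y))) over third-sequence residues z.
            extra = sum(min(w_az, partners[b][z])
                        for z, w_az in partners[a].items()
                        if z[0] not in (a[0], b[0]) and z in partners[b])
            if extra:
                key = frozenset((a, b))
                extended[key] = extended.get(key, 0) + extra
    return extended


W = {
    frozenset({("A", 0), ("B", 0)}): 88,
    frozenset({("A", 0), ("C", 0)}): 77,
    frozenset({("C", 0), ("B", 0)}): 100,
    frozenset({("A", 1), ("C", 1)}): 77,
    frozenset({("C", 1), ("B", 2)}): 100,
}
extended = extend_library(W)
print(extended[frozenset({("A", 0), ("B", 0)})])
```

Note that a pair never aligned directly, such as A(1) and B(2) here, still receives weight through the third sequence C, which is how the extension brings information from all sequences into each pairwise comparison.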

Search matrix extension

[Figure: T-Coffee search matrix extension, combining the direct alignment with the alignments through other sequences]

Evaluating multiple alignments

• Conflicting standards of truth

– evolution

– structure

– function

• With orphan sequences no additional information

• Benchmarks depending on reference alignments

• Quality issue of available reference alignment databases

• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)

• “Charlie Chaplin” problem

Evaluating multiple alignments

As a standard of truth, a reference alignment based on structural superpositioning is often taken

Evaluation measures

[Figure: a query alignment is compared with a reference alignment]

• Column score

• Sum-of-Pairs score
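Both measures can be sketched in Python as follows. The definitions assumed here are the commonly used ones: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Function names and the toy alignments are illustrative, and rows must be in the same order in both alignments.

```python
from itertools import combinations


def columns(alignment):
    """Turn gapped rows into columns of (row, residue_index) tuples; gaps dropped."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        members = []
        for row, ch in enumerate(col):
            if ch != "-":
                members.append((row, counters[row]))
                counters[row] += 1
        cols.append(tuple(members))
    return cols


def aligned_pairs(cols):
    """All residue pairs that share a column."""
    return {p for col in cols for p in combinations(col, 2)}


def sum_of_pairs_score(query, reference):
    ref_pairs = aligned_pairs(columns(reference))
    return len(ref_pairs & aligned_pairs(columns(query))) / len(ref_pairs)


def column_score(query, reference):
    # Only reference columns that align at least two residues are counted.
    ref_cols = [c for c in columns(reference) if len(c) > 1]
    query_cols = set(columns(query))
    return sum(c in query_cols for c in ref_cols) / len(ref_cols)


reference = ["AC-D", "ACED", "A-ED"]
query = ["A-CD", "ACED", "A-ED"]
print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

A single misplaced residue costs little in the sum-of-pairs score but invalidates every column it touches, so the column score is the stricter of the two measures.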

Evaluating multiple alignments

[Figure: SP score plotted against BAliBASE alignment size (nseq * len)]

Summary

• Weighting schemes simulating simultaneous multiple alignment

– Profile pre-processing (global/local)

– Matrix extension (well balanced scheme)

• Smoothing alignment signals

– globalised local alignment

• Using additional information

– secondary structure driven alignment

• Schemes strike balance between speed and sensitivity

References

Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.

Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.

Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

Where to find this: http://www.ibivu.cs.vu.nl/teaching
