Multiple sequence alignment Why?


Transcript of Multiple sequence alignment Why?

Page 1: Multiple sequence alignment Why?

Multiple sequence alignment

Why?

• It is the most important means to assess relatedness of a set of sequences

• Gain information about the structure/function of a query sequence (conservation patterns)

• Construct a phylogenetic tree

• Putting together a set of sequenced fragments (fragment assembly)

• Recognise alternative splice sites

• Many bioinformatics methods depend on it (secondary/tertiary structure)

Page 2: Multiple sequence alignment Why?

Multiple sequence alignment (MSA) of 12 * Flavodoxin + cheY

Page 3: Multiple sequence alignment Why?

Pairwise alignment

Now we know how to do it. How do we get a multiple alignment (three or more sequences)?

Multiple alignment: much greater combinatorial explosion than with pairwise alignment.
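To make the combinatorial explosion concrete, the sketch below counts the cells of the full dynamic programming lattice for N sequences of equal length L, and the number of predecessor cells each cell has to inspect. The function names are illustrative, and the counts assume the plain multi-dimensional extension of pairwise dynamic programming without any pruning.

```python
def dp_cells(n_seqs: int, length: int) -> int:
    """Cells in the full lattice: one axis of (length + 1) points per sequence."""
    return (length + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """Every non-empty subset of sequences may advance by one residue,
    so an interior cell has 2^N - 1 predecessor cells."""
    return 2 ** n_seqs - 1


# Pairwise alignment versus three and eight sequences of 250 residues.
for n in (2, 3, 8):
    print(n, dp_cells(n, 250), predecessors_per_cell(n))
```

Going from two to three sequences of 250 residues already multiplies the number of cells by 251, which is why exhaustive methods are limited to a handful of sequences.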

Page 4: Multiple sequence alignment Why?

Multi-dimensional dynamic programming (Murata et al. 1985)

Page 5: Multiple sequence alignment Why?

Simultaneous multiple alignment: multi-dimensional dynamic programming

MSA (Lipman et al., 1989, PNAS 86, 4412)

• extremely slow and memory intensive

• up to 8-9 sequences of ~250 residues

DCA (Stoye et al., 1997, CABIOS 13, 625)

• still very slow

Page 6: Multiple sequence alignment Why?

Alternative multiple alignment methods

• Biopat (Hogeweg & Hesper 1984, first method ever)

• MULTAL (Taylor 1987)

• DIALIGN (Morgenstern 1996)

• PRRP (Gotoh 1996)

• Clustal (Thompson, Higgins & Gibson 1994)

• Praline (Heringa 1999)

• T-Coffee (Notredame, Higgins & Heringa 2000)

• HMMER (Eddy 1998) [Hidden Markov Model]

• SAGA (Notredame & Higgins 1996) [Genetic algorithm]

Page 7: Multiple sequence alignment Why?

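Progressive multiple alignment: general principles

[Figure: sequences 1-5 are compared pairwise (Score 1-2, Score 1-3, …, Score 4-5); the scores fill a 5×5 similarity matrix; scores are converted to distances; the distances give a guide tree; the guide tree gives the multiple alignment. Iteration possibilities are indicated.]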
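The steps from pairwise scores to a guide tree can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity values are invented, distances are taken as 1 - similarity, and the tree is built with average-linkage clustering (UPGMA) because it is short to write down. It is not the tree-building method of any particular program.

```python
from itertools import combinations


def similarities_to_distances(similarity):
    """Turn fractional similarities (0..1) into distances (1 - s)."""
    return {pair: 1.0 - s for pair, s in similarity.items()}


def leaf_distance(dist, a, b):
    """Look up a pairwise distance regardless of key order."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]


def upgma(names, dist):
    """Average-linkage clustering; returns the guide tree as nested tuples."""
    clusters = [(name, [name]) for name in names]  # (subtree, member leaves)
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            total = sum(leaf_distance(dist, a, b)
                        for a in members_i for b in members_j)
            mean = total / (len(members_i) * len(members_j))
            if best is None or mean < best[0]:
                best = (mean, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Invented similarity scores for five sequences (illustration only).
names = ["1", "2", "3", "4", "5"]
similarity = {
    ("1", "2"): 0.40, ("1", "3"): 0.90, ("1", "4"): 0.30, ("1", "5"): 0.35,
    ("2", "3"): 0.45, ("2", "4"): 0.50, ("2", "5"): 0.80,
    ("3", "4"): 0.30, ("3", "5"): 0.40, ("4", "5"): 0.55,
}
tree = upgma(names, similarities_to_distances(similarity))
print(tree)
```

With these values the most similar pair (1, 3) is joined first, then (2, 5), then sequence 4 joins (2, 5), and the two groups meet at the root. A progressive aligner would align in that same order.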

Page 8: Multiple sequence alignment Why?

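General progressive multiple alignment technique (follow generated tree)

[Figure: guide tree with root and distance d; the sequence groups 1-3, 2-5 and 2-5-4 are aligned step by step in the order given by the tree.]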

Page 9: Multiple sequence alignment Why?

Progressive multiple alignment

Problem: accuracy is very important. Errors are propagated into the progressive steps.

“Once a gap, always a gap” (Feng & Doolittle, 1987)

Page 10: Multiple sequence alignment Why?

Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)

Page 11: Multiple sequence alignment Why?

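Multiple alignment profiles (Gribskov et al. 1987)

[Figure: a profile with one row per amino acid (A, C, D, …, W, Y) and one column per alignment position; column i holds the values 0.3, 0.1, 0, 0.3, 0.3, and a separate row holds the gap penalties per position (0.5, 1.0).]

Position dependent gap penalties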
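A profile of this kind can be sketched as a list of per-column residue frequencies. The code below is an illustration under assumptions: it uses plain relative frequencies without sequence weighting or pseudocounts, and the gap-penalty rule (scale a base penalty down in columns that already contain gaps) is a hypothetical choice made only to show what "position dependent" can mean.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment):
    """One dict per alignment column: residue -> relative frequency (gaps ignored)."""
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for col in range(ncols):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        profile.append({aa: counts[aa] / total if total else 0.0
                        for aa in AMINO_ACIDS})
    return profile


def position_gap_penalties(alignment, base=1.0):
    """Hypothetical rule: the penalty shrinks with the fraction of gaps in a column."""
    nseqs = len(alignment)
    return [base * (1.0 - sum(row[col] == "-" for row in alignment) / nseqs)
            for col in range(len(alignment[0]))]


# Toy alignment (invented for illustration).
alignment = ["ACD-Y", "ACDWY", "A-DWY", "CCDWY"]
profile = build_profile(alignment)
penalties = position_gap_penalties(alignment)
print(profile[0]["A"], profile[0]["C"], penalties)
```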

Page 12: Multiple sequence alignment Why?

Profile-sequence alignment

[Figure: a sequence aligned against a profile (rows A, C, D, …, V, W, Y).]
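Aligning a sequence to a profile works like pairwise global alignment, except that a residue is scored against a whole column. The sketch below assumes a simple match/mismatch score averaged over the column frequencies and a linear gap penalty. Both are stand-ins chosen for brevity, not the scoring of any specific program, and the toy profile is invented.

```python
def column_score(column, residue, match=1.0, mismatch=-1.0):
    """Frequency-weighted average score of a residue against a profile column."""
    return sum(freq * (match if aa == residue else mismatch)
               for aa, freq in column.items())


def align_profile_sequence(profile, seq, gap=-1.0):
    """Global alignment; returns (score, [(column index or None, residue or None)])."""
    n, m = len(profile), len(seq)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1]),
                score[i - 1][j] + gap,   # profile column against a gap
                score[i][j - 1] + gap,   # residue inserted relative to the profile
            )
    # Trace back from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1])):
            pairs.append((i - 1, seq[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, seq[j - 1]))
            j -= 1
    return score[n][m], pairs[::-1]


# Toy profile with five fully conserved columns A, C, D, W, Y.
toy_profile = [{"A": 1.0}, {"C": 1.0}, {"D": 1.0}, {"W": 1.0}, {"Y": 1.0}]
best_score, pairing = align_profile_sequence(toy_profile, "ACDY")
print(best_score, pairing)
```

In the toy case the sequence lacks the W, so the W column is paired with a gap and the other four columns are matched.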

Page 13: Multiple sequence alignment Why?

Profile-profile alignment

[Figure: two profiles (rows A, C, D, …, Y and A, C, D, …, V, W, Y) aligned with each other.]

Page 14: Multiple sequence alignment Why?

Clustal, ClustalW, ClustalX

• CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.

• Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.

• Further carefully crafted heuristics include:

(i) local gap penalties

(ii) automatic selection of the amino acid substitution matrix

(iii) automatic gap penalty adjustment

(iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered

• CLUSTAL (W/X) does not allow iteration (cf. Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002)

Page 15: Multiple sequence alignment Why?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Page 16: Multiple sequence alignment Why?

Pre-profile generation

[Figure: pairwise scores (Score 1-2, Score 1-3, …, Score 4-5) and a cut-off determine, for each of the sequences 1-5, a pre-alignment; each pre-alignment is then converted into a pre-profile.]

Page 17: Multiple sequence alignment Why?

Pre-profile alignment

[Figure: the pre-profiles of sequences 1-5 are aligned with each other to give the final alignment of sequences 1-5.]

Page 18: Multiple sequence alignment Why?

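Pre-profile alignment

[Figure: pre-alignments, each listing one key sequence together with other sequences from the set 1-5, are combined into the final alignment of sequences 1-5.]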

Page 19: Multiple sequence alignment Why?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Page 20: Multiple sequence alignment Why?

Protein structure hierarchical levels

PRIMARY STRUCTURE (amino acid sequence)

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

QUATERNARY STRUCTURE (oligomers)

Page 21: Multiple sequence alignment Why?

One of the dogmas of molecular biology

“Structure more conserved than sequence”

Page 22: Multiple sequence alignment Why?

Secondary structure-induced alignment

Page 23: Multiple sequence alignment Why?

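Using secondary structure for alignment

[Figure: dynamic programming search matrix for two sequences with their secondary structure strings (MDAGSTVILCFV / HHHCCCEEEEEE and MDAASTILCGS / HHHHCCEEECC). Cells that pair residues in the same state (H, E or C) use the amino acid exchange weights matrix for that state; all other cells use the default matrix.]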

Page 24: Multiple sequence alignment Why?

Flavodoxin-cheY

Using predicted secondary structure

1fx1       -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
           e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee
FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
           e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee
FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
           e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee
FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
           eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee
FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
           eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee
2fcr       --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
           eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee
FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
           eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee
FLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
           eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee
FLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
           eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee
FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
           eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee
4fxn       ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
           eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee
FLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
           hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee
FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
           eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee
3chy       ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
           tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1       GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
           eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh
FLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
           eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh
FLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
           eee hhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
           hhhhhhhhhhhh eeeee e eee
FLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
           e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh
2fcr       GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
           eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
FLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
           hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh
FLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
           hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh
FLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
           e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
           hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh
4fxn       G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
           e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht
FLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
           hhhhhhhhhhh eeeee eeee h hhhhhhhh
FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
           hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h
3chy       -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
           ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

Page 25: Multiple sequence alignment Why?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Page 26: Multiple sequence alignment Why?

Globalised local alignment

1. Local (SW) alignment, using an exchange matrix M and gap penalties Po (opening) and Pe (extension)

2. Global (NW) alignment (no M or Po, Pe)

Double dynamic programming

Page 27: Multiple sequence alignment Why?

M = BLOSUM62, Po = 0, Pe = 0

Page 28: Multiple sequence alignment Why?

M = BLOSUM62, Po = 12, Pe = 1

Page 29: Multiple sequence alignment Why?

M = BLOSUM62, Po = 60, Pe = 5

Page 30: Multiple sequence alignment Why?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Page 31: Multiple sequence alignment Why?

Matrix extension

T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation

Cedric Notredame, Des Higgins, Jaap Heringa

J. Mol. Biol., 302, 205-217; 2000

Page 32: Multiple sequence alignment Why?

Matrix extension – T-Coffee

[Figure: pairwise alignments of all sequence pairs 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4.]

Page 33: Multiple sequence alignment Why?

Integrating alignment methods and alignment information with T-Coffee

• Integrating different pair-wise alignment techniques (NW, SW, ..)

• Combining different multiple alignment methods (consensus multiple alignment)

• Combining sequence alignment methods with structural alignment techniques

• Plug in user knowledge

Page 34: Multiple sequence alignment Why?

Using different sources of alignment information

[Figure: alignment information from Clustal, Dialign, Lalign, structure alignments and manual alignments is fed into T-Coffee.]

Page 35: Multiple sequence alignment Why?

Search matrix extension

Page 36: Multiple sequence alignment Why?

T-Coffee

• Combine different alignment techniques by adding scores:

$$W(A(x), B(y)) = \sum S(A(x), B(y))$$

– A(x) is residue x in sequence A

– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))

– S is the sequence identity percentage of the associated alignment

• Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:

$$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\bigl(W(A(x), I(z)),\; W(I(z), B(y))\bigr)$$

– summation over all third sequences I other than A or B
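The extension step can be sketched in a few lines. This is an illustration under assumptions, not the T-Coffee implementation: residues are written as hypothetical (sequence name, position) tuples, the primary weights are invented, and a pair that was never aligned directly starts from W = 0.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(library):
    """Return W' for every residue pair reachable directly or via a third sequence.

    library maps (residue_a, residue_b) -> primary weight W, where a residue is a
    (sequence name, position) tuple and the two residues come from different sequences.
    """
    partners = defaultdict(dict)  # residue -> {aligned residue: W}
    for (a, b), weight in library.items():
        partners[a][b] = weight
        partners[b][a] = weight

    extended = defaultdict(float)
    # Direct term W(A(x), B(y)).
    for (a, b), weight in library.items():
        extended[tuple(sorted((a, b)))] += weight
    # Indirect terms: A(x) and B(y) are both aligned to the same residue I(z).
    for via, aligned in partners.items():
        for (a, w_a), (b, w_b) in combinations(list(aligned.items()), 2):
            if a[0] == b[0]:
                continue  # both residues lie in the same sequence, not a pair A-B
            extended[tuple(sorted((a, b)))] += min(w_a, w_b)
    return dict(extended)


# Invented primary weights (illustration only).
library = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("B", 1), ("C", 1)): 100,
    (("A", 2), ("C", 2)): 77,
    (("B", 3), ("C", 2)): 100,
}
extended = extend_library(library)
print(extended[(("A", 1), ("B", 1))], extended[(("A", 2), ("B", 3))])
```

In this toy library the pair A(1)-B(1) is supported both directly (88) and through C(1) (min(77, 100) = 77), giving 165. The pair A(2)-B(3) was never aligned directly but gains weight 77 through C(2).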

Page 37: Multiple sequence alignment Why?

T-Coffee

[Figure: the direct alignment of two sequences shown together with their alignments via other sequences.]

Page 38: Multiple sequence alignment Why?

Search matrix extension

Page 39: Multiple sequence alignment Why?

Evaluating multiple alignments

• Conflicting standards of truth: evolution, structure, function

• With orphan sequences no additional information

• Benchmarks depending on reference alignments

• Quality issue of available reference alignment databases

• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)

• “Charlie Chaplin” problem

Page 40: Multiple sequence alignment Why?

Evaluating multiple alignments

As a standard of truth, often a reference alignment based on structural superpositioning is taken

Page 41: Multiple sequence alignment Why?

Evaluation measures

[Figure: a query alignment compared with a reference alignment.]

Column score

Sum-of-Pairs score
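Both measures can be sketched as below. The definitions used are assumptions commonly adopted in benchmarking: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. The toy alignments are invented.

```python
from itertools import combinations


def residue_columns(alignment):
    """Per column, a tuple with each sequence's residue index (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        entry = []
        for s, row in enumerate(alignment):
            if row[col] == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(alignment):
    """All residue pairs that share a column, as ((seq, idx), (seq, idx))."""
    pairs = set()
    for column in residue_columns(alignment):
        present = [(s, idx) for s, idx in enumerate(column) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(query, reference):
    """Fraction of reference residue pairs that the query also aligns."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref)


def column_score(query, reference):
    """Fraction of reference columns that occur identically in the query."""
    query_columns = set(residue_columns(query))
    ref_columns = residue_columns(reference)
    return sum(col in query_columns for col in ref_columns) / len(ref_columns)


# Toy alignments: the query misplaces the first residue of the third sequence.
reference = ["ACD-Y", "ACDWY", "A-DWY"]
query = ["ACD-Y", "ACDWY", "-ADWY"]
sp = sum_of_pairs_score(query, reference)
cs = column_score(query, reference)
print(sp, cs)
```

The single misplaced residue costs 2 of the 11 reference pairs but spoils 2 of the 5 columns, which shows why the column score is the stricter of the two measures.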

Page 42: Multiple sequence alignment Why?

Evaluating multiple alignments

[Figure: SP score plotted against BAliBASE alignment size (nseq * len).]

Page 43: Multiple sequence alignment Why?

Summary

• Weighting schemes simulating simultaneous multiple alignment

– Profile pre-processing (global/local)

– Matrix extension (well balanced scheme)

• Smoothing alignment signals

– globalised local alignment

• Using additional information

– secondary structure driven alignment

• Schemes strike balance between speed and sensitivity

Page 44: Multiple sequence alignment Why?

References

Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.

Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.

Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

Page 45: Multiple sequence alignment Why?

Where to find this: http://www.ibivu.cs.vu.nl/teaching