Multiple Sequence Alignment
description
Transcript of Multiple Sequence Alignment
![Page 1: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/1.jpg)
Multiple Sequence Alignment
![Page 2: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/2.jpg)
Multiple Sequence Alignment
VTISCTGSSSNIGAGNHVKWYQQLPGVTISCTGTSSNIGSITVNWYQQLPGLRLSCSSSGFIFSSYAMYWVRQAPGLSLTCTVSGTSFDDYYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDGATLVCLISDFYPGAVTVAWKADSATLVCLISDFYPGAVTVAWKADSAALGCLVKDYFPEPVTVSWNSG-VSLTCLVKGFYPSDIAVEWESNG-
• Goal: Bring the greatest number of similar characters into the same column of the alignment. (Similar to alignment of two sequences)
• Definition: In an MSA, homologous residues among a set of sequences are aligned together in columns. Homologous is meant in both the structural and evolutionary sense. Ideally, a column of aligned residues occupy similar three-dimensional structural positions and all diverge from a common ancestral residue.
• Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single correct multiple alignment.
![Page 3: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/3.jpg)
Multiple Sequence Alignment: Motivation
• Correspondence. Find out which parts “do the same thing”
– Similar genes are conserved across widely divergent species, often performing similar functions
• Structure prediction
– Use knowledge of structure of one or more members of a protein MSA to predict structure of other members
– Structure is more conserved than sequence
• Create “profiles” for protein families
– Allow us to search for other members of the family
• Genome assembly: Automated reconstruction of “contig” maps of genomic fragments such as ESTs
• MSA is the starting point for phylogenetic analysis
![Page 4: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/4.jpg)
Multiple Sequence Alignment: Approaches
• Optimal Global Alignments -Dynamic programming
– Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function
– Computationally expensive: Time grows as product of sequence lengths
• Global Progressive Alignments - Match closely-related sequences first using a
guide tree
• Global Iterative Alignments - Multiple re-building attempts to find best alignment
• Local alignments
– Profiles, Blocks, Patterns
![Page 5: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/5.jpg)
Sum-of-Pairs/SP Score Function
Score of multiple alignment
= ∑i <j score(Si,Sj)
where score(Si,Sj) = score of induced pair-wise alignment
QuickTime™ and a decompressor
are needed to see this picture.
![Page 6: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/6.jpg)
Materials & Methods• Group of related proteins & how to find them: Protein sequences can be related
by homology (common ancestor) or convergence (proteins independently evolve to have common sequence features that typically perform a common function or have a common structure).
• How many sequences are needed?: More is better.
• Redundancy: Redundant sequences will bias the alignment toward their own features. A good rule of thumb in cases of varied similarities is to have only a single
sequence from each set with more than 70-80% intra-sequence similarity.
![Page 7: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/7.jpg)
MSA: Different Methods
• Exact methods• Progressive (ClustalW)• Iterative (MUSCLE)• Consistency (ProbCons)• Structure-based (Expresso)
![Page 8: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/8.jpg)
Progressive Multiple Sequence Alignment
• Compare all sequences pairwise. • Perform cluster analysis on the pairwise data to generate a hierarchy
for alignment. This may be in the form of a binary tree or a simple ordering
• Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair and so on. Once an alignment of two sequences has been made, then this is fixed. Thus for a set of sequences A, B, C, D having aligned A with C and B with D the alignment of A, B, C, D is obtained by comparing the alignments of A and C with that of B and D using averaged scores at each aligned position.
![Page 9: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/9.jpg)
Steps in Multiple Alignment
![Page 10: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/10.jpg)
CLUSTALW MSA
MSA of four oxidoreductase NAD binding domain protein sequences. Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey: all others. Residue ranges are shown after sequence names.
![Page 11: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/11.jpg)
Multiple sequence alignment methods
Iterative methods: compute a sub-optimal solution and keep modifying that intelligently using dynamic programming or other methods until the solution converges.
Examples: MUSCLE, IterAlign, Praline, MAFFT
![Page 12: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/12.jpg)
MUSCLE: next-generation progressive MSA
[1] Build a draft progressive alignment
• Determine pairwise similarity through k-mer counting (not by alignment)
• Compute distance (triangular distance) matrix
• Construct tree using UPGMA
• Construct draft progressive alignment following tree
Page 191
![Page 13: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/13.jpg)
MUSCLE: next-generation progressive MSA
[2] Improve the progressive alignment
• Compute pairwise identity through current MSA
• Construct new tree with Kimura distance measures
• Compare new and old trees: if improved, repeat this step, if not improved, then we’re done
Page 191
![Page 14: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/14.jpg)
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
• Split tree in half by deleting one edge
• Make profiles of each half of the tree
• Re-align the profiles
• Accept/reject the new alignment
![Page 15: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/15.jpg)
MUSCLE: next-generation progressive MSA
[3] Refinement of the MSA
• Split tree in half by deleting one edge
• Make profiles of each half of the tree
• Re-align the profiles
• Accept/reject the new alignment
![Page 16: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/16.jpg)
Iterative approaches: MAFFT
•Uses Fast Fourier Transform to speed up profile alignment•Uses fast two-stage method for building alignments using k-mer frequencies•Offers many different scoring and aligning techniques•One of the more accurate programs available•Available as standalone or web interface•Many output formats, including interactive phylogenetic trees
Page 190
![Page 17: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/17.jpg)
MSA: Consistency Approach
• Consistency-based algorithms: generally use a database of both local high-scoring alignments and long-range global alignments to create a final alignment
• These are very powerful, very fast, and very accurate methods
• Examples: T-COFFEE, Prrp, DiAlign, ProbCons
Page 192
![Page 18: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/18.jpg)
ProbCons—consistency-based approach
•Combines iterative and progressive approaches with a unique probabilistic model.
•Uses Hidden Markov Models to calculate probability matrices for matching residues, uses this to construct a guide tree
•Progressive alignment hierarchically along guide tree
•Post-processing and iterative refinement (a little like MUSCLE)
![Page 19: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/19.jpg)
Structure Based MSA
• Structure based alignment is more informative and give us more efficient information about Proteins.
o 3D structure is more conserved then simple amino acid sequence. o Tertiary protein structure evolves more slowly than primary structure. So it is
possible to improve the accuracy of alignment by including information about the three dimensional structure of protein.
The structure of Protein is very complex. So Protein structural alignment is more time consuming and complicated.
![Page 20: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/20.jpg)
Structure Based MSA: 3D-Coffee and Expresso
• 3D-Coffee is a special mode of T-Coffee that uses structural information. In 3D-Coffee the original sequences are replaced by homologous structures (templates). The templates are aligned using strutural aligners like sap or TMalign and these structure based alignments are used as a template for realigning the original sequences. Provided the right templates have been found, the result is a structure based multiple sequence alignment. If some of the sequences do not have any template, they are treated like regular sequences and aligned with the appropriate methods.
• Expresso is an extension of 3D-Coffee where the structural templates are automatically identified with BLAST.
![Page 21: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/21.jpg)
QuickTime™ and a decompressor
are needed to see this picture.
![Page 22: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/22.jpg)
Structure Based MSA: 3D-Coffee and Expresso
Protein Data Bank accession numbers
![Page 23: Multiple Sequence Alignment](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681443d550346895db0d98d/html5/thumbnails/23.jpg)
Two Extremes of MSA
• Recently diverged sequences from each other did not change much. Hence these will be very similar across the length. So, we cannot tell whether distant regions that are conserved due to their importance from regions that did not have time to diverge.
• Sequences that are very diverged from each other (or perhaps are
not related to begin with) and have no similar sequence regions. It
still is possible that the proteins are related and have corresponding
regions, but we cannot identify these by their sequence. This can
change if we have more data (more sequences and/or experimental
data) or use better alignment methods.