BIOINFORMATICS
DR. VÍCTOR TREVIÑO
vtrevino@itesm.mx

Multiple Sequence Alignments and Phylogeny


Page 1: Multiple Sequence Alignments and Phylogeny. vtrevino@itesm.mx  Within a protein sequence, some regions will be more conserved than others. As more conserved,


SEQUENCE SIMILARITY

Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it is likely to be:

- for function
- for 3D structure
- for localization
- for modification
- for interaction
- for regulation/control
- for transcriptional regulation (in DNA)

REASONS TO PERFORM SEQUENCE SIMILARITY ANALYSIS AND SEARCHES


SEQUENCE ALIGNMENT

- Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
- Identical residues (nt or aa) are placed in the same column
- Non-identical residues can be placed in the same column or indicated as gaps

Wikipedia; http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm; Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

[Figure label: Overall similarity]
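The column-wise placement of identical residues, mismatches and gaps described on this slide can be illustrated with a small global pair-wise alignment by dynamic programming (the Needleman-Wunsch scheme). This is a minimal sketch, not the method of any particular program: the function name and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pair-wise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # same column
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

Both returned strings have the same length, so each position is one alignment column; a "-" marks a gap.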


MULTIPLE SEQUENCE ANALYSIS – ADDITIONAL USES

- Interesting regions
- Promoter regions
- Consensus sequence for probe design
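As an illustration of the last use, a consensus sequence can be read off an existing multiple sequence alignment column by column. The sketch below takes the most frequent non-gap residue per column; that rule, the function name and the toy alignment are assumptions for illustration, and real probe design would apply further criteria.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most frequent non-gap residue per column (gap if the column is all gaps)."""
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


msa = ["ACGT-A", "ACGTTA", "AGGT-A"]
print(consensus(msa))
```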


Multiple Sequence Alignment - MSA

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


MULTIPLE SEQUENCE ALIGNMENT - MSA

Dynamic programming is designed for two sequences

It would take quite a long time for three or more (see MSA program)

[Figure axes: Sequence A, Sequence B, Sequence C]
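To see why full dynamic programming becomes slow beyond two sequences, note that the scoring table gains one dimension per sequence, so the number of cells is the product of (length + 1) over all sequences. The sketch below only counts cells; the function name and the sequence length of 300 are arbitrary assumptions.

```python
def lattice_cells(lengths):
    """Number of cells in the dynamic programming table for the given sequence lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


for k in (2, 3, 4, 5):
    print(k, "sequences of length 300:", lattice_cells([300] * k), "cells")
```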


RELATION BETWEEN MSA & EVOLUTIONARY TREE RECONSTRUCTION


MULTIPLE SEQUENCE ALIGNMENT – METHODS

- Extensions of pair-wise sequence alignment: MSA
- Progressive methods: CLUSTALW
- Iterative methods, Hidden Markov Models (HMM)


MULTIPLE SEQUENCE ALIGNMENT - MSA Algorithm

1. Calculate all pair-wise alignment scores (alignment costs).
2. Use the scores (costs) to predict a tree.
3. Calculate pair weights based on the tree.
4. Produce a heuristic msa based on the tree.
5. Calculate the maximum epsilon for each sequence pair.
6. Determine the spatial positions that must be calculated to obtain the optimal alignment.
7. Perform the optimal alignment.
8. Report the epsilon found compared to the maximum epsilon.

Epsilon for a given sequence pair is the difference between the score of the alignment of that pair in the msa and the score of the optimal pair-wise alignment. The bigger the value of epsilon, the more divergent the msa from the pair-wise alignment and the smaller the contribution of that alignment to the msa. For example, if an extra copy of one of the sequences is added to the alignment project, then epsilon for sequence pairs that do not include that sequence will increase, indicating a lesser role because the contributions of that pair have been out-voted by the similar sequences.


PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT

Dynamic programming is designed for two sequences

It would take quite a long time for three or more (see MSA program)

Therefore…

1. Align all sequences pair-wise
2. Determine the "distances" between each pair
3. Align the two most similar sequences, then keep the resulting alignment
4. Take the next most similar sequence (or alignment) and perform the same steps until all sequences have been included
5. E.g.
   1. (S3+S4)=c1
   2. (S1+S2)=c2
   3. (c1+c2)=c3
   4. (c3+S5)=final

[Figure labels: S1, S2, S3, S4, S5]
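The joining order in the S1 to S5 example can be sketched as a loop over a pair-wise distance table: repeatedly merge the two closest sequences or clusters. This is only a sketch under stated assumptions: the distances are invented so that the order matches the example, cluster-to-cluster distance is taken as the average over member pairs, and the function name is hypothetical. Real programs build their guide tree with their own method, and the actual alignment of each merged pair is not performed here.

```python
from itertools import combinations


def join_order(names, dist):
    """Return the merge steps; dist maps frozenset({a, b}) to a distance."""
    clusters = {name: (name,) for name in names}  # label -> member sequences
    steps = []
    while len(clusters) > 1:

        def avg(x, y):
            # Average distance over all member pairs of the two clusters.
            pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
            return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

        x, y = min(combinations(clusters, 2), key=lambda p: avg(*p))
        label = f"c{len(steps) + 1}"
        steps.append(f"({x}+{y})={label}")
        clusters[label] = clusters.pop(x) + clusters.pop(y)
    return steps


names = ["S1", "S2", "S3", "S4", "S5"]
# Invented distances: S3/S4 closest, then S1/S2, with S5 far from everything.
dist = {frozenset(p): 8.0 for p in combinations(names, 2)}
dist[frozenset({"S3", "S4"})] = 1.0
dist[frozenset({"S1", "S2"})] = 2.0
for a in ("S1", "S2"):
    for b in ("S3", "S4"):
        dist[frozenset({a, b})] = 4.0

for step in join_order(names, dist):
    print(step)
```

The last merge is labelled c4 here rather than "final", and it always contains every input sequence.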


PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - CLUSTALW

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

[Figure: CLUSTALW method; annotations: "(then normalized to largest = 1)", "Alignment score for column"]


PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - CLUSTALW

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

[Figure: steps 1, 2, 3]


PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT - PROBLEMS

- Dependency on the initial alignment of the most similar sequences
- Errors become nested in the final alignment when the most similar sequences are actually quite different
- So CLUSTALW works best for closely related sequences
- Choice of suitable scoring matrices

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


ITERATIVE MULTIPLE SEQUENCE ALIGNMENT

- Try to correct for the dependency on the most similar sequences in progressive methods
- Repeatedly realign subgroups of sequences, then align these subgroups into a global alignment
- Subgroups are chosen based on tree ordering, separation of sequences, or random group selection

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


ITERATIVE MULTIPLE SEQUENCE ALIGNMENT

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


MULTIPLE SEQUENCE ALIGNMENT

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press



MULTIPLE SEQUENCE ALIGNMENT - PROGRAMS

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


MULTIPLE SEQUENCE ALIGNMENT - OVERVIEW


PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES

Determination of how the family might have been derived during evolution

Sequences are depicted as branches on a tree

Very similar sequences are located as neighbours in a branch

The goal is to discover all the branching relationships and the branch lengths

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES

Phylogenetic relationships among the genes can help to predict which ones might have an equivalent function.

Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus

Important for discovering function, 3D structure, localization, modification, interaction, regulation/control, transcriptional regulation

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


PHYLOGENY ANALYSIS AND PREDICTION FROM DNA/PROTEIN SEQUENCES

Related to SEQUENCE ALIGNMENT

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


GENOME COMPLEXITY

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


GENOME COMPLEXITY

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


EVOLUTIONARY TREE

An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms

The separate sequences are referred to as taxa (singular taxon), defined as phylogenetically distinct units on the tree

The tree is composed of outer branches (or leaves) representing the taxa, and of nodes and branches representing relationships among the taxa

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


EVOLUTIONARY TREE

A and B are derived from a common ancestor

Each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are isolated reproductively

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


EVOLUTIONARY TREE

Beyond splitting, any further evolutionary changes in each new branch are independent of those in the other new branch

The length of each branch to the next node represents the number of sequence changes that occurred prior to the next level of separation

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


EVOLUTIONARY TREE

Uniform mutation rate: the Molecular Clock Hypothesis, suitable for closely related species

Special cases could use non-uniform rates

The root is defined by including a taxon that we are reasonably sure branched off earlier than the other taxa

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


EVOLUTIONARY TREE

The sum of all the branch lengths in a tree is referred to as the tree length.

The tree is also a bifurcating or binary tree, in that only two branches emanate from each node.

Trees can have more than one branch emanating from a node if the events separating taxa are so close that they cannot be resolved, or to simplify the tree.

The unrooted tree also shows the evolutionary relationships among sequences A–D, but it does not reveal the location of the oldest ancestry.

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
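The definitions of tree length and bifurcating tree on this slide can be made concrete with a tiny data structure. The tree below is hypothetical: the node names, the branch lengths and the (name, branch length to parent, children) tuple layout are all assumptions for illustration.

```python
# Hypothetical rooted tree: each node is (name, branch_length_to_parent, children).
tree = ("root", 0.0, [
    ("n1", 1.0, [("A", 2.0, []), ("B", 3.0, [])]),
    ("n2", 2.0, [("C", 1.0, []), ("D", 1.0, [])]),
])


def tree_length(node):
    """Sum of all branch lengths in the tree."""
    _, length, children = node
    return length + sum(tree_length(child) for child in children)


def is_bifurcating(node):
    """True if every node has either no children (a leaf) or exactly two."""
    _, _, children = node
    return len(children) in (0, 2) and all(is_bifurcating(c) for c in children)


print(tree_length(tree))
print(is_bifurcating(tree))
```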


EVOLUTIONARY TREE

The number of possible rooted trees increases very rapidly with the number of sequences or taxa

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
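The growth can be quantified with the standard counting result for bifurcating trees with labelled leaves: for n taxa there are (2n-3)!! rooted trees and (2n-5)!! unrooted trees, where !! is the double factorial (the product of every second integer down to 1). The slide itself does not give the formula; the sketch below simply evaluates it, and the function names are illustrative.

```python
def odd_double_factorial(k):
    """k!! for odd k, i.e. k * (k - 2) * ... * 3 * 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled taxa."""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa."""
    return odd_double_factorial(2 * n - 5)


for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```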


METHODS TO BUILD EVOLUTIONARY TREES

To find the evolutionary tree or trees that best account for the observed variation in a group of sequences

- Maximum Parsimony
- Distance
- Maximum Likelihood


METHOD SELECTION

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


CONSIDERATIONS

- The alignment should not contain a large number of gaps
- Phylogenetic methods analyze conserved regions that are represented in all the sequences (local alignments)


MAXIMUM PARSIMONY (OR MINIMUM EVOLUTION)

- Predicts the evolutionary tree by minimizing the number of steps required to generate the observed sequence changes
- Requires a multiple sequence alignment
- The method examines each informative position and each possible tree
  - Informative position: same residue in at least two sequences but not all
- Used for sequences that are quite similar and for a small number of sequences
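Finding the informative positions is easy to sketch. The code below uses the usual stricter textbook form of the rule: a column is informative when at least two different residues each appear in at least two sequences, since only such columns can favour one tree over another. The function name and the toy alignment are assumptions, and gap characters are treated like any other residue.

```python
from collections import Counter


def informative_columns(alignment):
    """Indices of columns where at least two residue types each occur at least twice."""
    hits = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            hits.append(index)
    return hits


msa = ["AAGAGT", "AGCCGT", "AGATAT", "AGAGAT"]
print(informative_columns(msa))
```

In this toy alignment only the fifth column (index 4, residues G, G, A, A) qualifies; fully conserved columns and columns with a single odd residue carry no information about the branching order.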


MAXIMUM PARSIMONY (OR MINIMUM EVOLUTION)

[Figure label: Noninformative]

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


DISTANCE METHODS

Employ the number of changes between each pair of sequences

Sequence pairs that have the smallest number of sequence changes are "neighbours" sharing a node in the tree

Closely related to the multiple sequence alignment method (CLUSTALW), which produces distance matrices that are then analysed by distance methods

Remember distance vs similarity (and gaps)

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
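A distance matrix of the kind these methods consume can be sketched directly from an alignment. The measure below is the simple proportion of differing positions (often called the p-distance), which equals 1 minus the fraction of identical positions; columns with a gap in either sequence are skipped. The choice of measure, the gap handling and the function names are assumptions for illustration, since programs differ on all three.

```python
def p_distance(s1, s2, gap="-"):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pair-wise p-distances for aligned sequences."""
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j]) for j in range(n)] for i in range(n)]


msa = ["ACGT", "ACGA", "AC-A"]
for row in distance_matrix(msa):
    print(row)
```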


DISTANCE METHODS

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

[Figure label: "Idealized"]


DISTANCE ALGORITHMS

- Fitch and Margoliash Method
- Neighbor-joining Method
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
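Of the three, UPGMA is the simplest to sketch: repeatedly join the two closest clusters, place their node at half the joining distance, and replace their distances to every other cluster by a size-weighted average. The code below is a minimal illustration, not a reference implementation; the nested-tuple output format, the function name and the example matrix are assumptions. UPGMA assumes a constant rate of change along all branches (a molecular clock).

```python
def upgma(names, matrix):
    """UPGMA clustering; returns a nested tuple (left, right, node_height)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), d = min(dist.items(), key=lambda item: item[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the new cluster to each remaining one: size-weighted mean.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, d / 2), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(upgma(names, matrix))
```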


DISTANCE ALGORITHM

Choosing an outgroup improves prediction because the methods are informed about the "order" of the outgroup


MAXIMUM LIKELIHOOD

Uses probability of the number of sequence changes

Analysis is performed for each informative residue (like in maximum parsimony)

All possible trees are considered (so it is suited to a small number of sequences)

Considers variations in mutation rates, so it can be used for more distant sequences

Main disadvantage: Computation Time


MAXIMUM LIKELIHOOD

Needs a model that provides estimates of substitution rates for each residue pair
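As a concrete example of such a model, the sketch below uses the Jukes-Cantor model (JC69), in which every nucleotide is equally frequent and changes into each other nucleotide at the same rate. The slides do not name a model, so JC69 is an assumption chosen for simplicity, and the function names are illustrative. The code computes the log-likelihood of two aligned, gap-free DNA sequences separated by a distance d (expected substitutions per site) and the distance that maximizes it.

```python
import math


def jc69_log_likelihood(s1, s2, d):
    """Log-likelihood of two aligned DNA sequences at JC69 distance d (d > 0)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence.
    return sum(math.log(0.25 * (p_same if a == b else p_diff))
               for a, b in zip(s1, s2))


def jc69_distance(s1, s2):
    """Maximum likelihood JC69 distance from the observed fraction of differences."""
    p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


s1 = "ACGTACGTAC"
s2 = "ACGTACGTTT"
best = jc69_distance(s1, s2)
print(best, jc69_log_likelihood(s1, s2, best))
```

A full maximum likelihood tree search repeats this kind of calculation over every site, every branch and every candidate tree, which is where the computation time goes.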


RELIABILITY OF PHYLOGENETIC PREDICTIONS

- Bootstrap method: randomly resample the columns of the alignment, with replacement, and rebuild the tree each time (robustness test)
- Good evidence for a branch if it is conserved in more than 70% of the predictions
- Then collapse branches and confirm tree length
- Compare distinct methods and parameters
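The resampling step of the bootstrap can be sketched in a few lines: draw as many columns as the alignment has, with replacement, so some columns appear several times and others not at all. In the sketch below `has_group` is a hypothetical callback standing in for "build a tree from this replicate and check whether the branch of interest is present"; it is not part of any real package, and the function names and replicate count are likewise assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the number of columns."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def support(alignment, has_group, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which has_group(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(bool(has_group(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return hits / replicates


msa = ["ACGT", "ACGA", "TCGA"]
print(bootstrap_alignment(msa, random.Random(1)))
```

A support value above 0.7 would correspond to the 70% threshold mentioned on the slide.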


"CLASSIC" PROGRAMS

PHYLIP http://evolution.genetics.washington.edu/phylip.html

PAUP http://paup.csit.fsu.edu/downl.html

Phylemon http://phylemon.bioinfo.cipf.es/cgi-bin/tools.cgi


PHYLEMON WEB SERVICE


PROGRAMS – WEB SERVICES

http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory


PROGRAMS – WEB SERVICES

http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory


BOOK


EXERCISE/HOMEWORK

- Select a gene
- Get the sequence in at least 7 species
- Select a site (Phylemon)
- Perform the multiple sequence alignment (ClustalW)
- Perform phylogeny to obtain a tree
  - At least 2 tree methods
  - At least 3 parameter changes
  - Use both DNA and protein sequences
- Report results and discussion

12 MSA+Trees


PAPERS TO REVIEW

Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis – Loytynoja, Goldman, Science 2008

Insertions and deletions treated as different events


PAPERS PENDING FOR THIS SESSION