Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions...

39
Dina El-Khishin (Ph.D.) Deputy Director of AGERI & Head of the Genomics, Proteomics & Bioinformatics Research Facility Agricultural Genetic Engineering Research Institute (AGERI) Giza EGYPT

Transcript of Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions...

Page 1: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Dina El-Khishin (Ph.D.)

Deputy Director of AGERI

&

Head of the Genomics, Proteomics

&

Bioinformatics Research Facility

Agricultural Genetic Engineering Research Institute (AGERI)

Giza

EGYPT

Page 2: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Phylogeny

Bibliotheca Alexandrina

December 2007

What is Phylogeny?• A description how genes, proteins or species

are related within families

•It assumes the objects under investigation are related through evolution

Page 3: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

The classical approach to phylogeny uses morphological characteristics to study the relationship between species

The classical approach to phylogeny uses morphological characteristics to study the relationship between species

The classical approach to phylogeny uses morphologicalcharacteristics to study the relationship between species

The classical approach to phylogeny uses morphologicalcharacteristics to study the relationship between species

Page 4: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

What is Molecular Phylogeny?

• The use of molecular data to establish the relationship between species, organisms or gene families

• Data can be a variety of characteristics, such as:– Protein sequences– DNA hybridisation– Gene frequencies– Codon usage

– DNA sequences– Immunological data– Restriction patterns– Gaps in sequences

Historical background

•Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s.•In 1953 Frederick Sanger and colleagues determined the primary amino acid sequence of insulin.•(The accession number of human insulin is NP_000198)

Page 5: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Sanger and colleagues sequenced insulin (1950s)

We can make a multiple sequence alignment of insulinsfrom various species, and see conserved regions…

Goals of molecular phylogeny

Phylogeny can answer questions such as:

• How many genes are related to my favorite gene?• Was the extinct quagga more like a zebra or a horse?• Was Darwin correct that humans are closestto chimps and gorillas?• How related are whales, dolphins & porpoises to cows?• Where and when did HIV originate?• What is the history of life on earth?

Page 6: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Was the quagga (now extinct) more like a zebra or a horse?

Page 7: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Molecular phylogeny: nomenclature of trees

There are two main kinds of information inherent to any tree:

1-topology2-branch lengths.

We will now describe the parts of a tree.

Molecular phylogeny uses trees to depict evolutionary relationships among organisms.

These trees are based upon DNA and protein sequence data.

Page 8: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree nomenclature

Tree nomenclature

Page 9: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree nomenclature

Tree nomenclature: clades

Page 10: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree nomenclature

Tree nomenclature

Page 11: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree rootsThe root of a phylogenetic tree represents thecommon ancestor of the sequences.

Some trees are unrooted, and thus do not specify the common ancestor.

A tree can be rooted using an outgroup (that is, a taxon known to be distantly related from all other OTUs).

Tree nomenclature: roots

Rooted tree (specifies evolutionary path)The tree reflects the branching pattern (order of evolutionary events) starting from the oldest common ancestor

Unrooted:The tree only shows the evolutionary relationshipsbetween the descendants

Page 12: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree nomenclature: outgroup rooting

ab

c

d

a

b

c

d

d

b

c

a

Tree Representations

These trees are all equivalent!

These trees are all equivalent!

d

b

c

a

Page 13: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

•Molecular evolutionary studies can be complicated by the fact that both species and genes evolve.•Speciation usually occurs when a species becomes reproductively isolated. •In a species tree, each internal node represents a speciation event.

•Genes (and proteins) may duplicate or otherwise evolve before or after any given speciation event. •The topology of a gene (or protein) based tree may differ from the topology of a species tree.

Species trees versus gene/protein trees

Species trees versus gene/protein trees

Page 14: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Species trees versus gene/protein tree

man rat chicken man rat chicken

α hemoglobin β hemoglobin

Biological terms

radiation,speciation

gene duplication

Page 15: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Orthology vs Paralogy

• Orthologs: the sequences have diverged by speciations– E.g. human, mouse and chicken α

hemoglobin• Paralogs: the sequences have diverged

by gene duplication– E.g. the α and β hemoglobin genes

• Xenologs: genes acquired by horizontal gene transfer

rat man chicken

Hemoglobin tree

This tree is biologically improbable!This tree is biologically improbable!This tree is biologically improbable,unless we realise what is actually happening here!

This tree is biologically improbable,unless we realise what is actually happening here!

Page 16: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Hemoglobin tree

Hbα Hbβ

Hemoglobin tree

So make sure that you are using orthologs, or at least be aware that your data set may contain a mixture of orthologs and paralogs!

So make sure that you are using orthologs, or at least be aware that your data set may contain a mixture of orthologs and paralogs!

Hbα Hbβ

Page 17: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

man rat chicken man rat chicken

α hemoglobin β hemoglobin

paralogs

Biological terms

orthologs

radiation,speciation

gene duplication

man rat chicken man rat chicken

α hemoglobin β hemoglobin

ancestorbranch

node/HTU

root

descendanttip/OTU

Technical terms

OTU = Operational Taxonomic UnitHTU = Hypothetical Taxonomic Unit

Page 18: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

man rat rabbit man rat chicken

α hemoglobin β hemoglobin

polytomy

Technical terms

monophyleticpolyphyletic

man rat rabbit man rat chicken

α hemoglobin β hemoglobin

polytomy

Technical terms bifurcation

multifurcation

dichotomy

Page 19: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

What Complicates Things?

• What makes the reconstruction of evolutionary trees difficult, is the same thing that makes alignments difficult: the occurrence of substitutions, insertions and deletions

• On the other hand, this is also what makes it possible to reconstruct evolution: the differences between species (or sequences)

Molecular Phylogeny

• The Goal:– to find the true evolutionary tree

• The Method:– try to find the tree(s) that optimally satisfy

the data to some evolutionary model• The Problem:

– we have to investigate all possible trees to find the best one(s)

Page 20: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

The Goal

• Find the true evolutionary tree of– species evolution– gene evolution

• This information can be used to– predict the function of genes and proteins– design or alter protein function– study the origin of life on earth– study reticulate (network) evolution

The Problem(s)

• Trying to fit our data to every possible tree topology rapidly becomes infeasible– This type of problem is called intractable or

NP-complete• The more realistic the evolutionary

model is, the more complex it is– For complex models it is difficult to

generate realistic estimates of model parameters

Page 21: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Four stages of phylogenetic analysis

Molecular phylogenetic analysis may be described in four stages:

[1] Selection of sequences for analysis[2] Multiple sequence alignment[3] Tree building[4] Tree evaluation

Stage 1: Use of DNA, RNA, or protein

For some phylogenetic studies, it may be preferable to use protein instead of DNA sequences. We saw that in pairwise alignment and in BLAST searching, protein is often more informative than DNA.

Page 22: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Stage 2: Multiple sequence alignmen

The fundamental basis of a phylogenetic tree is a multiple sequence alignment.(If there is a misalignment, or if a nonhomologous sequence is included in the alignment, it will still be possible to generate a tree.)

Consider the following alignment of 13 orthologous retinol-binding proteins

Page 23: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Stage 2: Multiple sequence alignment

[1] Confirm that all sequences are homologous[2] Adjust gap creation and extension penalties as needed to optimize the alignment[3] Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data).[4] Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)

Page 24: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Stage 3: Tree-building methods•Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. •Examples of distance-based algorithms are UPGMA and neighbor-joining.

•Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.

Stage 3: Tree-building methods

• Distances

• Parsimony

• Maximum likelihood (ML)

• Bayesian (MCMC)

Page 25: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

We can introduce distance-based and character-based tree-building methods by referring to a tree of 13 orthologousretinol-binding proteins, and the multiple sequence alignment from which the tree was generated.

Orthologs:members of a gene (protein) family in variousorganisms.

Page 26: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Fish

Others

Page 27: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Regardless of whether you use distance-or character-based methods for building a tree, the starting point is a multiple sequence alignment.ReadSeq is a convenient web-based program that translates multiple sequence alignments into formats compatible with most commonly used phylogeny programs such as PAUP and PHYLIP.

Stage 3: Tree-building methods

Page 28: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Stage 3: Tree-building methods:distance

The simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences (Hamming distance) For an alignment of length N with n sites at which there are differences, the degree of divergenceD is:D = n / NThis not always very effective and more complex statistics can be used.

Tree-building methods: UPGMAUPGMA is unweighted pair group methodusing arithmetic mean

Page 29: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree-building methods: UPGMAStep 1: compute the pairwisedistances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.

Tree-building methods: UPGMAStep 2: Find the two proteins with the smallest pairwise distance. Cluster them.

Page 30: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree-building methods: UPGMA

Step 3: Do it again. Find the next two proteins with the smallest pairwisedistance. Cluster them.

Tree-building methods: UPGMAStep 4: Keep going. Cluster.

Page 31: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree-building methods: UPGMA

Distance-based methods:UPGMA trees

UPGMA is a simple approach for making trees.• An UPGMA tree is always rooted.• While UPGMA is simple, it is less accurate than the neighbor-joining approach.

Page 32: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Making trees using neighbor-joiningThe neighbor-joining method of Saitou and Nei(1987) Is especially usefulfor making a tree having alarge number of taxa.

Begin by placing all the taxa in a star-like structure.

Tree-building methods:Neighbor joining

Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Page 33: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Tree-building methods:character based

Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters).

Tree-building methods based on characters include maximum parsimony and maximum likelihood.

Tree-building methods: character based Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters).Tree-building methods based on characters include maximum parsimony and maximum likelihood.The main idea of character-based methods is to find the tree with the shortest branch lengths possible.Thus we seek the most parsimonious (“simple”) tree.• Identify informative sites. For example, constant characters are not parsimony-informative.• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxaperform a heuristic search.• Select the shortest tree (or trees).

Making trees using character-based methods

Page 34: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

As an example of tree-building using maximum parsimony, consider these four taxa:AAGAAAGGAAGAHow might they have evolved from a common ancestor such as AAA?

Tree-building methods: Maximum parsimon

In maximum parsimony, choose the tree(s) with thelowest cost (shortest branch lengths).

Page 35: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Phylogram(values are

proportionalto branchlengths)

Rectangularphylogram(values are

proportionalto branchlengths)

Page 36: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Cladogram(values are

notproportiona

lto branchlengths)

Rectangularcladogram

(values are notproportional

to branchlengths)

These four trees display the same data

in different formats.

Page 37: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Making trees using maximum likelihoo

Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in An alignment, based upon some model of the substitution process.ML is implemented in the TREE-PUZZLE program,as well as PAUP and PHYLIP.

Stage 4: Evaluating trees

The main criteria by which the accuracy of aphylogentic tree is assessed are consistency,efficiency, and robustness.

Evaluation of accuracy can refer to an approach (e.g. UPGMA) or to a particular tree.

Page 38: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Stage 4: Evaluating trees: bootstrappingBootstrapping is a commonly used approach to measuring the robustness of a tree topology.Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?To bootstrap, make an artificial dataset obtained by randomly sampling columns from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates.Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.

In 61% of the bootstrapresamplings, ssrbp and btrbp(pig and cow RBP) formed adistinct clade.

Page 39: Phylogeny - Bibliotheca Alexandrina · Goals of molecular phylogeny Phylogeny can answer questions such as: • How many genes are related to my favorite gene? • Was the extinct

Towson Bioinformatics Portal• http://www.towson.edu/nalkharo/BioinformaticsPortal/

•Martti Tolvanen & Bairong ShenUniversity of Tampere

•Bioinformatics for Dummies (Wiley 2003)

•Internet Bioinformatics

• Nadim Alkharouf, Ph.D., Towson University