Phylogeny. Reconstructing a phylogeny The phylogenetic tree (phylogeny) describes the evolutionary...

Post on 20-Dec-2015


Phylogeny

Reconstructing a phylogeny

• The phylogenetic tree (phylogeny) describes the evolutionary relationships between the studied data
• The data must be comprised of homologous types
• In molecular evolution, the studied data are homologous DNA/AA sequences
• Phylogeny reconstruction explicitly assumes that the sequences are aligned

INPUT = MSA

Reminder: MSA and phylogeny are dependent

[Figure: diagram with the labels Unaligned sequences, Sequence alignment, Inaccurate guide tree, MSA, Phylogeny reconstruction]

Phylogeny representation

Visual representation

[Figure: tree with the leaves A, C, B, D]

Textual representation (Newick format)

((A,C),(B,D));

• Each pair of parentheses () encloses a clade in the tree
• A comma “,” separates the members of the corresponding clade
• A semicolon “;” is always the last character
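The three Newick rules can be illustrated with a small parser. This is a minimal sketch that assumes names-only Newick strings (no branch lengths, no support values); the function name `parse_newick` and the nested-tuple output are illustrative choices, not part of the slides.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested tuples of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # "(" opens a clade
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # "," separates the members of the clade
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # ")" closes the clade
            return tuple(children)
        # Otherwise read a leaf name up to the next structural character.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return tree


tree = parse_newick("((A,C),(B,D));")
print(tree)
```

Each nested tuple corresponds to one pair of parentheses, so the clades (A,C) and (B,D) can be read directly from the result.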

Some terminology

• root
• internal nodes
• internal branches (splits)
• external nodes (leaves)
• external branches
• monophyletic group (clade)
• neighbors

Swapping neighbors is meaningless

[Figure: four drawings of the same Human, Chimp, Gorilla tree that differ only in the order of neighbors]

(Gorilla,(Human,Chimp)) = (Gorilla,(Chimp,Human)) = ((Human,Chimp),Gorilla) = ((Chimp,Human),Gorilla)
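One way to check that these four Newick strings describe the same tree is to compare them in an order-free form. The sketch below assumes trees are written as nested Python tuples of leaf names (an illustrative representation, not something the slides prescribe) and turns every clade into a `frozenset`, which ignores the order of its members.

```python
def canonical(tree):
    """Return an order-independent form of a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return tree  # a leaf is just its name
    # A clade is the unordered set of its (canonical) members.
    return frozenset(canonical(child) for child in tree)


drawings = [
    ("Gorilla", ("Human", "Chimp")),
    ("Gorilla", ("Chimp", "Human")),
    (("Human", "Chimp"), "Gorilla"),
    (("Chimp", "Human"), "Gorilla"),
]
same = all(canonical(t) == canonical(drawings[0]) for t in drawings)
different = canonical((("Human", "Gorilla"), "Chimp")) != canonical(drawings[0])
print(same, different)
```

Swapping neighbors leaves the canonical form unchanged, whereas regrouping the leaves (for example pairing Human with Gorilla) produces a different one.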


Rooted vs. unrooted

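[Figure: an unrooted tree of A, B and C with its branches labeled 1, 2 and 3, and three rooted trees labeled 1, 2 and 3]

In Newick format

• Rooted: ((C,B),A)   ((A,B),C)   ((A,C),B)
• Unrooted: (A,B,C)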
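The single unrooted tree of A, B and C corresponds to three rooted trees because a root can be placed on any of its three branches. The sketch below generalizes the count using standard results for binary trees with n labeled leaves; these formulas are not stated in the slides and are added here as an assumption-labeled illustration.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def branches_unrooted(n):
    """Number of branches, i.e. possible root positions, in an unrooted binary tree."""
    return 2 * n - 3


def count_unrooted(n):
    """Number of distinct unrooted binary topologies for n >= 3 labeled leaves."""
    return double_factorial(2 * n - 5)


def count_rooted(n):
    """Number of distinct rooted binary topologies for n >= 2 labeled leaves."""
    return double_factorial(2 * n - 3)


for n in (3, 4, 10):
    print(n, branches_unrooted(n), count_unrooted(n), count_rooted(n))
```

For n = 3 this gives one unrooted and three rooted topologies, matching the Newick strings above. The rapid growth with n is why exhaustive topology search quickly becomes impractical.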

How can we root a tree?

Rooting the tree based on a priori knowledge: using an outgroup

[Figure: a tree of Human, Chimp, Gorilla and Chicken, rooted with Chicken as the OUTGROUP and Human, Chimp and Gorilla as the INGROUP]

The outgroup should be close enough for detecting sequence homology, but far enough to be a clear outgroup
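Outgroup rooting can be sketched as placing the root on the branch that leads to the outgroup leaf. The example below assumes the unrooted tree is stored as an adjacency dictionary; the internal node names `u` and `v` and the function name are hypothetical, and the topology (Human next to Chimp) follows the Newick strings used earlier.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf."""
    (neighbor,) = adjacency[outgroup]  # a leaf has exactly one neighbor

    def subtree(node, parent):
        # Walk away from the root; everything except the parent is a child.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # leaf
        return tuple(subtree(child, node) for child in children)

    return (outgroup, subtree(neighbor, outgroup))


# Unrooted tree: Human and Chimp hang off u, Gorilla and Chicken hang off v.
unrooted = {
    "Human": ["u"],
    "Chimp": ["u"],
    "Gorilla": ["v"],
    "Chicken": ["v"],
    "u": ["Human", "Chimp", "v"],
    "v": ["u", "Gorilla", "Chicken"],
}
rooted = root_with_outgroup(unrooted, "Chicken")
print(rooted)
```

The result groups the ingroup (Human, Chimp, Gorilla) on one side of the root and the outgroup (Chicken) on the other.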

The gene tree is not always identical to the species tree

[Figure: a gene tree and a species tree over Human, Chimp, Gorilla and Chicken]

Phylogeny reconstruction approaches

Distance based methods: Neighbor Joining

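[Figure: trees of A, B, C, D and E in which A and B are joined into the cluster A,B]

Distance matrix before joining:

      A    B    C    D    E
A     0    2    3    4    4
B          0    3    4    5
C               0    3    4
D                    0    5
E                         0

Distance matrix after joining A and B:

      A,B  C    D    E
A,B   0    2.5  4.5  3.5
C          0    3    4
D               0    5
E                    0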

The Minimum Evolution (ME) criterion: in each iteration we separate from the rest of the tree the pair of sequences that results in the minimal sum of branch lengths
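The pair-selection step can be sketched in code. The slides do not give the selection formula, so the example below assumes the commonly used Neighbor Joining criterion Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other sequences. It only picks the first pair to join and does not update the matrix.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
# Upper triangle of the distance matrix from the slide, row by row.
upper = {
    "A": {"B": 2, "C": 3, "D": 4, "E": 4},
    "B": {"C": 3, "D": 4, "E": 5},
    "C": {"D": 3, "E": 4},
    "D": {"E": 5},
}
dist = {frozenset((a, b)): d for a, row in upper.items() for b, d in row.items()}


def nj_pick_pair(names, dist):
    """Return the pair with the smallest Q value (assumed NJ criterion)."""
    n = len(names)
    # r[a] is the total distance from a to every other sequence.
    r = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * dist[frozenset(pair)] - r[a] - r[b]

    return min(combinations(names, 2), key=q)


pair = nj_pick_pair(names, dist)
print(pair)
```

With this matrix the criterion selects A and B, which is the pair joined on the slide.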

Maximum Parsimony: finds the most parsimonious topology

[Figure: four aligned sequences (Seq 1, Seq 2, Seq 3, Seq 4) and the three possible topologies for them: 1 3 | 2 4, 1 4 | 2 3, 1 2 | 3 4]
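Scoring the three candidate topologies by parsimony can be sketched with Fitch's algorithm, which counts the minimum number of changes each alignment column requires on a given tree. The slides do not name the counting algorithm, and the four sequences below are hypothetical (the sequences on the slide were not preserved).

```python
def fitch_site(tree, column):
    """Return (possible states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, column)
    right_states, right_cost = fitch_site(right, column)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences, for illustration only.
seqs = {"1": "AAGA", "2": "AAGT", "3": "CTGA", "4": "CTGT"}
candidates = [(("1", "3"), ("2", "4")), (("1", "4"), ("2", "3")), (("1", "2"), ("3", "4"))]
scores = {tree: parsimony_score(tree, seqs) for tree in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

The most parsimonious topology is the one with the lowest score; for these toy sequences it is the tree that pairs sequence 1 with 2 and sequence 3 with 4.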

Phylogeny reconstruction approaches

Maximum Likelihood: finds the most likely topology

[Figure: the three candidate topologies, 1 3 | 2 4, 1 4 | 2 3 and 1 2 | 3 4, each scored by P(Data|T)]

Topology search methods: MP, ML
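Computing P(Data|T) for a fixed topology can be sketched with Felsenstein's pruning algorithm. The example below makes several assumptions that are not in the slides: the Jukes-Cantor substitution model, one fixed branch length of 0.1 on every branch, and four hypothetical sequences. Real ML programs also optimize branch lengths and model parameters.

```python
import math

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {(x, y): (same if x == y else diff) for x in BASES for y in BASES}


def partial(tree, column, p):
    """Likelihood of the data below a node, for each possible state of the node."""
    if isinstance(tree, str):
        return {x: 1.0 if column[tree] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child in tree:
        below = partial(child, column, p)
        for x in BASES:
            result[x] *= sum(p[(x, y)] * below[y] for y in BASES)
    return result


def log_likelihood(tree, seqs, branch_length=0.1):
    """log P(Data|T) with equal base frequencies and a fixed branch length."""
    p = jc69(branch_length)
    length = len(next(iter(seqs.values())))
    total = 0.0
    for i in range(length):
        column = {name: s[i] for name, s in seqs.items()}
        root = partial(tree, column, p)
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total


# Hypothetical aligned sequences, for illustration only.
seqs = {"1": "AAGA", "2": "AAGT", "3": "CTGA", "4": "CTGT"}
candidates = [(("1", "3"), ("2", "4")), (("1", "4"), ("2", "3")), (("1", "2"), ("3", "4"))]
scores = {tree: log_likelihood(tree, seqs) for tree in candidates}
best = max(scores, key=scores.get)
print(best)
```

The topology with the highest log-likelihood is reported as the ML tree among the candidates.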

Phylogeny reconstruction approaches: summary

Distance based methods
• Neighbor Joining (e.g., using ClustalX): fast, inaccurate

Topology search methods
• Maximum parsimony (e.g., using MEGA): crude, questionable statistical basis
• Maximum likelihood (e.g., using RAxML, PhyML): accurate, slow

Bayesian methods
• Markov chain Monte Carlo (MCMC) (e.g., using MrBayes): most accurate, very slow

How robust is our tree?

[Figure: tree of Human, Chimp and Gorilla]

• We need some statistical way to estimate the confidence in the tree topology
• But we don’t know anything about the distribution of tree topologies
• The only data source we have is our data (MSA)
• So, we must rely on our own resources: “pull up by your own bootstraps”

Bootstrap for estimating robustness

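Bootstrap

1. Create n (100-1000) new MSAs (pseudo-MSAs) by randomly sampling K positions from our original MSA with replacement

Original MSA (columns 1, 2, 3, 4, 5, …, K):
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
4 : ACCTA…T

Pseudo-MSA (sampled columns 1, 1, 2, 4, 4, …, 3):
1 : AATTT…C
2 : AATTT…C
3 : AACTT…T
4 : AACTT…C

Pseudo-MSA (sampled columns 9, 7, 4, 7, 8, …, 10):
1 : TTTTA…T
2 : CATAC…A
3 : CATAC…T
4 : AGTGG…A

Pseudo-MSA (sampled columns 5, 1, 5, 7, 8, …, 12):
1 : GAGTA…T
2 : GAGAC…G
3 : AAAAC…A
4 : AAAGG…C

[Figure: tree of Sp1, Sp2, Sp3 and Sp4]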
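Step 1 can be sketched as drawing column indices with replacement and copying those columns for every sequence. The toy alignment below uses only the first five visible columns of the example MSA (the elided part is left out), and the function names are illustrative.

```python
import random


def take_columns(msa, columns):
    """Build a new alignment from the given (0-based) column indices."""
    return {name: "".join(seq[c] for c in columns) for name, seq in msa.items()}


def pseudo_msa(msa, rng):
    """Sample as many columns as the original has, with replacement."""
    length = len(next(iter(msa.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return take_columns(msa, columns)


# First five columns of the example MSA; the remaining columns are not shown on the slide.
msa = {"1": "ATCTG", "2": "ATCTG", "3": "ACTTA", "4": "ACCTA"}

# Columns 1, 1, 2, 4, 4 of the slide correspond to 0-based indices 0, 0, 1, 3, 3.
first = take_columns(msa, [0, 0, 1, 3, 3])
print(first)

rng = random.Random(0)  # fixed seed so that the run is repeatable
replicates = [pseudo_msa(msa, rng) for _ in range(100)]
```

The same columns are used for all sequences in a replicate, so each pseudo-MSA column is an intact column of the original alignment.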

Bootstrap

2. Reconstruct a pseudo-tree from each pseudo-MSA with the same method used for reconstructing the original tree

[Figure: three pseudo-trees of Sp1, Sp2, Sp3 and Sp4, one for each of the three pseudo-MSAs (sampled columns 1, 1, 2, 4, 4, …, 3; 9, 7, 4, 7, 8, …, 10; 5, 1, 5, 7, 8, …, 12)]

Bootstrap

3. For each split in our original tree, we count the number of times it appeared in the pseudo-trees

[Figure: three pseudo-trees of Sp1, Sp2, Sp3 and Sp4, and the original tree with bootstrap support values of 67% and 100% on its splits]

In 67% of the pseudo-trees, the split between Sp1+Sp2 and the rest of the tree was found

In general, bootstrap support < 80% is considered low
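Step 3 can be sketched by listing the splits of each tree and counting how often each split of the original tree recurs. The trees below are hypothetical nested tuples chosen so that two of three pseudo-trees agree with the original; they are not the trees from the slide figure.

```python
def leaves(tree):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial splits: each internal branch divides the leaves in two groups."""
    everything = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        rest = everything - below
        if len(below) > 1 and len(rest) > 1:
            found.add(frozenset([below, rest]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def bootstrap_support(original, pseudo_trees):
    """Fraction of pseudo-trees that contain each split of the original tree."""
    return {
        split: sum(split in splits(t) for t in pseudo_trees) / len(pseudo_trees)
        for split in splits(original)
    }


original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
pseudo_trees = [
    (("Sp1", "Sp2"), ("Sp3", "Sp4")),
    (("Sp2", "Sp1"), ("Sp4", "Sp3")),
    (("Sp1", "Sp3"), ("Sp2", "Sp4")),
]
support = bootstrap_support(original, pseudo_trees)
for split, value in support.items():
    print(sorted(sorted(side) for side in split), round(100 * value))
```

Here the split between Sp1+Sp2 and Sp3+Sp4 appears in two of the three pseudo-trees, which gives a support of about 67%.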

ClustalX: NJ phylogeny reconstruction


http://phylobench.vital-it.ch/raxml-bb//

Viewing the tree with njPlot

Note: unrooted tree

Defining an outgroup

Swapping nodes

Bootstrap support

FigTree: tree visualization and figure creation

http://tree.bio.ed.ac.uk/software/figtree/

Reconstructing the tree of life

• Darwin’s vision of the tree of life from the Origin of Species
• The three-domain tree of life based on SSU rRNA MSA
• But the branching of several kingdoms remains in dispute
• Lateral Gene Transfer (LGT) challenges the conceptual basis of phylogenetic classification

Methodology

• Started with 36 genes universally present in 191 species (spanning all 3 domains of life), for which orthologs could be unambiguously identified
• Eliminated 5 genes that are LGT suspects (mostly tRNA synthetases)
• Constructed an MSA for each of the 31 orthogroups
• Concatenated all 31 MSAs to a super-MSA of 8090 columns
• The phylogeny was reconstructed based on the super-MSA using the maximum likelihood approach
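The concatenation step can be sketched as joining, for every species, its aligned rows from each per-gene MSA in a fixed gene order. The two tiny alignments and the species names below are hypothetical placeholders; the study's actual data are not reproduced here.

```python
def concatenate_msas(msas):
    """Join per-gene alignments into one super-MSA, species by species."""
    species = list(msas[0])
    for msa in msas:
        if set(msa) != set(species):
            raise ValueError("every MSA must contain the same species")
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows of one MSA must have equal length")
    return {sp: "".join(msa[sp] for msa in msas) for sp in species}


# Hypothetical per-gene alignments for three hypothetical species.
gene1 = {"sp_a": "MKT-L", "sp_b": "MKTAL", "sp_c": "MRT-L"}
gene2 = {"sp_a": "GGA", "sp_b": "G-A", "sp_c": "GGS"}
super_msa = concatenate_msas([gene1, gene2])
print(super_msa)
```

The super-MSA has as many columns as all input alignments combined, which is how 31 alignments add up to the 8090 columns mentioned above.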

[Figure: the reconstructed tree with its three domains labeled Archaea, Eukaryota and Bacteria]

http://itol.embl.de

Tree support

• 81.7% of the splits show bootstrap support of over 80%
• 65% of the splits show bootstrap support of 100%
• However, several deep splits show low support

Still, the debate goes on

“Tree of one percent of life”

• On the one hand, Ciccarelli et al. favor the claim that bacteria adhere to a bifurcating tree of life, provided that the small number of LGT genes is filtered out
• On the other hand, their filtering process left only 31 proteins, which represent ~1% of an average prokaryotic proteome and ~0.1% of a large eukaryotic proteome
• “If throwing out all non-universally distributed genes and all LGT suspects leaves a 1% tree, then we should probably abandon the tree as a working hypothesis”