Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20...

25
Phylogenetics Alexei Drummond

Transcript of Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20...

Page 1: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Phylogenetics

Alexei Drummond

Page 2: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

2CS369 2007

Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there?

(A) 2027025

(B) 34459425

(C) 8.20 1021

(D) 3.21 1070

Bonus question: What about unrooted trees?

Page 3: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 3

Computational Biology

Multiple sequence alignment

Global Local

Evolutionary tree reconstruction

Substitution matrices

Pairwise sequence alignment (global and local)

Database searching

BLAST

Sequence statistics

Adapted from slide by Dannie Durant

Page 4: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Molecules as Documents of Evolutionary History

• Macromolecules contain information about the processes and history that formed them

HIV-1 (UK) ATCGGATGCTAAAGCATATGACACAGAGGTACATAATGTTTHIV-1 (USA) ATCAGATGCTAGAGCTTATGATACAGAGGTACA---TGTTT

• However, this information is often fragmentary, camouflaged or lost completely

• One of the aims of computational biology is to recover as much of this information as possible and decipher its meaning

Page 5: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Phylogenetics

• Views similarity (homology) as evidence of common ancestry– Homology: similarity that is the result of inheritance from a

common ancestor

• Uses tree diagrams to portray relationships based upon recency of common ancestry

• Monophyletic groups (clades) - contain species which are more closely related to each other than to any outside of the group

• Phylogenetics has in recent years become a statistical science based on probabilistic models of evolution.

Page 6: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Bacterium 1

Bacterium 3Bacterium 2

Eukaryote 1

Eukaryote 4Eukaryote 3Eukaryote 2

Bacterium 1

Bacterium 3Bacterium 2

Eukaryote 1

Eukaryote 4Eukaryote 3

Eukaryote 2

Types of Phylogenies

• Cladograms show clusters– Branch lengths are

meaningless

• Phylograms show clusters and branch lengths– Branch lengths can represent time

or genetic distance– Vertical dimension is meaningless

Page 7: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Rooting trees using an outgroup

archaea

archaea

archaea

eukaryote

eukaryote

eukaryote

eukaryote

bacteria outgroup

root of ingroup

eukaryote

eukaryote

eukaryote

eukaryote

archaea

archaea

archaea

Monophyletic

Group (clade)

Unrooted tree

Rooted by outgroup

Monophyletic

Group (clade)

Page 8: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 8

Anatomy of a tree

Bacterium 1

Bacterium 3

Bacterium 2

Eukaryote 1

Eukaryote 4

Eukaryote 3

Eukaryote 2

External branch or edge

Internal branch or edge

Internal node External node or tip

Taxon

Root

Page 9: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Problems in Phylogenetics

• Correctly aligning multiple sequences• Choosing an evolutionary model of

sequence change– To estimate the genetic distance between

sequences

• Inferring phylogenetic trees• Testing evolutionary hypotheses

– (we won’t cover this material in 369)

Page 10: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

45678910

2048136

15105945

10395135135202702534459425

8.20 1021 3.21 1070

2.11 10267

enumerable by handenumerable by hand on a rainy dayenumerable by computerstill searchable very quickly on computera bit more than the number of hairs on your headGreater than the population of Auckland≈ upper limit for exhaustive searching; about the number

of possible combinations of numbers in the UK National Lottery≈ upper limit for branch-and-bound searching≈ the number of particles in the universenumber of trees to choose from in the “Out of Africa” data (Vigilant et al., 1991)

n #trees

How many trees are there?

For n taxa there are (2n – 3)! = (2n – 3)(2n – 5)...(3)(1)

rooted, binary trees:

Page 11: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

ABCDE

0

1

00

00

0

00

11

0

00

00

0

10

00

1

00

10

1

00

11

0

00

11

1

00

11

1

00

11

1

10

1

1 2 3 4 5 6 7 8 9 01

Characters

Taxa

ABCDE

Taxa

Distances

Phylogenetic Reconstruction

• There are essentially two types of data for phylogenetic tree estimation:– Distance data, usually stored in a distance matrix, e.g. DNA×DNA

hybridisation data, morphometric differences, immunological data, pairwise genetic distances

– Character data, usually stored in a character array;• e.g. multiple sequence alignment of DNA sequences, morphological

characters.

Page 12: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Phylogenetic Reconstruction

• Given the huge number of possible trees even for small data sets, we have two options:– Build one according to some clustering

algorithm– Assign a “goodness of fit” criterion (an

objective function) and find the tree(s) which optimise(s) this criterion

Page 13: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 13

DistancesNucleotide

Sites

Type of Data

UPGMA

Neighbor-Joining

Minimum Evolution

Maximum Parsimony

Maximum Likelihood

Tre

e B

uil

din

g M

eth

od

Op

tim

alit

yC

rite

rio

nC

lust

erin

gA

lgo

rith

m

Phylogenetic Reconstruction

Page 14: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

Clustering Algorithms

• The clustering algorithms are usually very fast, and simple but– there is no explicit optimality criterion, so

• we have no measure of how good the tree is!• we do not get any idea about other potential trees –

were there any better trees?

• Common methods are Neighbour-Joining and UPGMA.

Page 15: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

A B

Node 1

* NJ uses rate-corrected distances

Clustering Algorithms

• The UPGMA and neighbor-joining (NJ) methods are both greedy heuristics which join, at each step, the two closest* sub-trees that are not already joined.

• They are based on the minimum evolution principle.• An important concept in both of these methods is a pair of

neighbors, which is defined as two nodes that are connected via a single node:

Page 16: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 16

UPGMA Example

A B C D

A 0

B 8 0

C 7 9 0

D 12 14 11 0

A

C

3.5

3.5

dB (AC ) = (dBA + dBC ) /2 = (8 + 9) /2 = 8.5

dD(AC ) = (dDA + dDC ) /2 = (12 +11) /2 =11.5

Page 17: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 17

UPGMA Example

d(ABC )D = (dAD + dBD + dCD ) /3 = (12 +14 +11) /3 ≈12.33

AC B D

AC 0

B 8.5 0

D 11.5 14 0

A

C

3.5

3.5

B4.25

0.75

Page 18: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 18

UPGMA Example

A

C

3.5

3.5

B4.25

0.75

ABC D

ABC 0

D 12.33 0

1.92

6.17

D

Page 19: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 19

UPGMA weaknesses

A B C D

A 0

B 8 0

C 7 9 0

D 12 14 11 0

A

B

3

5

C3

12

6

D

There is a (non clock-like) tree that fits the distancematrix exactly!

Page 20: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 20

UPGMA properties

• UPGMA assumes that the rates of evolution are clock-like.– Assumes the rate of substitution is the same on

all branches of the tree

• Produces a rooted tree

Page 21: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 21

Neighbor-joining

• Most widely-used distance based method for phylogenetic reconstruction

• UPGMA illustrated that it is not enough to pick the closest neighbors (at least when there is rate heterogeneity across branches)

• Idea: take into account averaged distances to other leaves as well

• Produces an unrooted tree

Page 22: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 22

The basic idea

• We start by moving every node i closer to all other nodes by this amount:

ri =1

n −2dij

j

• As a result the new (squashed) distances are:

dij* = dij − ri − rj

• We are pushing node i closer to all other nodes by an amount slightly more than the average distance to all other taxa.

Page 23: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 23

The basic idea

• In effect, the nodes that were far away from everything get pushed towards everything quite a lot.

• This counteracts the effect of long branches.€

dij* = dij − ri − rj

A

B

C

D

0.3

0.3

0.1

0.1

0.1

UPGMA would incorrectly group A and B, whereas NJ would reconstruct the correct tree in this case.

Page 24: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 24

Neighbor-joining

• We use an algorithm very similar to UPGMA to connect the two closest nodes, i and j, using these new squashed distances.

• We join these into a cluster and make a new node k to correspond to their ancestor, and pick distances from i, j and all other nodes to k.

• The squashed distances are updated at each step.

• See Durbin book, p171 for details.

Page 25: Phylogenetics Alexei Drummond. CS369 20072 Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) 2027025 (B) 34459425.

CS369 2007 25

Runtime of the algorithm

• Both of these clustering-based algorithms take O(n3) time once we have the distance matrix.

• There are n steps and in each step we do:– (1) find the smallest distance– (2) join these two taxa– (3) compute the distance from the new ancestor to all

others

• Step (1) takes O(n2) and the other two steps take O(n)