An Introduction to Phylogenetics > Sequence 1 GAGGTAGTAATTAGATCCGAAA… > Sequence 2 GAGGTAGTAATTAGATCTGAAA… > Sequence 3 GAGGTAGTAATTAGATCTGTCA… Anton E. Weisstein Indiana State University March 11-14, 2004

Page 1

An Introduction to Phylogenetics

> Sequence 1
GAGGTAGTAATTAGATCCGAAA…
> Sequence 2
GAGGTAGTAATTAGATCTGAAA…
> Sequence 3
GAGGTAGTAATTAGATCTGTCA…

Anton E. Weisstein

Indiana State University

March 11-14, 2004

Page 2

Outline

I. Overview

II. Building and Interpreting Phylogenies

III. Evolutionary Inference

IV. Specific Applications

Page 3

What is phylogenetics?

Phylogenetics is the study of evolutionary relationships.

Relationships among species:

[Figure: tree with tips crocodiles, birds, lizards, snakes, rodents, primates, marsupials]

Page 4

What is phylogenetics?

Relationships among species:

[Figure: tree with tips crocodiles, birds, lizards, snakes, rodents, primates, marsupials]

This is an example of a phylogenetic tree.

Page 5

What is phylogenetics?

Relationships within species: HIV subtypes

[Figure: tree of HIV isolates grouped into subtypes A, B, C, D, F, and G. Tips are labeled by country: Rwanda, Ivory Coast, Uganda, U.S., Italy, U.K., India, Ethiopia, S. Africa, Tanzania, Romania, Brazil, Cameroon, Netherlands, Taiwan, Russia.]

Page 6

So what is phylogenetics good for?

Phylogenetics has direct applications to:

• Conservation: test wood, ivory, meat products for poaching

• Agriculture: analyze specific differences between cultivars

• Forensics: DNA fingerprinting

• Medicine: determine specific biochemical function of cancer-causing genes

Page 7

HIV Example 1: Florida dentist case

1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?

Page 8

Outline

I. Overview

II. Building and Interpreting Phylogenies

III. Evolutionary Inference

IV. Specific Applications

Page 9

Phylogenetic concepts: Interpreting a Phylogeny

[Figure: tree with tips Sequence A, Sequence B, Sequence C, Sequence D, Sequence E; axis: Time]

Which sequence is most closely related to B?

A, because B diverged from A more recently than from any other sequence.

Physical position in tree is not meaningful! Only tree structure matters.
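A minimal Python sketch of this reading rule. The topology below is hypothetical (the slide's figure is not reproduced in this transcript); it is chosen only so that A and B are sister sequences, as the slide states. The nested-tuple encoding and the helper names `tips` and `closest_relatives` are illustrative assumptions, not part of the original deck.

```python
# Hypothetical rooted tree as nested tuples (an assumption, not the slide's figure).
tree = ((("A", "B"), "C"), ("D", "E"))
# The same tree with every pair of branches drawn in the opposite order.
flipped = (("E", "D"), ("C", ("B", "A")))


def tips(node):
    """Return the set of tip labels found under a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def closest_relatives(node, target):
    """Tips in the smallest clade that contains target, excluding target itself."""
    if isinstance(node, str):
        return None  # a single tip has no relatives below it
    for child in node:
        if target in tips(child):
            found = closest_relatives(child, target)
            if found:  # a smaller clade already holds relatives
                return found
            return tips(node) - {target}
    return None  # target is not in this subtree


print(closest_relatives(tree, "B"))     # {'A'}
print(closest_relatives(flipped, "B"))  # {'A'}
```

Both drawings give the same answer because the function looks only at which tips share a clade, never at the left-to-right order of the branches.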

Page 10

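Phylogenetic concepts: Rooted and Unrooted Trees

[Figure: a rooted tree of A, B, C, D drawn on a Time axis with its root marked, compared with unrooted trees of the same four sequences. Surviving labels: Root, X, =, =?, and ? at unknown positions.]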

Page 11

How Many Trees?

                                      Unrooted trees                    Rooted trees
# sequences   # pairwise distances    # trees        # branches/tree    # trees        # branches/tree
3             3                       1              3                  3              4
4             6                       3              5                  15             6
5             10                      15             7                  105            8
6             15                      105            9                  945            10
10            45                      2,027,025      17                 34,459,425     18
30            435                     8.69 × 10^36   57                 4.95 × 10^38   58

For N sequences:

# pairwise distances: $\frac{N(N-1)}{2}$

Unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$ trees, with $2N-3$ branches per tree

Rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$ trees, with $2N-2$ branches per tree
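The counts in this table can be reproduced with a short Python sketch of the general formulas. The function names are illustrative assumptions, and the formulas are taken to apply for N ≥ 3.

```python
from math import factorial


def pairwise_distances(n):
    """Number of distinct pairs among n sequences: N(N - 1) / 2."""
    return n * (n - 1) // 2


def unrooted_trees(n):
    """Number of unrooted bifurcating trees: (2N - 5)! / (2^(N-3) (N - 3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted bifurcating trees: (2N - 3)! / (2^(N-2) (N - 2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 6, 10):
    print(n, pairwise_distances(n), unrooted_trees(n), 2 * n - 3,
          rooted_trees(n), 2 * n - 2)
```

Integer division is exact here because each formula reduces to a product of odd numbers, so the results stay exact even for 30 sequences.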

Page 12

Tree Types

[Figure: rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, bats; scale bar: 50 million years]

Evolutionary trees measure time.

[Figure: rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, bats; scale bar: 5% change]

Phylograms measure change.

Page 13

Tree Properties

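Ultrametricity: All tips are an equal distance from the root.

[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]

a = b + c + d + e

Additivity: Distance between any two tips equals the total branch length between them.

[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]

XY = a + b + c + d + e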

In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
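Both properties can be checked mechanically once a tree has branch lengths. The sketch below assumes a hypothetical three-tip tree stored as a parent map; the node names, branch lengths, and helper functions are all illustrative and do not come from the slides.

```python
# Hypothetical tree: each node maps to (parent, length of the branch to that parent).
tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "X": ("root", 3.0),
    "Y": ("n1", 2.0),
    "Z": ("n1", 2.0),
}


def ancestors(tree, node):
    """Nodes on the path from node up to the root, inclusive."""
    chain = [node]
    while tree[chain[-1]][0] is not None:
        chain.append(tree[chain[-1]][0])
    return chain


def root_distance(tree, node):
    """Total branch length between node and the root."""
    return sum(tree[n][1] for n in ancestors(tree, node))


def tip_distance(tree, a, b):
    """Additive distance: total branch length on the path between two tips."""
    shared = set(ancestors(tree, b))
    mrca = next(n for n in ancestors(tree, a) if n in shared)
    return (root_distance(tree, a) + root_distance(tree, b)
            - 2 * root_distance(tree, mrca))


def is_ultrametric(tree, tips, tol=1e-9):
    """True if every tip is the same distance from the root."""
    depths = [root_distance(tree, t) for t in tips]
    return max(depths) - min(depths) <= tol


print(is_ultrametric(tree, ["X", "Y", "Z"]))  # True
print(tip_distance(tree, "X", "Y"))           # 6.0
print(tip_distance(tree, "Y", "Z"))           # 4.0
```

Changing any single tip branch length in this example breaks ultrametricity, while tip-to-tip distances remain additive by construction.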

Page 14

Tree Building Exercise

Ultrametricity: All tips are an equal distance from the root.

[Figure: rooted tree with tips X and Y and branch lengths a, b, c, d, e]

a = b + c + d + e

Using the distance matrix given, construct an ultrametric tree.
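The distance matrix for this exercise is not reproduced in the transcript. As an illustration only, the sketch below builds an ultrametric tree from a made-up matrix using UPGMA (average-linkage clustering). UPGMA is not named on the slide; it is swapped in here as one standard way to obtain an ultrametric tree, and the function name and data layout are assumptions.

```python
from itertools import combinations


def upgma(dist):
    """Cluster tips by average linkage.

    dist maps (tip, tip) pairs to distances. Returns the root as a nested
    tuple plus a dict giving the height of every internal node.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    sizes = {tip: 1 for pair in dist for tip in pair}  # cluster -> number of tips
    heights = {}
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        heights[merged] = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (root,) = sizes
    return root, heights


# Made-up distances, chosen to be exactly ultrametric.
example = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
root, heights = upgma(example)
print(root)
print(heights[root])  # 5.0
```

Each internal node sits at half the distance between the clusters it joins, so every tip ends up the same distance from the root.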

Page 15

Phylogenetic Methods

Many different procedures exist. Three of the most popular:

Neighbor-joining
• Minimizes distance between nearest neighbors

Maximum parsimony
• Minimizes total evolutionary change

Maximum likelihood
• Maximizes likelihood of observed data

Page 16

Comparison of Methods

Neighbor-joining
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)

Maximum likelihood
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods
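To make "pairwise distances" concrete, here is a small Python sketch that computes them for the three sequences on the title slide. Only the first 22 bases are used, because the slide truncates the rest. The distance measure (the proportion of differing sites) is an assumption chosen for simplicity; the slides do not say which measure is intended.

```python
from itertools import combinations

# First 22 bases as shown on the title slide; the remainder is elided there.
seqs = {
    "Sequence 1": "GAGGTAGTAATTAGATCCGAAA",
    "Sequence 2": "GAGGTAGTAATTAGATCTGAAA",
    "Sequence 3": "GAGGTAGTAATTAGATCTGTCA",
}


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


distances = {
    pair: p_distance(seqs[pair[0]], seqs[pair[1]])
    for pair in combinations(seqs, 2)
}
for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d:.3f}")
```

Over these 22 sites, Sequences 1 and 2 differ at 1 site, Sequences 2 and 3 at 2 sites, and Sequences 1 and 3 at 3 sites. A distance method such as neighbor-joining would see only these three numbers, not the sites themselves.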

Page 17

Which procedure should we use?

Neighbor-joining

Maximum parsimony

Maximum likelihood

All that we can!

• Each method has its own strengths

• Use multiple methods for cross-validation

• In some cases, none of the three gives the correct phylogeny!

Page 18

Outline

I. Overview

II. Building and Interpreting Phylogenies

III. Evolutionary Inference

IV. Specific Applications

Page 19

Phylogenetic concepts: Homology and Homoplasy

Homology: identical character due to shared ancestry (evolutionary signal)

Homoplasy: identical character due to evolutionary convergence or reversal (evolutionary noise)

[Figure, three panels:
Homology: tree of lizards, snakes, rodents, primates, marked +hair once.
Homoplasy (Convergence): tree of birds, snakes, rodents, bats, marked +flight twice.
Homoplasy (Reversal): tree of worms, lizards, snakes, marked +legs and –legs.]
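One way to see the difference between signal and noise is to count the minimum number of character changes a tree requires. The sketch below uses Fitch's parsimony algorithm, which the slides do not name; it is brought in here for illustration. The two topologies are assumptions (the figures are not reproduced in this transcript), with 1 meaning the trait is present and 0 meaning it is absent.

```python
def fitch(node, states):
    """Return (candidate ancestral states, minimum number of changes) for one character."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The subtrees disagree: one change must have happened on this split.
    return left_set | right_set, left_cost + right_cost + 1


# Assumed topologies, for illustration only.
hair_tree = (("lizards", "snakes"), ("rodents", "primates"))
hair = {"lizards": 0, "snakes": 0, "rodents": 1, "primates": 1}

flight_tree = (("birds", "snakes"), ("rodents", "bats"))
flight = {"birds": 1, "snakes": 0, "rodents": 0, "bats": 1}

print(fitch(hair_tree, hair)[1])      # 1
print(fitch(flight_tree, flight)[1])  # 2
```

Hair needs a single change on its tree, consistent with homology. Flight needs two changes on its tree, which is the signature of homoplasy; the count alone does not say whether those changes were two gains or a gain and a loss.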

Page 20

Watching the Molecular Clock

Mutation occurs as a random (Poisson) process. If mutations accumulate at a constant rate over time and across all branches, the phylogeny is said to obey a molecular clock.

[Figure: tree of samples from 2000, 2001, and 2002; axis: % genetic difference]
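A short simulation can illustrate what a constant-rate Poisson process implies. This is a sketch under stated assumptions: the rate of 5 mutations per year is made up, and `mutations_on_branch` is a hypothetical helper that counts events by adding up exponential waiting times.

```python
import random


def mutations_on_branch(rate, duration, rng):
    """Count Poisson events in `duration` by accumulating exponential waiting times."""
    count = 0
    elapsed = rng.expovariate(rate)
    while elapsed <= duration:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


rng = random.Random(42)
rate = 5.0  # hypothetical mutations per year
for years in (1, 2):
    draws = [mutations_on_branch(rate, years, rng) for _ in range(10_000)]
    print(years, sum(draws) / len(draws))
```

Under a clock, the average number of mutations grows in proportion to elapsed time (about rate × duration), although any single branch can fall well above or below that average.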

Page 21

Watching the Molecular Clock

Mutation occurs as a random (Poisson) process. If mutations accumulate at a constant rate over time and across all branches, the phylogeny is said to obey a molecular clock.

BUT:
• Natural selection favors some mutations and eliminates others
• Selection varies over time and across lineages

[Figure: tree of samples from 2000, 2001, and 2002; axis: % genetic difference]

Page 22

Trees are hypotheses about evolutionary history

So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.

Page 23

Tree Testing: Split Decomposition

Split decomposition is one method for testing a tree.

Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. How many such trees are there?

[Figure: three unrooted four-taxon trees, with tips drawn in the orders A B C D, A D B C, and A C B D]

Only one of these topologies is right. How can we quantitatively assess the support for each tree?

Page 24

Tree Testing: Split Decomposition

The correct tree should be approximately additive; the others usually will not be. For each tree, we calculate split indices that estimate the length of the internal branch:

[Equation: drawn on the slide with four-taxon tree diagrams in place of symbols. Surviving fragments: two terms joined by "+", drawn with tips A D B C and A C B D, divided by 2, with the note "if" the tree drawn with tips A C B D "is the right phylogeny!"]

Large split indices → Long internal branch → Topology strongly supported

Small split indices → Short internal branch → Topology weakly supported

Negative split indices → Biologically impossible → Topology probably wrong
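The slide's own formula is drawn with tree diagrams and does not survive in this transcript. The sketch below therefore uses one common formulation based on additivity, which is an assumption: if the topology pairing (a, b) against (c, d) is correct and distances are additive, each cross-pair sum exceeds the within-pair sum by twice the internal branch length. The distances are made up and come from a hypothetical additive tree whose internal branch has length 3.

```python
def split_indices(dist, pair1, pair2):
    """Two estimates of the internal branch length for the topology pair1 | pair2."""
    (a, b), (c, d) = pair1, pair2

    def D(x, y):
        return dist[frozenset((x, y))]

    within = D(a, b) + D(c, d)
    return (
        (D(a, c) + D(b, d) - within) / 2,
        (D(a, d) + D(b, c) - within) / 2,
    )


# Made-up additive distances for the tree (A, B | C, D) with internal branch 3.
dist = {
    frozenset("AB"): 3, frozenset("CD"): 3,
    frozenset("AC"): 5, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 7,
}

print(split_indices(dist, ("A", "B"), ("C", "D")))  # (3.0, 3.0)
print(split_indices(dist, ("A", "C"), ("B", "D")))  # (-3.0, 0.0)
print(split_indices(dist, ("A", "D"), ("B", "C")))  # (-3.0, 0.0)
```

Only the first topology yields two positive, matching estimates. The other two produce a negative or zero internal branch, which is the pattern the slide flags as weak or impossible support.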

Page 25

Tree Testing: Bootstrapping

Used to assess the support for individual branches

Randomly resample characters, with replacement

How often does a specific branch appear?

Repeat many times (1000 or more)

[Figure: tree of rat, human, turtle, fruit fly, oak, duckweed, with bootstrap support values 100, 98, and 73 on its branches]
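The resampling step can be sketched in Python. This is a deliberately simplified stand-in: a real bootstrap rebuilds a full tree for every replicate, whereas the hypothetical `bootstrap_pair_support` below only asks how often a chosen pair of sequences is strictly the closest pair. The sequences are the first 22 bases shown on the title slide; the rest is elided there.

```python
import random
from itertools import combinations


def differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_pair_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of replicates in which `pair` is strictly the closest pair."""
    rng = random.Random(seed)
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    hits = 0
    for _ in range(replicates):
        # Resample alignment columns (characters) with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {n: [alignment[n][i] for i in cols] for n in names}
        dists = {
            frozenset((a, b)): differences(resampled[a], resampled[b])
            for a, b in combinations(names, 2)
        }
        target = dists.pop(frozenset(pair))
        if all(target < other for other in dists.values()):
            hits += 1
    return hits / replicates


seqs = {
    "Sequence 1": "GAGGTAGTAATTAGATCCGAAA",
    "Sequence 2": "GAGGTAGTAATTAGATCTGAAA",
    "Sequence 3": "GAGGTAGTAATTAGATCTGTCA",
}
print(bootstrap_pair_support(seqs, ("Sequence 1", "Sequence 2")))
```

With only three variable sites in 22, many replicates contain few or no informative columns, so support from such a short alignment should be read cautiously.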

Page 26

Tree Testing: Bootstrapping

MacClade Example:

Vertebrate evolution

Page 27

Outline

I. Overview

II. Building and Interpreting Phylogenies

III. Evolutionary Inference

IV. Specific Applications

Page 28

HIV Example 1: Florida dentist case

• 1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?

• HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).

• Compared viral sequence from the dentist, three of his HIV+ patients, and two HIV+ local controls.

Page 29

Florida dentist case

Page 30

So what do the results mean?

• 2 of 3 patients closer to dentist than to local controls. Statistical significance? More powerful analyses?

• Do we have enough data to be confident in our conclusions? What additional data would help?

• If we determine that the dentist’s virus is linked to those of patients E and G, what are possible interpretations of this pattern? How could we test between them?