Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.


Page 1: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Day 8,9 Carlow Bioinformatics

Phylogenetic inferences

Trees

Page 2: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Why do trees?

 

Page 3: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Phylogeny 101

• OTUs: operational taxonomic units (species, populations, individuals)
• Nodes internal (often ancestors)
• Nodes external (terminal, often living species, individuals)
• Branches length scaled (length proportional to evolutionary distance)
• Branches length unscaled, nominal, arbitrary
• Outgroup: an OTU that is most distantly related to all the other OTUs in the study
• Choose outgroup carefully

Page 4: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Phylogeny 102
• Trees rooted: N = (2n-3)! / (2^(n-2) (n-2)!)
• Trees unrooted: N = (2n-5)! / (2^(n-3) (n-3)!)

OTUs   #rooted trees   #unrooted trees
2      1               1
3      3               1
4      15              3
5      105             15
6      945             105
7      10395           945
8      135135          10395
9      2027025         135135
10     34459425        2027025
20     8*10^21         2*10^20
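The two counting formulas are easy to check numerically. The sketch below is only an illustration: the function names are made up here, and treating n = 2 as a single unrooted tree is a convention chosen to match the table, since the unrooted formula is undefined for n < 3.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees; n = 2 is counted as one tree."""
    if n < 3:
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print the same rows as the slide table.
for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 20):
    print(f"{n:>3} {rooted_trees(n):>24} {unrooted_trees(n):>24}")
```

Note that the number of unrooted trees for n OTUs equals the number of rooted trees for n-1 OTUs, which is why the two columns are shifted copies of each other.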

Page 5: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Four key aspects of tree

• Topology
• Branch lengths
• Root
• Confidence

[Figure: a basic tree of A, B, C and D, redrawn to illustrate each aspect; the confidence panel carries node values 100 and 78]

Page 6: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Methods

• Distance matrix
– UPGMA
– Neighbour joining NJ

• Maximum parsimony MP
– tree requiring fewest changes

• Maximum likelihood ML
– Most likely tree

• Bayesian: sort of ML
– Samples large number of “pretty good” trees

Page 7: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Trees NJ

• Distance matrix

• UPGMA (Unweighted Pair Group Method with Arithmetic mean) assumes constant rate of evolution – molecular clock: don’t publish UPGMA trees

• Neighbor joining is very fast

Often a “good enough” tree

Embedded in ClustalW

Use in publications only if too many taxa to compute with MP or ML

Page 8: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Distances from sequence

• Use Phylip Protdist or DNAdist
• D = non-identical residues / total sequence length
• Correction for multiple hits necessary because repeated changes at one site are hidden in the observed count
• Jukes-Cantor assumes all substitutions equally likely
• Kimura: transition rate ≠ transversion rate
• Ts usually > Tv

[Figure: nucleotides G, A, A]
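As a hedged illustration of the multiple-hits correction, the sketch below applies the standard Jukes-Cantor formula for nucleotides, d = -(3/4) ln(1 - 4p/3), to the Bear vs Giant panda count used later in these slides (60 differences in 536 bp). The function names are made up for this example; in practice Phylip's dnadist does this for you.

```python
from math import log


def p_distance(differences: int, sites: int) -> float:
    """Uncorrected distance: proportion of sites that differ."""
    return differences / sites


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance (substitutions per site), nucleotides only."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


p = p_distance(60, 536)  # Bear vs Giant panda
d = jukes_cantor(p)
print(f"uncorrected: {100 * p:.1f}%  corrected: {100 * d:.1f}%")
```

The corrected value is always larger than the uncorrected one, and the gap widens as sequences diverge.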

Page 9: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

UPGMA – pencil and paper trees

• Two steps

1 find smallest distance in matrix
  cluster these 2 OTUs
  branch length = half distance between OTUs

2 construct new distance matrix replacing the 2 OTUs with the cluster
  recalculate distances as average of values compared
  (always use original matrix values)

• Iterate 1 2 1 2 1 2 1 2 until one distance remains

Page 10: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Steps in detail

The UPGMA method involves the successive clustering of the most closely related pairs of species (or groups of species). UPGMA assumes that sequences have evolved with a perfect molecular clock; because of this the tree is automatically rooted. A two-step procedure is repeated:

Step 1: Look through the matrix for the smallest pairwise distance value, and join these two species (or groups of species) into a cluster. Calculate the branch length from the common ancestor to each species, as one half of the distance between the two species (or groups). In later rounds, internal branch lengths are calculated by subtraction.

Step 2: Construct a new pairwise distance matrix, in which the new cluster replaces the two species (or groups) within it. Calculate the distance values from this cluster to other species (or groups), as the average of the values for the species being compared.

Now return to step one, and repeat until the distance matrix contains only one value. At that point you can draw the final tree.

Page 11: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Mammal dataset

1> Spectacled bear Tremarctos ornatus
2> Giant panda Ailuropoda melanoleuca
3> Red panda Ailurus fulgens
4> Raccoon Procyon lotor
5> Ocelot Felis pardalis

mtDNA 16S rRNA, 536 bp compared. Numbers of nucleotide differences (above the diagonal), and percentage differences per site after correction for multiple hits by Jukes & Cantor's method (below the diagonal).

Addresses an old taxonomic puzzle: is the Red panda a bear or a raccoon?

Page 12: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Distance matrix

________________________________________________________
              Bear   G.panda  R.panda  Raccoon  Ocelot
________________________________________________________
Bear           --      60       69       72       88
Giant panda   12.1     --       89       83       99
Red panda     14.1    18.8      --       73       89
Raccoon       14.8    17.4     15.0      --       90
Ocelot        18.5    21.2     18.8     19.0      --
________________________________________________________
Round 1: cluster Bear and Giant panda @ 6.1 (rounded up to 1 decimal place)

Uncorrected distance 60/536 = 11.2%

Page 13: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Round 2

New matrix: e.g. Bear+G.panda vs Red panda = (14.1+18.8)/2 = 16.45
___________________________________________________
               Be+Gp   R.panda  Raccoon  Ocelot
___________________________________________________
Bear+G.panda    --
Red panda      16.5     --
Raccoon        16.1    15.0      --
Ocelot         19.9    18.8     19.0      --
___________________________________________________
Round 2: cluster Red panda and Raccoon @ 7.5

Page 14: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Round 3

_________________________________________
                Be+Gp   Rp+Rac  Ocelot
_________________________________________
Bear+G.panda     --
R.panda+Racc.   16.3     --
Ocelot          19.9    18.9     --
_________________________________________
Round 3: cluster (Bear+Giant panda) and (Red panda+Raccoon) @ 8.2

Page 15: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Round 4 (final)

_________________________________
              Be+Gp+Rp+Ra  Ocelot
_________________________________
Be+Gp+Rp+Ra       --
Ocelot           19.4        --
_________________________________
Round 4: cluster (Bear+Giant panda+Red panda+Raccoon) and Ocelot @ 9.7

Page 16: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

TaDAAA the tree

[Figure: UPGMA tree. Tip branches: Spect. bear 6.1, Giant panda 6.1, Red panda 7.5, Raccoon 7.5, Ocelot 9.7. Internal branches: 2.1, 0.7, 1.5]

Internal branches by subtraction, e.g. 0.7 = 8.2 – 7.5

Check: 9.7 = 1.5 + 2.1 + 6.1
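The four rounds above can be replayed with a short script. This is a minimal sketch, not a replacement for a real phylogenetics package: it hard-codes the corrected percentage distances from the mammal matrix, and the names `DIST`, `average_distance` and `upgma` are invented for the example. Cluster-to-cluster distances are always averaged from the original matrix values, as the pencil-and-paper recipe requires.

```python
from itertools import combinations

# Jukes-Cantor percentage distances (below the diagonal of the mammal matrix).
DIST = {
    frozenset(("Bear", "G.panda")): 12.1,
    frozenset(("Bear", "R.panda")): 14.1,
    frozenset(("Bear", "Raccoon")): 14.8,
    frozenset(("Bear", "Ocelot")): 18.5,
    frozenset(("G.panda", "R.panda")): 18.8,
    frozenset(("G.panda", "Raccoon")): 17.4,
    frozenset(("G.panda", "Ocelot")): 21.2,
    frozenset(("R.panda", "Raccoon")): 15.0,
    frozenset(("R.panda", "Ocelot")): 18.8,
    frozenset(("Raccoon", "Ocelot")): 19.0,
}
TAXA = ["Bear", "G.panda", "R.panda", "Raccoon", "Ocelot"]


def average_distance(a, b):
    """Mean of the original pairwise distances between members of clusters a and b."""
    return sum(DIST[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def upgma(taxa):
    """Return the joins in order as (cluster_a, cluster_b, node_height)."""
    clusters = [(t,) for t in taxa]
    joins = []
    while len(clusters) > 1:
        # Step 1: find the closest pair; node height is half their distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        joins.append((a, b, average_distance(a, b) / 2))
        # Step 2: replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return joins


for a, b, height in upgma(TAXA):
    print(f"{' + '.join(a)}  |  {' + '.join(b)}  @ {height:.2f}")
```

Working from unrounded averages, the node heights come out at about 6.05, 7.50, 8.14 and 9.69. The slides quote 6.1, 7.5, 8.2 and 9.7; the 8.2 appears to come from halving the already rounded 16.3.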

Page 17: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Trees MP

• Maximum parsimony

• Minimum # mutations to construct tree

• Better than NJ – information lost in distance matrix – but much slower

• Sensitive to long-branch attraction
– Long branches clustered together

• No explicit evolutionary model

• Protpars refuses to estimate branch lengths

• Informative sites

Page 18: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Long-branch attraction

Rodents evolve faster than primates

[Figure: true tree and false “LBA” tree for MusHBA, MusHBB, HumHBA and HumHBB]

Page 19: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Trees ML

• Very CPU intensive
• Requires explicit model of evolution – rate and pattern of nucleotide substitution
– JC Jukes/Cantor
– K2P Kimura 2 parameter transition/transversion
– F81 Felsenstein – base composition bias
– HKY85 merges K2P and F81
• Explicit model -> preferred statistically
• Assumes change more likely on long branch
– So no long-branch attraction
• But wrong model -> wrong tree

Page 20: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Models of sequence evolution

HKY85

[Figure: HKY85 substitution rate matrix, rows and columns A, C, G, T]

 

Page 21: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Bayesian methods

• ML unsatisfactory because only best tree identified

• Bayesian methods investigate a sample of highly likely trees

• MrBayes is the program

• Option to specify “prior probabilities” for
– Tree topology (can force only “sensible” trees)
– Branch lengths (usually equal lengths, but rodents known to evolve faster than primates)
– Rate matrix parameters

Page 22: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Maximum parsimony

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
              *   *   *

It is a good alignment clearly aligning homologous sites without gaps.

Here we have a representative alignment. Want to determine the phylogenetic relationships among the OTUs

Page 23: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

There are 3 possible trees for 4 taxa (OTUs):

1       3      1       2      1       2
 \_____/        \_____/        \_____/
 /     \        /     \        /     \
2       4      3       4      4       3

Or (1,2)(3,4) (1,3)(2,4) and (1,4)(2,3)

Aim to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.

Page 24: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

The identical sites 1, 6, 8 are useless for phylogenetic purposes.

 

Site: 1 2 3 4 5 6 7 8 9

OTU1 A A G A G T G C A

OTU2 A G C C G T G C G

OTU3 A G A T A T C C A

OTU4 A G A G A T C C G

* * *

Page 25: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Site 2 also useless: OTU1’s A could be grouped with any of the Gs.

Site: 1 2 3 4 5 6 7 8 9
OTU1  A A G A G T G C A
OTU2  A G C C G T G C G
OTU3  A G A T A T C C A
OTU4  A G A G A T C C G
              *   *   *

Page 26: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Site 4 is uninformative as each base is different, UNLESS transitions are weighted, in which case (1,4)(2,3) is favoured.

Site: 1 2 3 4 5 6 7 8 9

OTU1 A A G A G T G C A

OTU2 A G C C G T G C G

OTU3 A G A T A T C C A

OTU4 A G A G A T C C G

* * *

Page 27: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

For site 3 each tree can be made with (minimum) 2 mutations:

Site: 1 2 3 4 5 6 7 8 9

OTU1 A A G A G T G C A

OTU2 A G C C G T G C G

OTU3 A G A T A T C C A

OTU4 A G A G A T C C G

* * *

Page 28: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

(1,2)(3,4)

G       A      G       A      G       A
 \     /        \     /        \     /
  G---A          C---A          A---A
 /     \        /     \        /     \
C       A      C       A      C       A

Page 29: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

(1,3)(2,4)

G       C      can do worse:   G       C
 \     /                        \     /
  A---A                          G---A
 /     \                        /     \
A       A                      A       A

Page 30: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

(1,4)(2,3)

G C

\ /

A---A

/ \

A A

So site 3 is (counterintuitively) NOT informative

Page 31: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Site 5, however, is informative because one tree is shortest.

Site: 1 2 3 4 5 6 7 8 9

OTU1 A A G A G T G C A

OTU2 A G C C G T G C G

OTU3 A G A T A T C C A

OTU4 A G A G A T C C G

* * *

Page 32: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

(1,2)(3,4)     (1,3)(2,4)     (1,4)(2,3)

G       A      G       G      G       G
 \     /        \     /        \     /
  G---A          A---A          G---G
 /     \        /     \        /     \
G       A      A       A      A       A

Page 33: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Likewise sites 7 and 9. By majority rule, the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites.

Site: 1 2 3 4 5 6 7 8 9

OTU1 A A G A G T G C A

OTU2 A G C C G T G C G

OTU3 A G A T A T C C A

OTU4 A G A G A T C C G

* * *
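The site-by-site reasoning above can be reproduced with a small script. The sketch below scores each of the three four-taxon trees at every site using Fitch's counting rule (intersection if possible, otherwise union plus one change). All names (`ALIGNMENT`, `TREES`, `fitch_join`, `site_cost`) are invented for this illustration, and all changes are weighted equally, so the transition-weighting caveat for site 4 is not modelled.

```python
ALIGNMENT = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}
TREES = {
    "(1,2)(3,4)": ((1, 2), (3, 4)),
    "(1,3)(2,4)": ((1, 3), (2, 4)),
    "(1,4)(2,3)": ((1, 4), (2, 3)),
}


def fitch_join(s1, s2):
    """Fitch rule: keep the intersection for free, else take the union at a cost of 1."""
    common = s1 & s2
    return (common, 0) if common else (s1 | s2, 1)


def site_cost(site, split):
    """Minimum number of changes at one site (1-based) for a four-taxon tree."""
    (a, b), (c, d) = split
    base = lambda otu: {ALIGNMENT[otu][site - 1]}
    left, n1 = fitch_join(base(a), base(b))
    right, n2 = fitch_join(base(c), base(d))
    _, n3 = fitch_join(left, right)
    return n1 + n2 + n3


costs = {
    site: {name: site_cost(site, split) for name, split in TREES.items()}
    for site in range(1, 10)
}
# A site is informative if the trees do not all need the same number of changes.
informative = [site for site, c in costs.items() if len(set(c.values())) > 1]
totals = {name: sum(costs[site][name] for site in costs) for name in TREES}

print("informative sites:", informative)
for name, steps in totals.items():
    print(name, "requires", steps, "steps")
```

Summing over all nine sites gives the same winner as counting votes among the informative sites, because the uninformative sites add the same number of steps to every tree.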

Page 34: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Protpars

infile:

8 370

BRU MSQNSLRLVE DNSV-DKTKA LDAALSQIER

RLR ---------- ---V-DKSKA LEAALSQIER

NGR ---------- -MSD-DKSKA LAAALAQIEK

ECO ---------- AIDE-NKQKA LAAALGQIEK

YPR ---------M AIDE-NKQKA LAAALGQIEK

PSE ---------- -MDD-NKKRA LAAALGQIER

TTH ---------- -MEE-NKRKS LENALKTIEK

ACD ---------- -MDEPGGKIE FSPAFMQIEG

Page 35: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Protpars

• treefile:
(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);

Page 36: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

• outfile:
One most parsimonious tree found:

                   +--ACD
           +-------7
           !       +--TTH
        +--6
        !  !    +-----PSE
        !  +----5
     +--3       !  +--YPR
     !  !       +--4
     !  !          +--ECO
  +--2  !
  !  !  +-------------NGR
--1  !
  !  +----------------RLR
  !
  +-------------------BRU

remember: this is an unrooted tree!

requires a total of 853.000 steps

Page 37: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Clustalw

****** PHYLOGENETIC TREE MENU ******

 1. Input an alignment
 2. Exclude positions with gaps? = ON
 3. Correct for multiple substitutions? = ON
 4. Draw tree now
 5. Bootstrap tree
 6. Output format options

 S. Execute a system command
 H. HELP
 or press [RETURN] to go back to main menu

Page 38: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

ClustalW trees

• Don’t use the .dnd file as a final tree
– It’s only a temporary pairwise dendrogram/tree

• Always correct for multiple hits/substitutions

• Usually toss all gaps

Page 39: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

ClustalW NJ

• (((ACD:0.28958,TTH:0.32705):0.03395,((BRU:0.07321,RLR:0.07032):0.11692,NGR:0.21168):0.02493):0.02092,(ECO:0.05022,YPR:0.05736):0.11997,PSE:0.15632);

• topologically the same as
(((ACD,TTH),((BRU,RLR),NGR)),(ECO,YPR),PSE);

and compare to Protpars:
(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
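Two Newick strings that look different can still describe the same unrooted tree, so it helps to compare them by their bipartitions (splits) rather than by eye. The sketch below is a minimal, assumption-laden parser written for this example only: it handles plain taxon names and optional numeric branch lengths, nothing more, and the function names `clades` and `splits` are invented here. A split is stored as the side that does not contain the alphabetically first taxon.

```python
import re

NJ = ("(((ACD:0.28958,TTH:0.32705):0.03395,((BRU:0.07321,RLR:0.07032):0.11692,"
      "NGR:0.21168):0.02493):0.02092,(ECO:0.05022,YPR:0.05736):0.11997,PSE:0.15632);")
PROTPARS = "(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);"


def clades(newick):
    """Leaf sets found under every pair of parentheses in a Newick string."""
    text = re.sub(r":[-+0-9.eE]+", "", newick)  # drop branch lengths
    text = "".join(text.split()).rstrip(";")
    stack, found, name = [], [], ""
    for ch in text:
        if ch == "(":
            stack.append(set())
        elif ch in ",)":
            if name:  # a taxon name just ended
                stack[-1].add(name)
                name = ""
            if ch == ")":  # a clade just closed
                done = stack.pop()
                found.append(frozenset(done))
                if stack:
                    stack[-1] |= done
        else:
            name += ch
    return found


def splits(newick):
    """Non-trivial bipartitions of an unrooted tree, in a canonical form."""
    found = clades(newick)
    taxa = max(found, key=len)  # the outermost parentheses hold every taxon
    ref = min(taxa)
    result = set()
    for clade in found:
        other = taxa - clade
        if len(clade) > 1 and len(other) > 1:
            result.add(clade if ref not in clade else other)
    return result


shared = splits(NJ) & splits(PROTPARS)
print(len(shared), "shared splits;",
      len(splits(NJ) ^ splits(PROTPARS)), "splits found in only one tree")
```

On the two strings above the split sets are identical (five non-trivial splits each), so as unrooted trees the NJ and Protpars results have the same topology; they differ only in where the drawing happens to be rooted and, for NJ, in having branch lengths.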

Page 40: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

NJ vs ProtPars

Page 41: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Dealing with CDSs

• More info in DNA than proteins
• Systematic 3rd position changes can confuse
• Use DNA directly only if evolutionary distance is short
• For distant relationships: blank 3rd positions
• Translate into protein to align

– then copygaps back to DNA

• Use dnadist with weights to investigate rates

Page 42: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Trees

General guidelines – NOT rules

• More data is better

• Excellent alignment = few informative sites

• Exclude unreliable data – toss all gaps?

• Use seqs/sites evolving at appropriate rate
– Phylip DISTANCE
– 3rd positions saturated
– 2nd positions invariant
– Fast evolving seqs for closely related taxa
– Eliminate transition - homoplasy

Page 43: Day 8,9 Carlow Bioinformatics Phylogenetic inferences Trees.

Trees

• Beware base composition bias in unrelated taxa

• Are sites (hairpins?) independent?

• Are substitution rates equal across dataset?

• Long branches prone to error – remove them?
– Choose outgroup carefully