CS 177 Phylogenetics I
Taxonomy and phylogenetics
Phylogenetic trees
Cladistic versus phenetic analyses
Model of sequence evolution
Phylogenetic trees and networks
Cladistic and phenetic methods
Computer software and demos
Taxonomy and phylogenetics
Phylogenetic trees
Cladistic versus phenetic analyses
Homology and homoplasy
Phylogenetic Inference I
Recommended readings
Levels: (very) basic to advanced
A science primer: Phylogenetics, http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Brown, S.M. (2000) Bioinformatics, Eaton Publishing, pp. 145-160
Brown, S.M.: Molecular Phylogenetics, www.med.nyu.edu/rcr/rcr/course/PPT/phylogen.ppt
Hillis, D.M.; Moritz, C. & Mable, B.K. (1996) Molecular Systematics, 2nd edition, Sinauer Associates, 655 pp.
Mount, D.W. (2001) Bioinformatics, Cold Spring Harbor Lab Press, pp. 237-280
Evolution
The theory of evolution is the foundation upon which all of modern biology is built
From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time
It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time
Ernst Haeckel (1834-1919)
Relationships
Similarity searches and multiple alignments of sequences naturally lead to the question “How are these sequences related?” and, more generally, “How are the organisms from which these sequences come related?”
Classifying Organisms
Nomenclature is the science of naming organisms
Evolution has created an enormous diversity, so how do we deal with it?
Names allow us to talk about groups of organisms.
- Scientific names were originally descriptive phrases; not practical
- Binomial nomenclature
> Developed by Linnaeus, a Swedish naturalist
> Names are in Latin, formerly the language of science
> binomials - names consisting of two parts
> The generic name is a noun.
> The epithet is a descriptive adjective.
- Thus a species' name is two words e.g. Homo sapiens
Carolus Linnaeus (1707-1778)
Classifying Organisms
Taxonomy is the science of the classification of organisms
Taxonomy deals with the naming and ordering of taxa.
The Linnaean hierarchy:
1. Kingdom
2. Division
3. Class
4. Order
5. Family
6. Genus
7. Species
Taxonomic classification of man (Homo sapiens), ordered by evolutionary distance:
Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primata
Family: Hominidae
Genus: Homo
Species: sapiens
Subspecies: sapiens
Systematics is the science of the relationships of organisms
Systematics is the science of how organisms are related and the evidence for those relationships
Systematics is divided primarily into phylogenetics and taxonomy
Speciation -- the origin of new species from previously existing ones
- anagenesis - one species changes into another over time
- cladogenesis - one species splits to make two
Classifying Organisms
Reconstruct evolutionary history
Phylogeny
Phylogenetics
Phylogenetics is the science of the pattern of evolution.
A. Evolutionary biology is the study of the processes that generate diversity, while phylogenetics is the study of the pattern of diversity produced by those processes.
B. The central problem of phylogenetics:
1. How do we determine the relationships between species?
2. Use evidence from shared characteristics, not differences
3. Use homologies, not analogies
4. Use derived condition, not ancestral
a. synapomorphy - shared derived characteristic
b. plesiomorphy - ancestral characteristic
C. Cladistics is phylogenetics based on synapomorphies.
1. Cladistic classification creates and names taxa based only on synapomorphies.
2. This is the principle of monophyly
3. monophyletic, paraphyletic, polyphyletic
4. Cladistics is now the preferred approach to phylogeny
[Figure: the phylogeny and classification of life as proposed by Haeckel (1866)]
Phylogenetics
Evolutionary theory states that groups of similar organisms are descended from a common ancestor.
Phylogenetic systematics is a method of classifying organisms based on their evolutionary history.
It was developed by Hennig, a German entomologist, in 1950.
Willi Hennig (1913-1976)
Phylogenetics
Phylogenetics is the science of the pattern of evolution
Evolutionary biology versus phylogenetics
- Evolutionary biology is the study of the processes that generate diversity
- Phylogenetics is the study of the pattern of diversity produced by those processes
Phylogenetics
Who uses phylogenetics? Some examples:
Evolutionary biologists (e.g. reconstructing tree of life)
Systematists (e.g. classification of groups)
Anthropologists (e.g. origin of human populations)
Forensic scientists (e.g. transmission of HIV to a rape victim)
Parasitologists (e.g. phylogeny of parasites, co-evolution)
Epidemiologists (e.g. reconstruction of disease transmission)
Genomics/Proteomics (e.g. homology comparison of new proteins)
Phylogenetic trees
The central problem of phylogenetics:
how do we determine the relationships between taxa?
In phylogenetic studies, the most convenient way of presenting evolutionary relationships among a group of organisms is the phylogenetic tree
Phylogenetic trees
[Figure: phylogenetic tree of Species A, B, C, D and E, with root, branch, node and clade labeled]
Node: a branchpoint in a tree (a presumed ancestral OTU)
Branch: defines the relationship between the taxa in terms of descent and ancestry
Topology: the branching patterns of the tree
Branch length (scaled trees only): represents the number of changes that have occurred in the branch
Root: the common ancestor of all taxa
Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all of its descendants
Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains
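The terms above can be made concrete with a small sketch. It assumes a hypothetical rooted topology for five OTUs (the topology of the slide figure is not recoverable), written as nested tuples; the function name `clades` is illustrative only. Each internal node defines one clade, and rotating branches around a node leaves the set of clades, and therefore the topology, unchanged.

```python
def clades(tree):
    """Return (leaves, clade_set) for a rooted tree given as nested tuples.

    A leaf is a string (an OTU); an internal node is a tuple of subtrees.
    Each clade is the frozenset of all leaves below one internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the node itself defines a clade
    return leaves, found


# Hypothetical topology, not taken from the slide figure.
tree = ("A", (("B", "C"), ("D", "E")))
# The same tree with branches rotated around its nodes.
rotated = ((("E", "D"), ("C", "B")), "A")

_, tree_clades = clades(tree)
_, rotated_clades = clades(rotated)

# Two drawings show the same topology when they define the same clades.
same_topology = tree_clades == rotated_clades
```

Comparing clade sets is one simple way to check that two drawings describe the same rooted tree; it ignores branch lengths.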
Phylogenetic trees
There are many ways of drawing a tree
[Figure: the same tree of taxa A, E, D, C, B drawn in two different styles]
[Figure: the same tree redrawn with branches rotated around nodes; the leaf orders A E D C B = E D C B A = E C D B A all show one topology]
[Figure: a tree of taxa A, E, D, C, B with only bifurcations next to one containing a trifurcation; the two are not equal. One dimension of the drawing is marked "no meaning".]
Bifurcation versus Multifurcation (e.g. Trifurcation)
Multifurcation (also called polytomy): a node in a tree that connects more than three branches. A multifurcation may represent a lack of resolution because of too few data available for inferring the phylogeny (in which case it is said to be a soft multifurcation) or it may represent the hypothesized simultaneous splitting of several lineages (in which case it is said to be a hard multifurcation).
Phylogenetic trees
Trees can be scaled or unscaled (with or without branch lengths)
[Figure: the tree of taxa A, E, D, C, B drawn unscaled and scaled; the scaled trees carry a scale bar labeled "unit"]
Phylogenetic trees
Trees can be unrooted or rooted
[Figure: an unrooted tree of taxa A, B, C, D and rooted trees obtained by placing a root on it]
[Figure: unrooted tree of taxa A, B, C, D with its five branches numbered 1 to 5; placing the root on each branch in turn gives rooted trees 1 to 5]
These trees show five different evolutionary relationships among the taxa!
Rooted tree 1: B, A, C, D
Rooted tree 2: A, B, C, D
Rooted tree 3: A, B, C, D
Rooted tree 4: C, D, A, B
Rooted tree 5: D, C, A, B
(taxa listed in the order drawn)
Phylogenetic trees
Possible evolutionary trees
Taxa (n)   Unrooted/rooted
2          1/1
3          1/3
4          3/15
Phylogenetic trees
Possible evolutionary trees
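Number of possible trees for n taxa:
rooted: $N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$ for $n \ge 2$
unrooted: $N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$ for $n \ge 3$
Taxa (n)   Rooted        Unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025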
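The two counting formulas can be evaluated directly. The sketch below is a minimal illustration with hypothetical function names; it treats n = 2 as a special case for unrooted trees, where the factorial formula is undefined but exactly one tree exists.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1  # a single branch joining the two taxa
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print one row per number of taxa: n, rooted count, unrooted count.
for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the unrooted count for n taxa equals the rooted count for n - 1 taxa, which is why the two columns are shifted copies of each other.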
Phylogenetic trees
How to root?
Use information from ancestors
In most cases this information is not available
Phylogenetic trees
How to root?
Use statistical tools that root trees automatically (e.g. midpoint rooting)
This must involve assumptions … BEWARE!
[Figure: tree of taxa A, B, C, D with branch lengths 10, 2, 3, 5 and 2]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
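The midpoint calculation can be sketched in a few lines. The layout below is an assumption reconstructed from the numbers on the slide: A (length 10) and B (2) attach to one internal node, C (2) and D (5) to another, and the internal branch has length 3. The node names `X` and `Y` and all function names are hypothetical.

```python
from itertools import combinations

# Assumed layout: A and B attach to node X, C and D attach to node Y.
edges = {("A", "X"): 10, ("B", "X"): 2, ("X", "Y"): 3,
         ("C", "Y"): 2, ("D", "Y"): 5}


def build_graph(edges):
    """Turn the edge list into an adjacency dictionary."""
    graph = {}
    for (u, v), length in edges.items():
        graph.setdefault(u, {})[v] = length
        graph.setdefault(v, {})[u] = length
    return graph


def path(graph, start, goal, seen=None):
    """Return the unique path from start to goal in a tree."""
    seen = seen or {start}
    if start == goal:
        return [start]
    for nxt in graph[start]:
        if nxt not in seen:
            rest = path(graph, nxt, goal, seen | {nxt})
            if rest:
                return [start] + rest
    return None


def path_length(graph, nodes):
    """Sum the branch lengths along a path."""
    return sum(graph[a][b] for a, b in zip(nodes, nodes[1:]))


def midpoint_root(edges, taxa):
    """Locate the midpoint of the longest path between two taxa.

    Returns (taxon1, taxon2, distance, branch, offset), where offset is
    measured along `branch` from its first node.
    """
    graph = build_graph(edges)
    longest = max((path(graph, a, b) for a, b in combinations(taxa, 2)),
                  key=lambda p: path_length(graph, p))
    total = path_length(graph, longest)
    half = total / 2
    walked = 0
    for a, b in zip(longest, longest[1:]):
        if walked + graph[a][b] >= half:
            return longest[0], longest[-1], total, (a, b), half - walked
        walked += graph[a][b]


result = midpoint_root(edges, "ABCD")
```

Under these assumptions the most distant pair is A and D (distance 18), and the root falls on the branch leading to A, 9 units from A. Midpoint rooting assumes roughly equal rates of change on both sides of the root.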
Phylogenetic trees
How to root?
Using “outgroups”
[Figure: unrooted tree of taxa A, B, C, D with branches numbered 1 to 5 and an outgroup attached]
- the outgroup should be a taxon known to be less closely related to the rest of the taxa (ingroups)
- it should ideally be as closely related as possible to the rest of the taxa while still satisfying the above condition
Phylogenetic trees
Exercise: rooted/unrooted; scaled/unscaled
[Figure: several example trees over taxa A to F, to be classified as rooted or unrooted and as scaled or unscaled]
Phylogenetics
What are useful characters?
Use homologies, not analogies!
- Homology: common ancestry of two or more character states
- Analogy: similarity of character states not due to shared ancestry
- Homoplasy: a collection of phenomena that leads to similarities in character states for reasons other than inheritance from a common ancestor (e.g. convergence, parallelism, reversal)
Homoplasy is a huge problem in morphological data sets!
But in molecular data sets, too!
Cactaceae and Euphorbiaceae
Phylogenetics
Molecular data and homoplasy
         260         *       280         *       300         *       320
0841r : CCTTCAATTTTTATT-----------------------AGAGTTTTAGGAGAAATAAGTATGTG : 272
0992r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 213
3803r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 305
4062r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGAACAGAGTTTTAGGAGAAATAAGTATGTG : 319
3802r : CCTCCAATTTTTATTAGTTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 282
ph2f  : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 306
        CCTcCAATTTTTATTag ttgcctactcctttggg acAGAGTTTTAGGAGAAATAAGTATGTG
gene sequences represent character data
characters are positions in the sequence (not all workers agree; some say one gene is one character)
character states are the nucleotides in the sequence (or amino acids in the case of proteins)
Problems:
the probability that two nucleotides are the same just by chance mutation is 25%
what to do with insertions or deletions (which may themselves be characters)
homoplasy in sequences may cause alignment errors
Phylogenetics
Molecular data and homoplasy: Orthologs vs. Paralogs
When comparing gene sequences, it is important to distinguish between identical vs. merely similar genes in different organisms
Orthologs are homologous genes in different species that arose through speciation and usually have equivalent functions
Paralogs are homologous genes that are the result of a gene duplication
A phylogeny that includes both orthologs and paralogs is likely to be incorrect
Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes
Phylogenetics
What are useful characters?
Use derived condition, not ancestral
- Synapomorphy (shared derived character): homologous traits share the same character state because it originated in their immediate common ancestor
- Plesiomorphy (shared ancestral character): homologous traits share the same character state because they are inherited from a common distant ancestor
[Figure: example tree with the following character types labeled]
analogy
synapomorphy (shared derived character)
plesiomorphy (shared ancestral character)
autapomorphy (unique derived character)
Phenetics versus cladistics
Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic
Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes; phenograms are based on overall similarity
Cladistic methods construct trees (cladograms) that rely on assumptions about ancestral relationships as well as on current data; cladograms are based on character evolution (e.g. shared derived characters)
Cladistics is becoming the method of choice; it is considered more powerful and to provide more realistic estimates. However, it is slower than phenetic algorithms
Phenetics vs. cladistics
An example
Phenetics vs. cladistics
Phenetic (overall similarity)
[Figure: phenogram of C, B and A built from overall similarity (values 3, 4, 5)]
critter A: 4 limbs, meta. kidney, hair, endothermy, vivip., no cloaca; identity: placental
critter B: 4 limbs, meta. kidney, hair, endothermy, ovip., cloaca; identity: echidna
critter C: 4 limbs, meta. kidney, feathers, endothermy, ovip., cloaca; identity: bird
ancestor: 4 limbs, meta. kidney, no hair/feathers, ectothermy, ovip., cloaca; identity: turtle
Phenetics vs. cladistics
Cladistics (character evolution; e.g. shared derived characters)
[Figure: cladogram of A, B and C built from shared derived characters (values 1, 2, 1)]
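The two approaches can be compared on the critter example with a short sketch. The character states are taken from the example table; the order of the characters in each tuple and the function names are assumptions made for illustration. Overall similarity counts every matching state, while the cladistic count keeps only matches that differ from the ancestral state.

```python
from itertools import combinations

# States in the order: limbs, kidney, body covering, thermoregulation,
# reproduction, cloaca (this ordering is an assumption for the sketch).
critters = {
    "A": ("4 limbs", "meta. kidney", "hair", "endothermy", "vivip.", "no cloaca"),
    "B": ("4 limbs", "meta. kidney", "hair", "endothermy", "ovip.", "cloaca"),
    "C": ("4 limbs", "meta. kidney", "feathers", "endothermy", "ovip.", "cloaca"),
}
ancestor = ("4 limbs", "meta. kidney", "no hair/feathers", "ectothermy",
            "ovip.", "cloaca")


def overall_similarity(x, y):
    """Phenetic score: number of characters with identical states."""
    return sum(a == b for a, b in zip(x, y))


def shared_derived(x, y, ancestral):
    """Cladistic score: identical states that differ from the ancestor."""
    return sum(a == b and a != anc for a, b, anc in zip(x, y, ancestral))


pairs = list(combinations("ABC", 2))
phenetic = {p: overall_similarity(critters[p[0]], critters[p[1]]) for p in pairs}
cladistic = {p: shared_derived(critters[p[0]], critters[p[1]], ancestor)
             for p in pairs}
```

With these data the phenetic scores group B with C (5 shared states), whereas the shared derived characters group A with B (hair and endothermy), so the two methods give different trees.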
Model of sequence evolution
The problem
- A basic process in the evolution of a sequence is change in that sequence over time
- Now we are interested in a mathematical model to describe that
- Such a model is essential for understanding the mechanisms of change and is required to estimate both the rate of evolution and the evolutionary history of sequences
         260         *       280         *       300         *       320
0841r : CCTTCAATTTTTATT-----------------------AGAGTTTTAGGAGAAATAAGTATGTG : 272
0992r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 213
3803r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 305
4062r : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGAACAGAGTTTTAGGAGAAATAAGTATGTG : 319
3802r : CCTCCAATTTTTATTAGTTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 282
ph2f  : CCTCCAATTTTTATTAGCTTGCCTACTCCTTTGGGCACAGAGTTTTAGGAGAAATAAGTATGTG : 306
        CCTcCAATTTTTATTag ttgcctactcctttggg acAGAGTTTTAGGAGAAATAAGTATGTG
Model of sequence evolution
Nucleotide: base + sugar + phosphate
Pyrimidines (C4N2H4): cytosine (C), thymine (T)
Purines (C5N4H4): adenine (A), guanine (G)
[Figure: chemical structures of the four bases, a sugar-phosphate backbone running 5' to 3', and a diagram of substitutions among A, C, G and T]
Models of sequence evolution
Examples
Jukes-Cantor model (1969)
All substitutions have an equal probability and base frequencies are equal
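Under these assumptions the observed proportion of differing sites p underestimates the true number of substitutions per site, because repeated changes at one site are hidden. The standard Jukes-Cantor correction is $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. The sketch below applies it to two short made-up sequences; the sequences and function names are illustrative, not taken from the lecture.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, skipping positions with a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up example sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
d = jukes_cantor_distance(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as p, and the gap grows as the sequences diverge.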
Models of sequence evolution
Examples
Felsenstein (1981)
All substitutions have an equal probability, but there are unequal base frequencies
Models of sequence evolution
Examples
Kimura 2 parameter model (K2P) (1980)
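Transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse) have different probabilities; base frequencies are equal
Purines: A, G
Pyrimidines: C, T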
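A distance under this model needs the proportions of transitional (P) and transversional (Q) differences separately. The standard K2P estimate is $d = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$. The sketch below is a minimal illustration on made-up sequences; all names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Label a pair of bases as same, transition or transversion."""
    if a == b:
        return "same"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance in substitutions per site."""
    # Keep only aligned positions where both sequences have a base.
    pairs = [(a, b) for a, b in zip(seq1, seq2)
             if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    P = sum(classify(a, b) == "transition" for a, b in pairs) / n
    Q = sum(classify(a, b) == "transversion" for a, b in pairs) / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Made-up sequences: one transition (A/G) and one transversion (A/C).
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
print(round(d, 4))
```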
Models of sequence evolution
Examples
Hasegawa, Kishino & Yano (HKY) (1985)
Transitions and transversions have different probabilities; base frequencies are unequal
Models of sequence evolution
Examples
General time reversible model (GTR)
Different probabilities for each substitution; base frequencies are unequal
Models of sequence evolution
[Figure: the models compared side by side as substitution diagrams among A, C, G and T: Jukes-Cantor, Felsenstein, K2P, HKY and GTR]
More models of sequence evolution …
Currently, there are more than 60 models described
- plus gamma distribution and invariable sites
- accuracy of models rapidly decreases for highly divergent sequences
- problem: more complicated models have more parameters to estimate, so they tend to be less accurate (and slower)
How to pick an appropriate model?
- use a likelihood ratio test on the maximum likelihood scores of competing models
- implemented in Modeltest 3.06 (Posada & Crandall, 1998)
More models of sequence evolution …
Example for Modeltest file
A. Likelihood scores of the candidate models
JC = 3158.0095
F81 = 3121.2188
K80 = 2994.6611
HKY = 2924.4182
TrNef = 2994.5491
TrN = 2923.6340
K81 = 2987.6548
K81uf = 2923.5620
TIMef = 2987.6196
TIM = 2922.9878
TVMef = 2983.3450
TVM = 2922.1970
SYM = 2983.3069
GTR = 2921.1187
B. Likelihood ratio test: equal base frequencies
Null model = JC, -lnL0 = 3369.2803
Alternative model = F81, -lnL1 = 3342.5513
2(lnL1-lnL0) = 53.4580, df = 3
P-value = <0.000001
C. Model selected: TVM+G
-lnL = 2911.3660
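The likelihood ratio test for equal base frequencies can be reproduced by hand from the two scores in the example. The sketch below assumes the test statistic is compared against a chi-square distribution with 3 degrees of freedom (F81 adds three free base frequencies to JC) and uses the closed-form upper tail for that case; the function name is hypothetical and this is not Modeltest's own code.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


neg_lnL_null = 3369.2803  # -lnL0, JC
neg_lnL_alt = 3342.5513   # -lnL1, F81

# 2(lnL1 - lnL0), written in terms of the negative log-likelihoods.
statistic = 2 * (neg_lnL_null - neg_lnL_alt)
p_value = chi2_sf_df3(statistic)

print(round(statistic, 4), p_value < 0.000001)
```

A small p-value means the extra parameters of the alternative model improve the fit significantly, so the null model (here, equal base frequencies) is rejected.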
Did the Florida dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the dentist, patients A to G and local controls 2, 3, 9 and 35, with Yes/No labels]
From Ou et al. (1992) and Page & Holmes (1998)