Phylogeny estimation under a model of linguistic character ... · •Comparison of different...

Phylogeny estimation under a model of linguistic character

evolution Tandy Warnow

Department of Computer Science University of Illinois at Urbana-Champaign

The Computational Historical Linguistics Projecthttp://web.engr.illinois.edu/~warnow/histling.html

Collaboration with Don Ringe beganin 1994; 17 papers since then, and two NSF grants.

Dataset generation by Ringe and Ann Taylor(then a postdoc with Ringe, now Senior Lecturerat York University).

Method development with Luay Nakhleh (thenmy student, now Chair and Professor atRice University), Steve Evans (Prof. Statistics, Berkeley). Simulation study with Francois Barbanson (then my postdoc).

Indo-European languages

From linguistica.tribe.net

Homoplasy-free evolution• When a character changes

state, it changes to a new state not in the tree; i.e., there is no homoplasy(character reversal or parallel evolution)

• First inferred for weird innovations in phonological characters and morphological characters in the 19th century, and used to establish all the major subgroups within IE

• (But…other characters also evolve without homoplasy)

0 0 0 1 1

0

1

0

0

Possible Indo-European tree(Ringe, Warnow and Taylor 2000)

Anatolian

Tocharian

Greek

Armenian

Italic

Celtic

Albanian

Germanic

Baltic Slavic

VedicIranian

Another possible Indo-European tree (Gray & Atkinson, 2004)

Italic Gmc. Celtic Baltic Slavic Alb. Indic Iranian Armenian Greek Toch. Anatolian

Gray and Atkinson Method

• Used only lexical data• Binary-encoding of all multi-state character

data to perform the analysis• Assumes the CFN (Cavender-Farris-Neyman)

model of binary sequence evolution, in which all characters evolve identically and independently down the model tree

• Uses MCMC to sample from a distribution (specifically used MrBayes software)

This talk• Introduction to molecular systematics (very fast)• Linguistic data and the Ringe-Warnow analyses of the Indo-

European language family• Perfect Phylogenetic Networks (Nakhleh et al., Language 2005)• Comparison of different phylogenetic methods on Indo-European

datasets (Nakhleh et al., Transactions of the Philological Society 2005)

• Stochastic model of linguistic character evolution (Warnow et al., MacDonald Institute for Archaeological Research, 2006)

• Simulation study evaluating different phylogenetic methods (Barbançon et al., Diachronica 2013)

• Discussion and Future work

DNA Sequence Evolution (Idealized)

AAGACTT

TGGACTTAAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT


AAGACTT

TGGACTTAAGGCCT


AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Markov Models of DNA Sequence Evolution

The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).

Markov Models of DNA Sequence Evolution


Simplest site evolution model (Jukes-Cantor, 1969):

• The model tree T is binary and has substitution probabilities p(e) on each edge e.

• The state at the root is randomly drawn from {A,C,T,G} (nucleotides)

• If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.

• The evolutionary process is Markovian.

More complex models (such as the General Markov model) are also considered, often with little change to the theory.

CFN Model of Binary Sequence Evolution


CFN model: (CFN: Cavender, Farris, Neyman):• The model tree T is binary and has substitution probabilities p(e) on each edge e.• The state at the root is randomly drawn from {0,1} (presence/absence)• If a site (position) changes on an edge, it changes with equal probability to each

of the remaining states.• The evolutionary process is Markovian.

Used for modelling trait evolution (presence/absence)

Markov Models in Biology

• All standard models have a finite number of states (2 for traits, 4 for DNA/RNA, 20 for aminoacids), and hence lots of homoplasy

• All characters (sites in an alignment) evolve identically and independently (iid) down the tree

Statistical Consistency and Identifiability

error

Data

“State of the Art” for Molecular Systematics

• Many distance-methods are statistically consistent (but not UPGMA)

• Parsimony is not statistically consistent• Maximum likelihood and Bayesian MCMC are

statistically consistent and highly favored

Model of Linguistic Character Evolution (Warnow, Evans, Ringe, and Nakhleh, 2006)

• Three types of characters: lexical, phonological, and morphological

• Infinite number of possible states• Character evolution is not iid• Borrowing can occur• Limited homoplasy enables identifiability• Statistically consistent methods presented

Our main points

• Biomolecular data evolve differently from linguistic data, and linguistic models and methods should not be based upon biological models.

• Better (more accurate) phylogenies can be obtained by formulating models and methods based upon linguistic scholarship, and using good data.

• All methods, whether explicitly based upon statistical models or not, need to be carefully tested.

Indo-European languages

From linguistica.tribe.net

Homoplasy-free evolution• When a character changes

state, it changes to a new state not in the tree; i.e., there is no homoplasy(character reversal or parallel evolution)

• First inferred for weird innovations in phonological characters and morphological characters in the 19th century, and used to establish all the major subgroups within IE 0 0 0 1 1

0

1

0

0

Historical Linguistic Data

• A character is a function that maps a set of languages, L, to a set of states.

• Three kinds of characters:– Phonological (sound changes)– Lexical (meanings based on a wordlist)– Morphological (especially inflectional)

Sound changes• Many sound changes are natural, and should not be used for

phylogenetic reconstruction.• Others are bizarre, or are composed of a sequence of simple

sound changes. These are useful for subgrouping purposes. Example: Grimm’s Law.

1. Proto-Indo-European voiceless stops change into voiceless fricatives.2. Proto-Indo-European voiced stops become voiceless stops.3. Proto-Indo-European voiced aspirated stops become voiced

fricatives.

An Indo-European lexical character: ‘hand’.

Data.

Hittite kissar Lithuanian rankà Old Prussian rānkan (acc.)

Armenian jeṙn Old English hand Latvian ròka

Greek χείρ /kʰé:r/ Old Irish lám Gothic handus

Albanian dorë Latin manus Old Norse hǫnd

Tocharian B ṣar Luvian īssaris OHG hant

Vedic hástas Lycian izredi (instr.) Welsh llaw

Avestan zastō Tocharian A tsar Oscan manim (acc.)

OCS rǫka Old Persian dasta Umbrian manf (acc. pl.)

Semantic slot for hand – coded (Partitioned into cognate classes)

Proto-Indo-European *plh2meh2 ‘flat hand’ (cf. Homeric Greek palámɛ:) > Proto-Celtic *lāmā ‘hand’ > Old Irish lám > Welch llaw Proto-Germanic *handuz ‘hand’ > Gothic handus >→ Runic Norse *handu (ending influenced by a different noun class) > Old Norse hǫnd > Proto-West Germanic *handu > Old English hand > Old High German hant Proto-Italic *man- ‘hand’ > Latin manus (transferred into the u-stems) >→ Proto-Sabellian *man- > Oscan *manis > *mans, accusative manim (transf. into the i-stems) > Umbrian *man-, accusative plural manf

Lexical characters can also evolve without homoplasy

• For every cognate class, the nodes of the tree in that class should form a connected subset - as long as there is no undetected borrowing nor parallel semantic shift.

0 0 1 1 2

1

1

1

0

Our (RWT) Data

• Ringe & Taylor (2002)– 259 lexical – 13 morphological – 22 phonological

• These data have cognate judgments estimated by Ringe and Taylor, and vetted by other Indo-Europeanists. (Alternate encodings were tested, and mostly did not change the reconstruction.)

• Polymorphic characters, and characters known to evolve in parallel, were removed.

Differences between different characters

• Lexical: most easily borrowed (most borrowings detectable), and homoplasy relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary).

• Phonological: can still be borrowed but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters very often homoplastic.

• Morphological: least easily borrowed, least likely to be homoplastic.

Our methods/models

• Ringe & Warnow �Almost Perfect Phylogeny�: most characters evolve without homoplasy under a no-common-mechanism assumption (various publications since 1995)

• Ringe, Warnow, & Nakhleh �Perfect Phylogenetic Network�: extends APP model to allow for borrowing, but assumes homoplasy-free evolution for all characters (Language, 2005)

• Warnow, Evans, Ringe & Nakhleh �Extended Markov model�: parameterizes PPN and allows for homoplasy provided that homoplastic states can be identified from the data (MacDonald Institute for Archaeological Research, 2006)

First analysis: Almost Perfect Phylogeny

• The original dataset contained 375 characters (336 lexical, 17 morphological, and 22 phonological).

• We screened the dataset to eliminate characters likely to evolve homoplastically or by borrowing.

• On this reduced dataset (259 lexical, 13 morphological, 22 phonological), we attempted to maximize the number of compatible characters while requiring that certain of the morphological and phonological characters be compatible. (Computational problem NP-hard.)

Indo-European Tree(95% of the characters compatible)

Anatolian

Tocharian

Greek

Armenian

Italic

Celtic

Albanian

Germanic

Baltic Slavic

VedicIranian

Second attempt: PPN• We explain the remaining incompatible characters by inferring

previously undetected �borrowing�.• We attempted to find a PPN (perfect phylogenetic network)

with the smallest number of contact edges, borrowing events, and with maximal feasibility with respect to the historical record. (Computational problems NP-hard).

• Our analysis produced one solution with only three contact edges that optimized each of the criteria. Two of the contact edges are well-supported.

Modelling borrowing: Networks and Trees within Networks

�Perfect Phylogenetic Network�(all characters compatible)

Anatolian

Tocharian

Greek

Armenian

Albanian

Germanic

Baltic Slavic

VedicIranian Italic

Celtic

L. Nakhleh, D. Ringe, and T. Warnow, LANGUAGE, 2005

http://www.lsadc.org/language/

Comments• This network is very �tree-like� (only three

contact edges needed to explain the data.• Two of the three contact edges are strongly

supported by the data (many characters are borrowed).

• If the third contact edge is removed, then the evolution of the remaining (two) incompatible characters needs to be explained. Probably this is parallel semantic shift.

Other IE analysesNote: many reconstructions of IE have been done, but produce

different histories which differ in significant ways

Possible issues:Dataset (modern vs. ancient data, errors in the cognancy

judgments, lexical vs. all types of characters, screened vs. unscreened)

Translation of multi-state data to binary dataReconstruction method



Only lexical data, not well curatedBinary encoding!

Gray and Atkinson Method

• Uses MCMC to sample from a distribution (specifically used MrBayes software)

• Assumes the CFN (Cavender-Farris-Neyman) model of binary sequence evolution, in which all characters evolve identically and independently down the model tree

• Used only lexical data• Binary-encoding of all multi-state character

data to perform the analysis

Semantic slot for hand – coded (Partitioned into cognate classes)

Binary Encoding of Multi-state Characters

• Imagine you have a lexical character (semantic slot) with four cognate classes

• This is replaced by four characters, one for each cognate class• The four characters are now assumed to evolve i.i.d. under the CFN

model of binary character evolution – 0 represents absence, – 1 represents presence

• Problems with binary encoding:– (0,0,0,0 can occur), indicating no word for this semantic slot– (1,1,1,1), indicating all four words for the semantic slot (i.e., the model

allows unbounded polymorphism)• Problem with CFN model applied to binary encoding

– Model assumes appearance and loss of cognates are independent of other cognates (unrealistic)



Only lexical data, not well curatedBinary encoding!

Possible Indo-European tree(Ringe, Warnow and Taylor 2000)

Anatolian

Tocharian

Greek

Armenian

Italic

Celtic

Albanian

Germanic

Baltic Slavic

VedicIranian

Three types of data, highly curatedMulti-state!

Different methods/datagive different answers.

We don’t know which answer is correct.

Which method(s)/datashould we use?

The performance of methods on an IE data set, Transactions of the Philological Society,

L. Nakhleh, T. Warnow, D. Ringe, and S.N. Evans, 2005)

Observation: Different datasets (not just different methods) can give different reconstructed phylogenies.

Objective: Explore the differences in reconstructions as a function of data (lexical alone versus lexical, morphological, and phonological), screening (to remove obviously homoplastic characters), and methods. However, we use a better basic dataset (where cognancyjudgments are more reliable).

Phylogeny reconstruction methods• Perfect Phylogenetic Networks (Ringe, Warnow, and Nakhleh)

• Other network methods

• Neighbor joining (distance-based method)

• UPGMA (distance-based method, same as glottochronology)

• Maximum parsimony (minimize number of changes)

• Maximum compatibility (weighted and unweighted)

• Gray and Atkinson (Bayesian estimation based upon presence/absence of cognates, as described in Nature 2003)

Four IE datasets

Ringe & Taylor

• The screened full dataset of 294 characters (259 lexical, 13 morphological, 22 phonological)

• The unscreened full dataset of 336 characters (297 lexical, 17 morphological, 22 phonological)

• The screened lexical dataset of 259 characters.

• The unscreened lexical dataset of 297 characters.

Results: Likely Subgroups

Other than UPGMA, all methods reconstruct

• the ten major subgroups

• Anatolian + Tocharian (that under the assumption that Anatolian is the first daughter, then Tocharian is the second daughter)

• Greco-Armenian (that Greek and Armenian are sisters)

Nothing else is consistently reconstructed.

In particular, the choice of data (lexical only, or also morphology and phonological) has an impact on the final tree.

The choice of method also has an impact!differ significantly on the datasets, and from each other.

Other observations

• UPGMA (i.e., the tree-building technique for glottochronology) does the worst (e.g. splits Italic and Iranian groups).

• The Satem Core (Indo-Iranian plus Balto-Slavic) is not always reconstructed.

• Almost all analyses put Italic, Celtic, and Germanic together:– The only exception is Weighted Maximum Compatibility

on datasets that include highly weighted morphological characters.ffer significantly on the datasets, and from each other.

Different methods/datagive different answers.

We don’t know which answer is correct.

Which method(s)/datashould we use?

Stochastic model of language evolution (Warnow et al., 2006)

Model based on linguistic scholarship

• Borrowing between languages

• Each character evolves with its own parameters (not identically to the other characters)

• When a character changes state, it usually attains a new state (so that the number of states is unbounded)

• Some homoplasy is allowed (but just one known homoplastic state per character)

This is a statistically identifiable model (if not too much borrowing)

Barbancon et al. Diachronica 2013Simulation study

Data (all generated under the Warnow et al. model) on phylogenetic networks (model trees with borrowing edges)• Lexical and morphological characters

• Phylogenetic networks with 30 leaves and 0-3 contact edges

• �Moderate homoplasy�: – morphology: 24% homoplastic, no borrowing

– lexical: 13% homoplastic, 7% borrowing

• �Low homoplasy�: – morphology: no borrowing, no homoplasy;

– lexical: 1% homoplastic, 6% borrowing

Barbanson et al. Diachronica 2013

• Methods tested with respect to error in the estimated tree (compared to the “true tree”, which underlies the phylogenetic network)

• Methods: – NJ and UPGMA (distance-based)– Gray and Atkinson (G&A)–Maximum Parsimony (MP), Weighted Maximum

Parsimony (WMP), and Weighted Maximum Compatibility (WMC)

Simulation study – sample of results

Figure 7: Impact of the number of contact edges on phylogenetic reconstruction methods for 300 lexical characters and 60 morphological characters, under two levels of homoplasy (moderate on the left and low on the right). All datasets evolve under a moderate deviation from a lexical clock (dlc = 0.3) and moderate deviation from the rates-across-sites assumption (het = 1.2).

Figure 8: Impact of the deviation from the rates-across-sites assumption on phylogenetic reconstruction methods, for 300 lexical characters and 60 morphological characters, under two levels of homoplasy (moderate on the left and low on the right). All characters evolve down a phylogenetic network with three contact edges under a moderate deviation from a lexical clock (dlc = 0.3). We vary het, the parameter for deviating from the rates-across-sites assumption, from low (0.6) to moderate (1.8).

4.9 Summary

Our study showed the following:

• There was a consistent pattern of relative accuracy of phylogenies reconstructed using these methods, with UPGMA worst, followed by neighbor joining, then G&A, then MP. The relative performance of WMP and WMC depended upon the amount of homoplasy in the high weight characters, and so was excellent (comparable to that of MP) for the low homoplasy conditions and poor for the moderate homoplasy conditions.

• Deviating from the lexical clock made all methods somewhat worse, but had the

biggest impact on UPGMA.

19

Figure 7: Impact of the number of contact edges on phylogenetic reconstruction methods for 300 lexical characters and 60 morphological characters, under two levels of homoplasy (moderate on the left and low on the right). All datasets evolve under a moderate deviation from a lexical clock (dlc = 0.3) and moderate deviation from the rates-across-sites assumption (het = 1.2).

Figure 8: Impact of the deviation from the rates-across-sites assumption on phylogenetic reconstruction methods, for 300 lexical characters and 60 morphological characters, under two levels of homoplasy (moderate on the left and low on the right). All characters evolve down a phylogenetic network with three contact edges under a moderate deviation from a lexical clock (dlc = 0.3). We vary het, the parameter for deviating from the rates-across-sites assumption, from low (0.6) to moderate (1.8).

4.9 Summary

Our study showed the following:

• There was a consistent pattern of relative accuracy of phylogenies reconstructed using these methods, with UPGMA worst, followed by neighbor joining, then G&A, then MP. The relative performance of WMP and WMC depended upon the amount of homoplasy in the high weight characters, and so was excellent (comparable to that of MP) for the low homoplasy conditions and poor for the moderate homoplasy conditions.

• Deviating from the lexical clock made all methods somewhat worse, but had the

biggest impact on UPGMA.

19

Varying deviation from i.i.d. character evolutionVarying number of contact edges

Observations

1. Choice of data does matter (good idea to add morphological

characters, and to screen well).

2. Accuracy only slightly lessened with small increases in

homoplasy, borrowing, or deviation from the lexical clock. Some

amount of heterotachy (deviation from i.i.d.) improves accuracy.

3. Relative performance between methods consistently shows:

– Distance-based methods least accurate

– Gray and Atkinson’s method middle accuracy

– Parsimony and Compatibility methods most accurate

Markov Models in Molecular Evolution

• All standard models have a finite number of states (and hence lots of homoplasy)

• All sites evolve iid down the tree• No borrowing (horizontal gene transfer)

Markov Model in Linguistic Evolution

Warnow et al. model:• Infinite number of states (and restricted

homoplasy)• Heterogeneity of character evolution (not iid)• Borrowing between languages

“State of the Art” for Molecular Systematics

• The most favored methods are Maximum likelihood and Bayesian MCMC, under parametric models of sequence evolution that assume• iid evolution across sites• Finite number of states (4 for DNA, 20 for amino

acids, 2 for traits) – and so lots of homoplasy• No borrowing

Our main points

• Biomolecular data evolve differently from linguistic data, and linguistic models and methods should not be based upon biological models.

• Better (more accurate) phylogenies can be obtained by formulating models and methods based upon linguistic scholarship, and using good data.

• All methods, whether explicitly based upon statistical models or not, need to be carefully tested.

Future research

• We need more investigation of statistical methods based on good stochastic models, as these are now the methods of choice in biology.

• This requires realistic parametric models of linguistic evolution and method development under these parametric models!

Acknowledgements

• Financial Support: The David and Lucile Packard Foundation, The National Science Foundation, The Program for Evolutionary Dynamics at Harvard, and The Radcliffe Institute for Advanced Studies

• Collaborators: Don Ringe, Steve Evans, LuayNakhleh, and Francois Barbançon

• Please see http://tandy.cs.illinois.edu/histling.html

http://www.cs.utexas.edu/users/tandy/histling.html

Modelling issues• What are the units?• Polymorphism • Homoplasy• Non-treelike evolution• Non-i.i.d. evolution and violations of the rates-across-sites assumption

(heterotachy)• Deviation from the lexical clock (is dating even really possible?)

Note:• The statistical model of Warnow, Evans, Ringe, and Nakhleh has homoplasy and

reticulation, but no polymorphism – independent evolution, but not identical acrosscharacters

• The bag-of-words model of Nicholls and Gray (J R. Statist. Soc.B 2008) allows for polymorphism but no reticulation, and has homoplasy in the form of parallel back-muation, iid evolution across characters

• The model of Gray and Atkinson (binary-encoding) has unlimited polymorphism and homoplasy, but no reticulation, iid characters

Phylogeny estimation under a model of linguistic character ... · •Comparison of different...

Documents

Transcript of Phylogeny estimation under a model of linguistic character ... · •Comparison of different...