Automated reconstruction of ancient languages using probabilistic models of sound change
Alexandre Bouchard-Côté a,1, David Hall b, Thomas L. Griffiths c, and Dan Klein b
a Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; b Computer Science Division and c Department of Psychology, University of California, Berkeley, CA 94720
Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)
One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.
ancestral | computational | diachronic
Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, occupying historical linguists since the late 18th century. To solve this problem linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows to prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7).

In many cases, direct evidence of the form of protolanguages is not available. Fortunately, owing to the world’s considerable linguistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions.

In this paper, we present an automated system capable of large-scale reconstruction of protolanguages directly from words that appear in modern languages. This system is based on a probabilistic model of sound change at the level of phonemes, building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically.

The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially increase the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difficult to analyze a large number of languages simultaneously. The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable algorithms for reconstruction.

Being able to reconstruct large numbers of languages also makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specific hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis.
Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
See Commentary on page 4159.
1 To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.
4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11 www.pnas.org/cgi/doi/10.1073/pnas.1204678110
Model
We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages’ phylogenetic relationships. The tree’s internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes, using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over possible changes that a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike simple molecular InDel models used in computational biology such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language: words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, …, C}, let L(c) denote the subset of modern languages that have a word form in the cth cognate set. For each set c ∈ {1, …, C} and each language ℓ ∈ L(c), we denote the modern word form by w_{cℓ}. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion are also dependent on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

Table 1. Sample of reconstructions produced by the system
*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.

If cognate assignments are not available, our system can be applied just to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, and there are thus G groups of words. The model is then defined as in the previous case, with words indexed as w_{gℓ}. Second, the generation process is augmented with a notion of innovation, wherein a word w_{gℓ′} in some language ℓ′ may instead be generated independently from its parent word w_{gℓ}. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is “cut” at a language when innovation happens, and so the word begins anew. The probability of innovation in any given language is initially unknown and must be learned automatically along with the other branch-specific model parameters.
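The generative process described in this section (a root word, branch-specific edits from parent to child, and occasional innovation) can be illustrated with a small simulation. The sketch below is a deliberately simplified illustration and not the model used in the paper: the toy alphabet, the uniform root language model, and the context-free edit probabilities are all assumptions made for the example, and the names `BranchParams`, `root_word`, `mutate`, and `generate` are hypothetical.

```python
import random
from dataclasses import dataclass

ALPHABET = "aeioptkmns"  # toy phoneme inventory (assumption, not from the paper)


@dataclass
class BranchParams:
    """Edit probabilities for one branch. Context-free here, unlike the paper's model."""
    p_sub: float = 0.10       # probability of substituting a phoneme
    p_del: float = 0.05       # probability of deleting a phoneme
    p_ins: float = 0.05       # probability of inserting a phoneme after a position
    p_innovate: float = 0.02  # probability that the word is replaced by a new root string


def root_word(rng, max_len=6):
    """Draw a word from a toy root language model (uniform over short strings)."""
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(2, max_len)))


def mutate(word, params, rng):
    """Apply independent substitutions, deletions, and insertions to a parent word."""
    out = []
    for ch in word:
        r = rng.random()
        if r < params.p_del:
            pass                              # deletion: emit nothing
        elif r < params.p_del + params.p_sub:
            out.append(rng.choice(ALPHABET))  # substitution
        else:
            out.append(ch)                    # phoneme kept unchanged
        if rng.random() < params.p_ins:
            out.append(rng.choice(ALPHABET))  # insertion
    return "".join(out)


def generate(tree, params, rng, node="root", parent_word=None):
    """Generate one word per language, top-down.

    tree:   dict mapping a node to the list of its children
    params: dict mapping a non-root node to the BranchParams of the branch leading into it
    """
    if parent_word is None or rng.random() < params[node].p_innovate:
        word = root_word(rng)  # root, or innovation: the tree is "cut" here
    else:
        word = mutate(parent_word, params[node], rng)
    words = {node: word}
    for child in tree.get(node, []):
        words.update(generate(tree, params, rng, child, word))
    return words
```

For example, `generate({"root": ["A", "B"], "A": ["a1", "a2"]}, params, random.Random(0))` returns one word per node, where `params` supplies a `BranchParams` for each of `A`, `B`, `a1`, and `a2`. Inference in the paper runs in the opposite direction, from observed leaves to unobserved internal nodes; this sketch only shows the forward process.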
Results
Our results address three questions about the performance of our system. First, how well does it reconstruct protolanguages? Second, how well does it identify cognate sets? Finally, how can this approach be used to address outstanding questions in historical linguistics?

Protolanguage Reconstructions. To test our system, we applied it to a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue tree (29) (we also describe experiments with other trees in Fig. 1). For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at the time of download (August 7, 2010), including a few languages outside the Austronesian family and some manually reconstructed protolanguages used for evaluation. The total data comprised 142,661 word forms and 7,708 cognate sets. The goal was to reconstruct the word in each protolanguage that corresponded to each cognate set and to infer the patterns of sound changes along each branch in the phylogeny. See SI Appendix, Section 2 for further details of our simulations.

We used the Austronesian dataset to quantitatively evaluate the performance of our system by comparing withheld words from known languages with automatic reconstructions of those words. The Levenshtein distance between the held-out and reconstructed forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more languages helped reconstruction and also to assess the overall performance of our system. Specifically, we compared the system’s error rate on the ancestral reconstructions to a baseline and also to the amount of divergence between the reconstructions of two linguists (Fig. 1A). Given enough data, the system can achieve reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions perfectly agree with manual reconstructions, and only a few contain big errors. Refer to Table 1 for examples of reconstructions. See SI Appendix, Section 3 for the full lists.

We also present in Fig. 1B the effect of the tree topology on reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.

For comparison, we also evaluated previous automatic reconstruction methods. These previous methods do not scale to large datasets so we performed comparisons on smaller subsets of the Austronesian dataset. We show in SI Appendix, Section 2 that our method outperforms these baselines.

We analyze the output of our system in more depth in Fig. 2 A–C, which shows the system learned a variety of realistic sound changes across the Austronesian family (30). In Fig. 2D, we show the most frequent substitution errors in the Proto-Austronesian reconstruction experiments. See SI Appendix, Section 5 for details and similar plots for the most common incorrect insertions and deletions.

Cognate Recovery. Previous reconstruction systems (22) required that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at least some reconstruction be done by hand. To demonstrate that our model can accurately infer cognate sets automatically, we used a version of our system that learns which words are cognate, starting only from raw word lists and their meanings. This system uses a faster but lower-fidelity model of sound change to infer correspondences. We then ran our reconstruction system on cognate sets that our cognate recovery system found. See SI Appendix, Section 1 for details.

This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision (the fraction of cognate pairs identified by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of labeled cognate pairs identified by our system), and pairwise F1 measure (defined as the harmonic mean of precision and recall) for the cognates found by our system against the known cognates that are encoded in the ABVD. We also report cluster purity, which is the fraction of words that are in a cluster whose known cognate group matches the cognate group of the cluster. See SI Appendix, Section 2.3 for a detailed description of the metrics.

Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of 0.918. Thus, over 9 of 10 words are correctly grouped, and our system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in historical linguistics is to deem words to be unrelated unless proved otherwise, a slight undergrouping is the desired behavior.
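The pairwise metrics defined above can be computed directly from two assignments of words to groups. The following is a minimal sketch rather than the evaluation code used in the paper: the function names and the toy data are hypothetical, and cluster purity is implemented under the assumption that the cognate group of a predicted cluster is its most frequent labeled group (the paper defers the exact definition to SI Appendix, Section 2.3).

```python
from collections import Counter
from itertools import combinations


def same_group_pairs(labels):
    """Return the set of unordered word pairs that share a group.

    labels: dict mapping each word to its group identifier.
    """
    groups = {}
    for word, group in labels.items():
        groups.setdefault(group, []).append(word)
    return {frozenset(pair)
            for members in groups.values()
            for pair in combinations(members, 2)}


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall, and F1 of predicted groups against labeled ones."""
    pred_pairs = same_group_pairs(predicted)
    gold_pairs = same_group_pairs(gold)
    hits = len(pred_pairs & gold_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


def cluster_purity(predicted, gold):
    """Fraction of words whose labeled group is the majority group of their cluster.

    Assumption: a cluster's cognate group is its most frequent labeled group.
    """
    clusters = {}
    for word, cluster in predicted.items():
        clusters.setdefault(cluster, []).append(gold[word])
    majority = sum(max(Counter(groups).values()) for groups in clusters.values())
    return majority / len(predicted)


# Toy example with five words; the predicted clustering undergroups the first gold set.
gold = {"w1": "A", "w2": "A", "w3": "A", "w4": "B", "w5": "B"}
predicted = {"w1": 1, "w2": 1, "w3": 2, "w4": 2, "w5": 3}
precision, recall, f1 = pairwise_scores(predicted, gold)
purity = cluster_purity(predicted, gold)
```

As a consistency check on the reported figures, the harmonic mean of a precision of 0.844 and a recall of 0.621 is about 0.7155, which agrees with the reported F1 of 0.715 up to rounding of the inputs.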
[Fig. 1 graphics not reproduced. Panel A: reconstruction error rate for a random modern word, the automated reconstruction, and the agreement between two linguists. Panel B: "Effect of tree quality" (reconstruction error rate). Panel C: "Effect of tree size" (reconstruction error rate against number of modern languages). The figure also shows the language phylogeny with its leaf labels.]

Fig. 1. Quantitative validation of reconstructions and identification of some important factors influencing reconstruction quality. (A) Reconstruction error rates for a baseline (which consists of picking one modern word at random), our system, and the amount of disagreement between two linguists’ manual reconstructions. Reconstruction error rates are Levenshtein distances normalized by the mean word form length so that errors can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant. On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Reconstruction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled “Ethnologue” in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).
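The error measure used in Fig. 1 (Levenshtein distance normalized by mean word form length) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: the function names are hypothetical, each character is treated as one phoneme, and the normalizing length is taken to be the mean length of the reference forms, which the caption does not specify.

```python
def levenshtein(a, b):
    """Minimum number of single-symbol insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, 1):
        cur = [i]
        for j, sym_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # delete sym_a
                           cur[j - 1] + 1,                  # insert sym_b
                           prev[j - 1] + (sym_a != sym_b)))  # substitute or match
        prev = cur
    return prev[-1]


def error_rate(pairs):
    """Mean Levenshtein distance divided by mean reference length.

    pairs: non-empty list of (reconstruction, reference) pairs.
    Assumption: the normalizer is the mean length of the reference forms.
    """
    mean_distance = sum(levenshtein(rec, ref) for rec, ref in pairs) / len(pairs)
    mean_length = sum(len(ref) for _, ref in pairs) / len(pairs)
    return mean_distance / mean_length


# Toy forms (not taken from the paper): one exact match and one single substitution.
toy_rate = error_rate([("mata", "mata"), ("tolu", "telu")])
```

Because `levenshtein` works on any sequence, forms can also be passed as lists of phoneme tokens, which avoids miscounting IPA symbols that span several code points.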
Because we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words, using our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different when averaging per cognate group.) Compared with reconstruction from manually labeled cognate sets, automatically identified cognates led to an increase in error rate of only 12.8%, with a significant reduction in the cost of curating linguistic databases. See SI Appendix, Fig. S1 for the fraction of words with each Levenshtein distance for these reconstructions.
Functional Load. To demonstrate the utility of large-scale reconstruction of protolanguages, we used the output of our system to investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that the probability that a sound will change over time is related to the amount of information provided by a sound. Intuitively, if two phonemes appear only in words that are differentiated from one another by at least one other sound, then one can argue that no information is lost if those phonemes merge together, because no new ambiguous forms can be created by the merger.

A first step toward quantitatively testing the FLH was taken in 1967 (7). By defining a statistic that formalizes the amount of information lost when a language undergoes a certain sound change (on the basis of the proportion of words that are discriminated by each pair of phonemes), it became possible to evaluate the empirical support for the FLH. However, this initial investigation was based on just four languages and found little evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number of languages and sound changes considered, although they provided no positive counterevidence.

Using the output of our system, we collected sound change statistics from our reconstruction of 637 Austronesian languages, including the probability of a particular change as estimated by our system. These statistics provided the information needed to give a more comprehensive quantitative evaluation of the FLH, using a much larger sample than previous work (details in SI Appendix, Section 2.4). We show in Fig. 3 A and B that this analysis provides clear quantitative evidence in favor of the FLH. The revealed pattern would not be apparent had we not been able to reconstruct large numbers of protolanguages and supply probabilities of different kinds of change taking place for each pair of languages.
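The intuition behind functional load can be made concrete with a toy count of the ambiguous forms that a merger would create. The sketch below is not the statistic of ref. 7, nor the one defined in SI Appendix, Section 2.4; it is a simplified stand-in, and the function name and the toy lexicon are invented for illustration.

```python
def new_homophones(lexicon, a, b):
    """Count pairs of distinct words that become identical if phoneme b merges into a.

    lexicon: iterable of words, each a string (or sequence) of phoneme symbols.
    A count of zero corresponds to the case in which no information is
    lost by the merger.
    """
    merged = {}
    for word in set(lexicon):
        key = tuple(a if ph == b else ph for ph in word)
        merged.setdefault(key, []).append(word)
    # Each collision class of size n contributes n * (n - 1) / 2 ambiguous pairs.
    return sum(len(ws) * (len(ws) - 1) // 2 for ws in merged.values())
```

For the toy lexicon `["pata", "bata", "pilu", "buka"]`, merging /b/ into /p/ collapses "pata" and "bata", so that contrast carries some load, whereas merging /k/ into /t/ creates no new ambiguity.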
Discussion
We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages.

In developing an automated system for reconstructing ancient languages, it is by no means our goal to replace the careful reconstructions performed by linguists. It should be emphasized that the reconstruction mechanism used by our system ignores many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer formalism, but other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by a language and the lack of morphological analysis. Challenges specific to the cognate inference task, for example difficulties with polymorphisms, are also discussed in more detail in SI Appendix. Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption violated by borrowing, dialect variation, and creole languages. However, we believe our system will be useful to linguists in several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using the system to propose short lists of potential sound changes and correspondences across highly divergent word forms.

An exciting possible application of this work is to use the model described here to infer the phylogenetic relationships between languages jointly with reconstructions and cognate sets. This will remove a source of circularity present in most previous computational work in historical linguistics. Systems for inferring phylogenies such as ref. 13 generally assume that cognate sets are given as a fixed input, but cognacy as determined by linguists is in turn motivated by phylogenetic considerations. The phylogenetic tree hypothesized by the linguist is therefore affecting the tree built by systems using only these cognates. This problem can be avoided by inferring cognates at the same time as a phylogeny, something that should be possible using an extended version of our probabilistic model.

Our system is able to reconstruct the words that appear in ancient languages because it represents words as sequences of sounds and uses a rich probabilistic model of sound change. This is an important step forward from previous work applying computational ideas to historical linguistics. By leveraging the full sequence information available in the word forms in modern languages, we hope to see in historical linguistics a breakthrough similar to the advances in evolutionary biology prompted by the transition from morphological characters to molecular sequences in phylogenetic analysis.

[Figure 2 image omitted. Panels: (A) phylogenetic tree of the Austronesian languages rooted at Proto-Austronesian, with numbered branches and quadrants enlarged in SI Appendix, Figs. S2–S5; (B) sound change arcs numbered 1–12 drawn over an IPA chart; (C) detail of the Oceanic subtree with families i and ii marked; (D) chart with axis "Fraction of substitution errors".]

Fig. 2. Analysis of the output of our system in more depth. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system in SI Appendix, Section 4. (B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible. Several attested sound changes such as debuccalization to Maori and place of articulation change /t/ > /k/ to Hawaiian (30) are successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error (p, v) could arise over a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are common sources of disagreement.

Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227
COMPUTER SCIENCES | PSYCHOLOGICAL AND COGNITIVE SCIENCES | SEE COMMENTARY
Materials and Methods
This section provides a more detailed specification of our probabilistic model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.
Distributions. The conditional distributions over pairs of evolving strings are specified using a lexicalized stochastic string transducer (33).
Consider a language $\ell'$ evolving to $\ell$ for cognate set $c$. Assume we have a word form $x = w_{c\ell'}$. The generative process for producing $y = w_{c\ell}$ works as follows. First, we consider $x$ to be composed of characters $x_1 x_2 \ldots x_n$, with the first and last ones being a special boundary symbol $x_1 = \# \in \Sigma$, which is never deleted, mutated, or created. The process generates $y = y_1 y_2 \ldots y_n$ in $n$ chunks $y_i \in \Sigma^*$, $i \in \{1, \ldots, n\}$, one for each $x_i$. The $y_i$s may be a single character, multiple characters, or even empty.

To generate $y_i$, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty $y_i$. First, we decide whether the current phoneme in the top word $t = x_i$ will be deleted, in which case $y_i = \epsilon$ (the probabilities of the decisions taken in this process depend on a context to be specified shortly). If $t$ is not deleted, we choose a single substitution character in the bottom word. We write $\mathcal{S} = \Sigma \cup \{\zeta\}$ for this set of outcomes, where $\zeta$ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character $p$ of $y_{i-1}$) and the current character in the previous generation string ($t$), providing a way to make changes context sensitive. This multinomial decision acts as the initial distribution of the mutation Markov chain.

We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over $\mathcal{S}$, where this time the special outcome $\zeta$ corresponds to stopping insertions, and the other elements of $\mathcal{S}$ correspond to symbols that are appended to $y_i$. In this case, the conditioning environment is $t = x_i$ and the current rightmost symbol $p$ in $y_i$. Insertions continue until $\zeta$ is selected. We use $\theta_{S,t,p,\ell}$ and $\theta_{I,t,p,\ell}$ to denote the probabilities over the substitution and insertion decisions in the current branch $\ell' \to \ell$.

A similar process generates the word at the root $\ell$ of a tree or when an innovation happens at some language $\ell$, treating this word as a single string $y_1$ generated from a dummy ancestor $t = x_1$. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with $\theta_{R,t,p,\ell}$. There is no actual dependence on $t$ at the root or innovative languages, but this formulation allows us to unify the parameterization, with each $\theta_{\omega,t,p,\ell} \in \mathbb{R}^{|\Sigma|+1}$, where $\omega \in \{R, S, I\}$. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random variable $n_{g\ell}$ for each language in the tree. When known cognate groups are assumed, $n_{c\ell}$ is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters $\nu_\ell$.
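The generative process on a single branch can be sketched as a short sampler. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `DELETE`, `STOP`, `draw`, and `sample_child` are invented here, the probability tables are passed in as plain functions standing in for the substitution and insertion distributions, and the context symbol `p` is simply taken to be the rightmost symbol generated so far (or `None` at the very start).

```python
import random

DELETE = "<del>"   # plays the role of zeta in the substitution step
STOP = "<stop>"    # plays the role of zeta in the insertion step


def draw(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_child(parent, sub_probs, ins_probs, rng):
    """Sample a child word from a parent word along one branch.

    parent: list of symbols, starting and ending with the boundary '#'.
    sub_probs(t, p): distribution over symbols plus DELETE.
    ins_probs(t, p): distribution over symbols plus STOP.
    The tables are responsible for never deleting, mutating, or
    inserting the boundary symbol.
    """
    child = []
    for t in parent:
        p = child[-1] if child else None
        outcome = draw(sub_probs(t, p), rng)
        if outcome == DELETE:
            continue                    # this chunk is empty; no insertions follow
        child.append(outcome)
        while True:                     # insertions continue until STOP is drawn
            inserted = draw(ins_probs(t, child[-1]), rng)
            if inserted == STOP:
                break
            child.append(inserted)
    return child
```

With a substitution table that always copies `t` and an insertion table that always returns `STOP`, the sampler reproduces the parent word exactly, which is a convenient sanity check before plugging in learned probabilities.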
Mutation distributions confined in the family of transducers miss certain phylogenetic phenomena. For example, the process of reduplication (as in "bye-bye", for example) is a well-studied mechanism to derive morphological and lexical forms that is not explicitly captured by transducers. The same situation arises in metatheses (e.g., Old English frist > English first). However, these changes are generally not regular and therefore less informative (1). Moreover, because we are using a probabilistic framework, these events can still be handled in our system, even though their costs will simply not be as discounted as they should be.

Note also that the generative process described in this section does not allow explicit dependencies to the next character in $\ell$. Relaxing this assumption can be done in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the transducer normalization computation) (34). A simpler approach is to use the next character in the parent $\ell'$ as a surrogate for the next character in $\ell$. Using the context in the parent word is also more aligned to the standard representation of sound change used in historical linguistics, where the context is defined on the parent as well.

More generally, dependencies limited to a bounded context on the parent string can be incorporated in our formalism. By bounded, we mean that it should be possible to fix an integer $k$ beforehand such that all of the modeled dependencies are within $k$ characters to the string operation. The caveat is that the computational cost of inference grows exponentially in $k$. We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).
[Figure 3 image omitted. Panels A and B are heat maps with horizontal axis "Functional load" (0 to 1×10⁻⁴) and vertical axis "Merger posterior" (0 to 1); the color scale runs from 0 to 10⁶.]

Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional load on the probability of merging two sounds. The plots shown are heat maps where the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit square, (l, m), as explained in Materials and Methods. To convey the amount of noise one could expect from a study with the number of languages that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages in B. Only in this latter setup is structure clearly visible: Most of the points with high probability of merging can be seen to have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.

Parameterization. Instead of directly estimating the transition probabilities of the mutation Markov chain (which could be done, in principle, by taking them to be the parameters of a collection of multinomial distributions), we express them as the output of a multinomial logistic regression model (36). This model specifies a distribution over transition probabilities by assigning weights to a set of features that describe properties of the sound changes involved. These features provide a more coherent representation of the transition probabilities, capturing regularities in sound changes that reflect the underlying linguistic structure.
We used the following feature templates: OPERATION, which identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form $x > y$, $x = y$), or the end of an insertion event; MARKEDNESS, which consists of language-specific n-gram indicator functions for all symbols in $\Sigma$ (during reconstruction, only unigram and bigram features are used for computational reasons; for cognate inference, only unigram features are used); FAITHFULNESS, which consists of indicators for mutation events of the form $\mathbf{1}[x > y]$, where $x \in \Sigma$, $y \in \mathcal{S}$. Feature templates similar to these can be found, for instance, in the work of refs. 37 and 38, in the context of string-to-string transduction models used in computational linguistics. This approach to specifying the transition probabilities produces an interesting connection to stochastic optimality theory (39, 40), where a logistic regression model mediates markedness and faithfulness of the production of an output form from an underlying input form.
Data sparsity is a significant challenge in protolanguage reconstruction. Although the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed data also brings with it additional unknowns in the form of intermediate protolanguages. Because there is one set of parameters for each language, adding more data is not sufficient to increase the quality of the reconstruction; it is important to share parameters across different branches in the tree to benefit from having observations from more languages. We used the following technique to address this problem: We augment the parameterization to include the current language (or language at the bottom of the current branch) and use a single, global weight vector instead of a set of branch-specific weights. Generalization across branches is then achieved by using features that ignore $\ell$, whereas branch-specific features depend on $\ell$. Similarly, all of the features in OPERATION, MARKEDNESS, and FAITHFULNESS have universal and branch-specific versions.
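One way to picture the feature templates and the parameter sharing is as a function that returns, for each event, the set of active indicator features in both a universal and a branch-specific version. The sketch below is illustrative only: the tuple encoding, the `"*"` marker for universal features, and the names `event_features`, `DELETE`, and `STOP` are assumptions, not the representation used by the authors.

```python
DELETE, STOP = "<del>", "<stop>"


def event_features(op, t, p, lang, xi):
    """Active indicator features for one event.

    op:   "S" (substitution step), "I" (insertion step) or "R" (root)
    t:    current symbol in the parent word
    p:    previously generated symbol in the child word
    lang: language at the bottom of the current branch
    xi:   outcome (a symbol, DELETE, or STOP)
    """
    if op == "S":
        if xi == DELETE:
            kind = "deletion"
        elif xi == t:
            kind = "self-substitution"
        else:
            kind = "substitution"
    else:
        kind = "end-of-insertion" if xi == STOP else "insertion"

    base = [("OPERATION", kind)]
    if xi not in (DELETE, STOP):
        base.append(("MARKEDNESS", xi))       # unigram indicator
        base.append(("MARKEDNESS", p, xi))    # bigram indicator
    if op == "S":
        base.append(("FAITHFULNESS", t, xi))  # indicator of t > xi

    universal = {("*",) + f for f in base}    # shared across all branches
    specific = {(lang,) + f for f in base}    # tied to this branch only
    return universal | specific
```

Because a single global weight vector is indexed by these features, evidence for a change such as /t/ > /k/ on one branch raises the weight of the universal feature and thereby informs every other branch, while the branch-specific copy lets an individual language deviate from the general trend.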
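Using these features and parameter sharing, the logistic regression model defines the transition probabilities of the mutation process and the root language model to be

$$\theta_{\omega,t,p,\ell} = \theta_{\omega,t,p,\ell}(\xi; \lambda) = \frac{\exp\{\langle \lambda, f(\omega, t, p, \ell, \xi) \rangle\}}{Z(\omega, t, p, \ell, \lambda)} \times \mu(\omega, t, \xi), \tag{1}$$

where $\xi \in \mathcal{S}$, $f : \{S, I, R\} \times \Sigma \times \Sigma \times L \times \mathcal{S} \to \mathbb{R}^k$ is the feature function (which indicates which features apply for each event), $\langle \cdot, \cdot \rangle$ denotes inner product, and $\lambda \in \mathbb{R}^k$ is a weight vector. Here, $k$ is the dimensionality of the feature space of the logistic regression model. In the terminology of exponential families, $Z$ and $\mu$ are the normalization function and the reference measure, respectively:

$$Z(\omega, t, p, \ell, \lambda) = \sum_{\xi' \in \mathcal{S}} \exp\{\langle \lambda, f(\omega, t, p, \ell, \xi') \rangle\}$$

$$\mu(\omega, t, \xi) = \begin{cases} 0 & \text{if } \omega = S,\ t = \#,\ \xi \neq \# \\ 0 & \text{if } \omega = R,\ \xi = \zeta \\ 0 & \text{if } \omega \neq R,\ \xi = \# \\ 1 & \text{otherwise.} \end{cases}$$

Here, $\mu$ is used to handle boundary conditions, ensuring that the resulting probability distribution is well defined.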
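Eq. 1 can be transcribed almost literally for one fixed context. The sketch below is an illustration under assumptions: the name `transition_probs`, the dictionary-based encoding of the weight vector and of the features, and the choice to pass the reference measure in as a precomputed 0/1 table are all made here for brevity. It follows the formula as printed, so the normalizer sums over every outcome and the reference measure is applied afterward.

```python
import math


def transition_probs(weights, feature_sets, mu):
    """Evaluate Eq. 1 for every outcome xi in one context (omega, t, p, lang).

    weights:      dict feature -> weight (the vector lambda; missing = 0)
    feature_sets: dict outcome -> iterable of active indicator features
    mu:           dict outcome -> 0 or 1 (the reference measure)
    """
    # Inner product <lambda, f(...)> for indicator features.
    scores = {xi: sum(weights.get(f, 0.0) for f in feats)
              for xi, feats in feature_sets.items()}
    # Subtracting the maximum leaves exp(score) / Z unchanged but avoids overflow.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {xi: math.exp(s - top) / z * mu[xi] for xi, s in scores.items()}
```

When every entry of `mu` is 1 the result is an ordinary softmax over the outcomes. Where `mu` is 0 the corresponding outcome receives probability exactly zero, which is how boundary events are ruled out regardless of the learned weights.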
During cognate inference, the innovation Bernoulli random variables $\nu_{g\ell}$ are similarly parameterized, using a logistic regression model with two kinds of features: a global innovation feature $\kappa_{\text{global}} \in \mathbb{R}$ and a language-specific feature $\kappa_\ell \in \mathbb{R}$. The likelihood function for each $\nu_{g\ell}$ then takes the form

$$\nu_{g\ell} = \frac{1}{1 + \exp(-\kappa_{\text{global}} - \kappa_\ell)}. \tag{2}$$
ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from the National Science Foundation.
1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic sound changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp 111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84:710–759.
21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia).
22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist 7:233–244.
23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto).
24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc Comput Linguist 45:15–22.
25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124.
26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions with finite-state methods. Empirical Methods on Natural Language Processing 13:1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation Within Optimality Theory, eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm), pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst Linguist Acad Sinica 1:31–94.
Bouchard-Côté et al. PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229
Supporting Information: Automated reconstruction of ancient languages using probabilistic models of sound change
Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein
This Supporting Information describes the learning and inference algorithms used by our system (Section 1), details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2), lists of reconstructions (Section 3), lists of sound changes (Section 4), and analysis of frequent errors (Section 5).
1 Learning and inference
The generative model introduced in the Materials and Methods section of the main paper sets us up with two problems to solve: estimating the values of the parameters characterizing the distribution on sound changes on each branch of the tree, and inferring the optimal values of the strings representing words in the unobserved protolanguages. Section 1.1 introduces the full objective function that we need to optimize in order to estimate these quantities. Section 1.2 describes the Monte Carlo Expectation-Maximization algorithm we used for solving the learning problem. We present the algorithm for inferring ancestral word forms in Section 1.4.
1.1 Full objective function
The generative model specified in the Materials and Methods section of the main paper defines an objective function that we can optimize in order to find good protolanguage reconstructions. This objective function takes the form of a regularized log-likelihood, combining the probability of the observed languages with additional constraints intended to deal with data sparsity. This objective function can be written concisely if we let $P_\lambda(\cdot)$ and $P_\lambda(\cdot \mid \cdot)$ denote the root and branch probability models described in the Materials and Methods section of the paper (with transition probabilities given by the above logistic regression model), $I(c)$ the set of internal (non-leaf) nodes in $\tau(c)$, $\mathrm{pa}(\ell)$ the parent of language $\ell$, $r(c)$ the root of $\tau(c)$, and $W(c) = (\Sigma^*)^{|I(c)|}$. The full objective function is then
\[
\mathrm{Li}(\lambda, \kappa) = \sum_{c=1}^{C} \log \sum_{\vec{w} \in W(c)} P_\lambda(w_{c,r(c)}) \prod_{\ell \in I(c)} P_\lambda(w_{c,\ell} \mid w_{c,\mathrm{pa}(\ell)}, n_{c\ell}) \, P_\kappa(n_{c\ell} \mid \nu_{c\ell}) - \frac{\|\lambda\|_2^2 + \|\kappa\|_2^2}{2\sigma^2} \qquad (1)
\]
where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity (we used $\sigma^2 = 1$) [1]. The goal of learning is to find the value of $\lambda$, the parameters of the logistic regression model for the transition probabilities, that maximizes this function.
1.2 A Monte Carlo Expectation-Maximization algorithm for reconstruction
Optimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm [2]. This algorithm breaks down into two steps, an E step in which the objective function is approximated and an M step in which this approximate objective function is optimized. The M step is convex and computed using L-BFGS [3], but the E step is intractable [4], in part because it requires solving the problem of inferring the words in the protolanguages. We approximate the solution to this inference problem using a Markov chain Monte Carlo (MCMC) algorithm [5]. This algorithm repeatedly samples words from the protolanguages until it converges on the distribution implied by our generative model. Since this procedure is guaranteed to find only a local maximum of the objective, we ran it with several different random initializations of the model parameters. The next two subsections provide the details of these two parts of our system.
1.2.1 E step: Inferring the posterior over string reconstructions
In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage given observed word forms at the leaves of the tree.¹ The typical approach in biological InDel models [6] is to use Gibbs sampling, where the entire string at each node in the tree is repeatedly resampled, conditioned on its parent and children. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this approach suffers from mixing problems in large trees, since it can take a long time for information to propagate from one region of the tree to another [6]. Consequently, we use a different MCMC procedure, called Ancestry Resampling (AR), that alleviates these mixing problems. This method was originally introduced for biological applications [7], but commonalities between the biological and linguistic cases make it possible to use it in our model.
Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case, it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed, samples at the root will initially be independent of the observations. AR addresses this problem by resampling one thin vertical slice of all sequences at a time, called an ancestry (for the precise definition of the algorithm, see [7]). Slices condition on observed data, avoiding mixing problems, and can propagate information rapidly across the tree. We ran the ancestry resampling algorithm for a number of iterations that increased linearly with the number of iterations of the EM algorithm that had been completed, resulting in an approximation regime that could allow the EM algorithm to converge to a solution [8]. To speed up the large experiments, we also used an approximation in AR. This approximation is based on fixing the value of set-valued auxiliary variables $z_g$, where $z_g = \{w_{g,\ell}\}$ and $\ell$ ranges over the set of all languages (both internal and at the leaves). Conditioning on these variables, sampling $w_{g,\ell} \mid z_g$ can be done exactly using dynamic programming and rejection sampling.
1.2.2 M step: Convex optimization of the approximate objective
In the M step, we individually update the parameters $\lambda$ and $\kappa$ as specified in Equation 1. We show how $\lambda$ is updated in this section; $\kappa$ can be optimized similarly. Let $C = (\omega, t, p, \ell)$ denote local transducer contexts from the space $\mathcal{C} = \{S, I, R\} \times \Sigma \times \Sigma \times L$ of all such contexts. Let $N(C, \xi)$ be the expected number of times the transition $\xi$ was used in context $C$ in the preceding E step. Given these sufficient statistics, the estimate of $\lambda$ is given by optimizing the expected complete (regularized) log-likelihood $O(\lambda)$ derived from the original objective function given in Equation [1] in the Materials and Methods section of the main paper (ignoring terms that do not involve $\lambda$),
\[
O(\lambda) = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \Big[ \langle \lambda, f(C, \xi) \rangle - \log \sum_{\xi'} \exp\{ \langle \lambda, f(C, \xi') \rangle \} \Big] - \frac{\|\lambda\|^2}{2\sigma^2}.
\]
¹ To be precise: the posterior is over both protolanguage strings and the derivations between these strings and the modern words.
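We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives
\[
\begin{aligned}
\frac{\partial O(\lambda)}{\partial \lambda_j} &= \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \Big[ f_j(C, \xi) - \sum_{\xi'} \theta_C(\xi'; \lambda) f_j(C, \xi') \Big] - \frac{\lambda_j}{\sigma^2} \\
&= F_j - \sum_{C \in \mathcal{C}} \sum_{\xi' \in S} N(C, \cdot) \, \theta_C(\xi'; \lambda) f_j(C, \xi') - \frac{\lambda_j}{\sigma^2},
\end{aligned}
\]
where $F_j = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) f_j(C, \xi)$ is the empirical feature vector, $N(C, \cdot) = \sum_{\xi} N(C, \xi)$ is the number of times context $C$ was used, and $\theta_C(\xi; \lambda) = \exp\{\langle \lambda, f(C, \xi) \rangle\} / \sum_{\xi'} \exp\{\langle \lambda, f(C, \xi') \rangle\}$ is the transition probability of the logistic regression model. $F_j$ and $N(C, \cdot)$ do not depend on $\lambda$ and thus can be precomputed at the beginning of the M step, thereby speeding up each L-BFGS iteration.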
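The objective and gradient of the M step can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the names `counts`, `features`, and the toy context `"ctx"` are hypothetical, contexts and outcomes are plain dictionary keys, and no optimizer is called (the returned value and gradient are what an L-BFGS routine would consume).

```python
import math

def transition_probs(lam, feats):
    """theta_C(xi; lambda) for one context; feats maps outcome -> feature vector."""
    scores = {xi: sum(l * f for l, f in zip(lam, fv)) for xi, fv in feats.items()}
    top = max(scores.values())                      # subtract max for stability
    exps = {xi: math.exp(s - top) for xi, s in scores.items()}
    z = sum(exps.values())
    return {xi: e / z for xi, e in exps.items()}

def objective_and_gradient(lam, counts, features, sigma2=1.0):
    """counts[C][xi] = N(C, xi); features[C][xi] = f(C, xi) (hypothetical layout)."""
    dim = len(lam)
    # F_j and N(C, .) do not depend on lambda, so they could be cached.
    F = [0.0] * dim
    totals = {}
    for C, row in counts.items():
        totals[C] = sum(row.values())
        for xi, n in row.items():
            for j, fj in enumerate(features[C][xi]):
                F[j] += n * fj
    value = -sum(l * l for l in lam) / (2 * sigma2)
    grad = [F[j] - lam[j] / sigma2 for j in range(dim)]
    for C, row in counts.items():
        theta = transition_probs(lam, features[C])
        for xi, n in row.items():
            value += n * math.log(theta[xi])        # <lam, f> - log sum exp
        for xi, p in theta.items():
            for j, fj in enumerate(features[C][xi]):
                grad[j] -= totals[C] * p * fj
    return value, grad

# Toy statistics: one context with two outcomes and indicator features.
counts = {"ctx": {"keep": 3.0, "change": 1.0}}
features = {"ctx": {"keep": [1.0, 0.0], "change": [0.0, 1.0]}}
value, grad = objective_and_gradient([0.0, 0.0], counts, features)
```

Since the function is maximized, a minimizer would be handed the negated value and gradient.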
1.3 Approximate Expectation-Maximization for cognate inference
The procedure for cognate inference is similar: we again operate within the Expectation-Maximization framework, and the M steps are identical. However, because of different characteristics of the cognate inference problem, we make a few different choices for approximations for the E step. The Monte Carlo inference algorithm described in the preceding section works exceedingly well for reconstructing words for known cognates. However, a Monte Carlo approach to determining which words are cognate requires resampling the innovation variables $n_{g\ell}$, which is likely to lead to slow mixing of the Markov chain.
We therefore take a different approach when doing cognate inference. Here, we restrict our system to not perform inference over all possible reconstructions for all words, but to only use words that correspond to some observed modern word with the same meaning. The simplification is of course false, but it works well in practice. Specifically, we perform inference on the tree for each gloss using message passing [9], also known as pruning in the computational biology literature [10], where each message $\mu(w_{g\ell})$ has a non-zero score only when $w_{g\ell}$ is one of the observed modern word forms in gloss $g$. The result of inference yields expected alignment counts for each character in each language along with the expected number of innovations that occur at each language. These expectations can then be used in the convex M step.
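The restricted message passing described above can be illustrated with a small upward (pruning) pass. This is a sketch under loud assumptions: the branch model `transition` (probability proportional to exp(-Levenshtein distance)) is a hypothetical stand-in for the learned transducer, the tree and word forms are toy data, and the expected alignment and innovation counts are not computed here.

```python
import math

def levenshtein(a, b):
    """Uniform-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def transition(parent, child, candidates):
    """Hypothetical branch model: P(child | parent) proportional to exp(-Lev)."""
    weights = {w: math.exp(-levenshtein(parent, w)) for w in candidates}
    return weights[child] / sum(weights.values())

def upward_message(node, children, observed, candidates):
    """mu_node(w): probability of the observed leaves below node, given form w."""
    if node in observed:  # leaf: indicator on the observed form
        return {w: 1.0 if w == observed[node] else 0.0 for w in candidates}
    msg = {w: 1.0 for w in candidates}
    for child in children[node]:
        child_msg = upward_message(child, children, observed, candidates)
        for w in candidates:
            msg[w] *= sum(transition(w, v, candidates) * child_msg[v]
                          for v in candidates)
    return msg

# Toy tree for one gloss; internal forms are restricted to observed forms.
children = {"root": ["A", "inner"], "inner": ["B", "C"]}
observed = {"A": "tina", "B": "tina", "C": "qina"}
candidates = sorted(set(observed.values()))
root_msg = upward_message("root", children, observed, candidates)
posterior = {w: m / sum(root_msg.values()) for w, m in root_msg.items()}
```

With a uniform prior at the root, `posterior` favors the form attested in two of the three leaves.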
1.4 Ancestral word form reconstruction
In the E step described in the preceding section, a posterior distribution $\pi$ over ancestral sequences given observed forms is approximated by a collection of samples $X_1, X_2, \ldots, X_S$. In this section, we describe how this distribution is summarized to produce a single output string for each cognate set.
This algorithm is based on a fundamental Bayesian decision theoretic concept: Bayes estimators. Given a loss function over strings $\mathrm{Loss} : \Sigma^* \times \Sigma^* \to [0, \infty)$, an estimator is a Bayes estimator if it belongs to the random set:
\[
\operatorname*{argmin}_{x \in \Sigma^*} \mathbb{E}_\pi \, \mathrm{Loss}(x, X) = \operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \mathrm{Loss}(x, y) \, \pi(y).
\]
Bayes estimators are not only optimal within the Bayesian decision framework, but also satisfy frequentist optimality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein [12] distance, denoted $\mathrm{Loss}(x, y) = \mathrm{Lev}(x, y)$ (we discuss this choice in more detail in Section 2).
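Since we do not have access to $\pi$, but rather to an approximation based on $S$ samples, the objective function we use for reconstruction rewrites as follows:
\[
\begin{aligned}
\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \mathrm{Lev}(x, y) \, \pi(y) &\approx \operatorname*{argmin}_{x \in \Sigma^*} \frac{1}{S} \sum_{s=1}^{S} \mathrm{Lev}(x, X_s) \\
&= \operatorname*{argmin}_{x \in \Sigma^*} \sum_{s=1}^{S} \mathrm{Lev}(x, X_s).
\end{aligned}
\]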
The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested in reconstructing words in a single protolanguage. This is addressed by marginalization, which is done in sampling representations by simply discarding the irrelevant information. Hence, the random variables $X_s$ in the above equation can be viewed as being string-valued random variables.
Note that the optimum is not changed if we restrict the minimization to be taken on $x \in \Sigma^*$ such that $m \le |x| \le M$, where $m = \min_s |X_s|$ and $M = \max_s |X_s|$. However, even with this simplification, optimization is intractable. As an approximation, we considered only strings built by at most $k$ contiguous substrings taken from the word forms in $X_1, X_2, \ldots, X_S$. If $k = 1$, then it is equivalent to taking the min over $\{X_s : 1 \le s \le S\}$. At the other end of the spectrum, if $k = S$, it is exact. This scheme is exponential in $k$, but since words are relatively short, we found that $k = 2$ often finds the same solution as higher values of $k$.
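The approximate Bayes estimator can be sketched as a brute-force search over candidate strings. The code below is one possible reading of the candidate set, not the implementation used in the experiments: candidates are concatenations of at most k contiguous substrings of the samples, filtered to lengths between m and M. The function names and the toy samples are hypothetical, and ties are broken alphabetically.

```python
def levenshtein(a, b):
    """Uniform-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidate_strings(samples, k):
    """Strings built from at most k contiguous substrings of the samples."""
    lo, hi = min(map(len, samples)), max(map(len, samples))
    pieces = {s[i:j] for s in samples
              for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    built, layer = set(), {""}
    for _ in range(k):
        layer = {p + q for p in layer for q in pieces if len(p) + len(q) <= hi}
        built |= layer
    return {c for c in built if lo <= len(c) <= hi}

def consensus(samples, k=2):
    """Candidate minimizing the summed Levenshtein distance to the samples."""
    return min(sorted(candidate_strings(samples, k)),
               key=lambda x: sum(levenshtein(x, s) for s in samples))

# Toy samples: no single sample is optimal, but a two-piece string is.
samples = ["xbc", "axc", "abx"]
best_k1 = consensus(samples, k=1)   # one of the samples (total distance 4)
best_k2 = consensus(samples, k=2)   # "abc" (total distance 3)
```

The search is exponential in k, which is why small values such as k = 2 are the practical regime.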
1.5 Finding Cognate Groups
Our cognate model finds cuts to the phylogeny in order to determine which words are cognate with one another. However, this approach cannot straightforwardly handle instances where the evolution of the words did not follow strictly treelike behavior. This limitation applies to polymorphisms (where multiple words for a given meaning are available in a language) and also to borrowing (where the words did not evolve according to the phylogeny).
However, we can modify the inference procedure to capture these kinds of behaviors using a post hoc agglomerative “merge” procedure. Specifically, we can run our procedure to find an initial set of cognate groups, and then merge those cognate groups that produce an increase in model score. That is, we create several initial small subtrees containing some cognates, and then stitch them together into one or more larger trees. Thus, non-treelike behaviors like borrowing will be represented as multiple trees that “overlap.” For instance, if two languages each have two words for one meaning (say A and B), then the initial stage might find that the two A’s are cognate, leaving the two B’s as singleton cognate groups. However, merging these two words into a single group will likely produce a gain in likelihood, and so we can merge them. Note that this procedure is unlikely to work well for long distance borrowings, as the sound changes involved in long distance borrowing are likely to be very different from those according to the phylogeny. Nevertheless, we found this procedure to be effective in practice.
2 Experiments
In this section, we give more details on the results in the main paper, namely those concerning validation using reconstruction error rate, and those measuring the effect of the tree topology and the number of languages. We also include additional comparisons to other reconstruction methods, as well as cognate inference results. In Section 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages, the topology, and the number of iterations of EM. In Section 2.2 we compare performance to an oracle and to two other systems.
Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance) between the reconstruction produced by each method and the reconstruction produced by linguists. The Levenshtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transform one word to another. While the Levenshtein distance misses important aspects of phonology (all phoneme substitutions are not equal, for instance), it is parameter-free and still correlates to a large extent with linguistic quality of reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness of each reconstructed word. We averaged this distance across reconstructed words to report a single number for each method. The statistical significance of all performance differences is assessed using a paired t-test with significance level of 0.05.
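The evaluation protocol can be sketched as follows. The three word triples are taken from Table S.2 (reference, baseline, and automatic forms for "you", "dog", and "rat"); treating each character as one phoneme is a simplifying assumption, and the function names are hypothetical. Only the paired t statistic is computed; turning it into a p-value requires a t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev

def levenshtein(a, b):
    """Uniform-cost edit distance: substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_distances(reconstructions, references):
    """Per-word distances between a method's output and the reference."""
    return [levenshtein(r, g) for r, g in zip(reconstructions, references)]

def paired_t_statistic(dist_a, dist_b):
    """t statistic of the per-word differences (undefined if all are equal)."""
    diffs = [a - b for a, b in zip(dist_a, dist_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

references = ["ikamu", "wasu", "labaw"]     # 'Reference' column of Table S.2
baseline = ["kamo", "kahu", "lapo"]         # 'Baseline' column
automatic = ["kamu", "vatu", "kulavaw"]     # 'Automatic' column

d_base = edit_distances(baseline, references)    # [2, 2, 3]
d_auto = edit_distances(automatic, references)   # [1, 2, 3]
mean_auto = mean(d_auto)                         # average distance per word
t_stat = paired_t_statistic(d_base, d_auto)
```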
2.1 Evaluating system performance
We used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments used to evaluate the performance of our system and the factors relevant to its success. The database, downloaded from http://language.psy.auckland.ac.nz/austronesian/ on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,² as well as several reconstructed protolanguages.
In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in order to facilitate interpretability of the results (i.e. so that we can give well-known names to clades in the tree in our analyses). Our method does not require specifying branch lengths since our unsupervised learning procedure provides a more flexible way of estimating the amount of change between points of the phylogenetic tree.
The first claim we verified experimentally is that having more observed languages aids reconstruction of protolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) to speed up the computations. To test this hypothesis we added observed modern languages in increasing order of distance $d_c$ to the target reconstruction of POc so that the languages that are most useful for POc reconstruction are added first. This prevents the effects of adding a close language after several distant ones being confused with an improvement produced by increasing the number of languages.
The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirable for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.
We then conducted a number of experiments intended to identify the contribution made by the different factors the system incorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (in which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree, dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The results of these experiments are shown in Table S.1.
For comparison, we also included in the same table the performance of a semi-supervised system trained by K-fold validation. The system was run $K = 5$ times, with $1 - K^{-1}$ of the POc words given to the system as observations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions for many internal nodes are not available in the dataset, so they are still not filled.³
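The K-fold protocol can be sketched as below. With K = 5, each run observes a fraction 1 - 1/K = 0.8 of the POc words and is evaluated on the remaining 0.2. The round-robin fold assignment and the name `k_fold_runs` are assumptions made for illustration; the text does not specify how the folds were formed.

```python
def k_fold_runs(words, k=5):
    """Yield (observed, held_out) pairs; each word is held out in exactly one run."""
    for fold in range(k):
        held_out = [w for i, w in enumerate(words) if i % k == fold]
        observed = [w for i, w in enumerate(words) if i % k != fold]
        yield observed, held_out

# Toy stand-in for a list of POc cognate sets (hypothetical identifiers).
words = ["set%d" % i for i in range(20)]
runs = list(k_fold_runs(words, k=5))
observed_fraction = len(runs[0][0]) / len(words)   # 1 - 1/K
```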
2.2 Comparisons against other methods
The first competing method, PRAGUE, was introduced in [15]. In this method, the word forms in a given protolanguage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. The alignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs of observed languages are extracted from the alignment when their frequency is higher than a threshold, and a proto-phoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is taken, otherwise no reconstruction is proposed. This approach has several limitations. First, it is not tractable for larger trees, since the time complexity of their multi-alignment algorithm grows exponentially in the number of languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with only four daughter languages, a large fraction of the words could not be reconstructed.
² While most word forms in ABVD are encoded using IPA, there are a few exceptions, for example language-family-specific conventions such as the usage of the okina symbol (’) for glottal stops (ʔ) in some Polynesian languages, or ad hoc conventions such as using the bigram ‘ng’ for the agma symbol (ŋ). We have preprocessed as many of these exceptions as possible, and probabilistic methods are generally robust to reasonable amounts of encoding glitches.
³ We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided, but it did not perform as well; this is consistent with the -Topology experiment of Table S.1.
Since PRAGUE does not scale to large datasets, we also built a second, more tractable baseline. This new baseline system, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let $\mathrm{Lev}(x, y)$ denote the Levenshtein distance between word forms $x$ and $y$. Ideally, we would like the baseline system to return:
\[
\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in O} \mathrm{Lev}(x, y),
\]
where $O = \{y_1, \ldots, y_{|O|}\}$ is the set of observed word forms. This objective function is motivated by Bayesian decision theory [11], and is similar to the more sophisticated Bayes estimator described in Section 1.4. However, it replaces the samples obtained by MCMC sampling by the set of observed words. Similarly to the algorithm of Section 1.4, we also restrict the minimization to be taken on $x \in \Sigma(O)^*$ such that $m \le |x| \le M$ and to strings built by at most $k$ contiguous substrings taken from the word forms in $O$, where $m = \min_i |y_i|$, $M = \max_i |y_i|$, and $\Sigma(O)$ is the set of characters occurring in $O$. Again, we found that $k = 2$ often finds the same solution as higher values of $k$. The difference was in all the cases not statistically significant, so we report the approximation $k = 2$ in what follows.
We also compared against an oracle, denoted ORACLE, which returns
\[
\operatorname*{argmin}_{y \in O} \mathrm{Lev}(y, x^*),
\]
where $x^*$ is the target reconstruction. This is superior to picking a single closest language to be used for all word forms, but it is possible for systems to perform better than the oracle, since the oracle has to return one of the observed word forms. Of course, this scheme is only available to assess system performance on held-out data: it cannot make new predictions.
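The two reference points can be contrasted with a short sketch. The observed forms below are hypothetical, and `centroid_k1` is the k = 1 simplification of CENTROID (the best observed form, rather than a string assembled from up to k substrings), so it is only a stand-in for the baseline actually reported.

```python
def levenshtein(a, b):
    """Uniform-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def centroid_k1(observed):
    """Observed form minimizing the summed distance to all observed forms."""
    return min(sorted(observed),
               key=lambda x: sum(levenshtein(x, y) for y in observed))

def oracle(observed, target):
    """Observed form closest to the target; needs the held-out answer."""
    return min(sorted(observed), key=lambda y: levenshtein(y, target))

observed = ["kamo", "kamu", "ikamu"]        # hypothetical modern forms
target = "ikamu"                            # reference reconstruction
print(centroid_k1(observed))                # kamu
print(oracle(observed, target))             # ikamu
# When the target is not among the observed forms, the oracle cannot reach it:
print(oracle(["kamo", "kamu"], target))     # kamu (distance 1)
```

The last call illustrates why a system that builds new strings can, in principle, beat the oracle.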
We performed the comparison against a previous system proposed in [15] on the same dataset and experimental conditions as used in [15]. The PMJ dataset was compiled by [16], who also reconstructed the corresponding protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06, partly due to the small number of word forms involved.
To get a more extensive comparison, we considered the hybrid system that returns PRAGUE’s reconstruction when possible and otherwise backs off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.) and finally Javanic (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 using our system against 2.33 for PRAGUE, a statistically significant difference.
We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we compare to the experimental setup on 64 modern languages used to reconstruct POc described before. Encouragingly, while the system’s average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the CENTROID baseline (1.79).
2.3 Cognate recovery
To test the effectiveness of our cognate recovery system, we ran our system on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision, recall, F1, and purity scores, defined as follows. Let $G = \{G_1, G_2, \ldots, G_g\}$ denote the known partitions of the forms into cognates, and let $F = \{F_1, F_2, \ldots, F_f\}$ denote the inferred partitions. Let $\mathrm{pairs}(F)$ denote the set of unordered pairs of indices in the same partition in $F$: $\mathrm{pairs}(F) = \{\{i, j\} : \exists k \text{ s.t. } i, j \in F_k, i \neq j\}$, and similarly for $\mathrm{pairs}(G)$. The metrics are defined as follows, with $N$ the total number of forms:
\[
\begin{aligned}
\text{precision} &= \frac{|\mathrm{pairs}(G) \cap \mathrm{pairs}(F)|}{|\mathrm{pairs}(F)|}, \\
\text{recall} &= \frac{|\mathrm{pairs}(G) \cap \mathrm{pairs}(F)|}{|\mathrm{pairs}(G)|}, \\
F_1 &= 2 \, \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \\
\text{purity} &= \frac{1}{N} \sum_{f} \max_{g} |G_g \cap F_f|.
\end{aligned}
\]
Using these metrics, we found that our system achieved a precision of 84.4, recall of 62.1, F1 of 71.5, and cluster purity of 91.8. Thus, over 9 out of 10 words are correctly grouped, and our system errs on the side of under-grouping words rather than clustering words that are not cognates. Since the null hypothesis in historical linguistics is to deem words to be unrelated unless proven otherwise, a slight under-grouping is the desired behavior.
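The four clustering metrics can be computed directly from two partitions. The sketch below uses hypothetical toy partitions over integer word indices; the function names are illustrative, and N is taken to be the total number of forms in the inferred partition.

```python
from itertools import combinations

def pairs(partition):
    """Unordered pairs of indices that share a group."""
    return {frozenset(p) for group in partition
            for p in combinations(sorted(group), 2)}

def cognate_metrics(gold, inferred):
    """Pairwise precision, recall, F1, and cluster purity."""
    pg, pf = pairs(gold), pairs(inferred)
    common = len(pg & pf)
    precision = common / len(pf) if pf else 0.0
    recall = common / len(pg) if pg else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    n = sum(len(group) for group in inferred)
    purity = sum(max(len(set(g) & set(f)) for g in gold) for f in inferred) / n
    return precision, recall, f1, purity

# Toy example of under-grouping: word 2 is split off from its true group.
gold = [{0, 1, 2}, {3, 4}]
inferred = [{0, 1}, {2}, {3, 4}]
precision, recall, f1, purity = cognate_metrics(gold, inferred)
```

In this toy case precision and purity are perfect while recall drops, which mirrors the under-grouping pattern reported above.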
Since we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. That is, we excluded words that our system found to be isolates. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words using our system (using the auxiliary variables $z$ described in Section 1.2.1).
For evaluation, we then looked at the average edit distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different averaging per cognate group.)
Using known cognates from the ABVD, there was an average reconstruction error of 2.19, versus 2.47 for the automatically reconstructed cognates, or an increase in error rate of 12.8%. The fraction of words with each Levenshtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognates exhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct words faithfully in many cases, only failing in a few instances.
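The relative increase quoted above follows from the two average errors:
\[
\frac{2.47 - 2.19}{2.19} = \frac{0.28}{2.19} \approx 0.128,
\]
that is, an increase of about 12.8% relative to the error obtained with known cognates.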
2.4 Computation of functional loads
To measure functional load quantitatively, we used the same estimator as the one used in [17]. This definition is based on associating a context vector $c_{x,\ell}$ to each phoneme and language. For a given language with $N(\ell)$ phoneme tokens, these context vectors are defined as follows: first, fix an enumeration order of all the contexts found in the corpus, where the context is defined as a pair of phonemes, one at the left and one at the right of a position. Element $i$ in this enumeration will correspond to component $i$ of the context vectors. Then, the value of component $i$ in context vector $c_{x,\ell}$ is set to be the number of times phoneme $x$ occurred in context $i$ and language $\ell$. Finally, King’s definition of functional load $\mathrm{FL}_\ell(x, y)$ is the dot product of the two induced context vectors:
\[
\mathrm{FL}_\ell(x, y) = \frac{1}{N(\ell)^2} \langle c_{x,\ell}, c_{y,\ell} \rangle = \frac{1}{N(\ell)^2} \sum_{i} c_{x,\ell}(i) \times c_{y,\ell}(i),
\]
where the denominator is simply a normalization that ensures $\mathrm{FL}_\ell(x, y) \le 1$. Note that if $x$ and $y$ are in complementary distribution in language $\ell$, then the two vectors $c_{x,\ell}$ and $c_{y,\ell}$ are orthogonal. The functional load is indeed zero in this case.
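The context-vector definition translates into a short computation on a word list. This sketch makes two assumptions that the text does not state: each character is treated as one phoneme, and word edges are padded with a boundary symbol `#` so that initial and final phonemes also have a (left, right) context. The toy word list is hypothetical.

```python
from collections import Counter

def context_vectors(words, boundary="#"):
    """Map each phoneme to a Counter over (left, right) contexts; count tokens."""
    vectors = {}
    n_tokens = 0
    for word in words:
        padded = [boundary] + list(word) + [boundary]
        for i in range(1, len(padded) - 1):
            ctx = (padded[i - 1], padded[i + 1])
            vectors.setdefault(padded[i], Counter())[ctx] += 1
            n_tokens += 1
    return vectors, n_tokens

def functional_load(words, x, y):
    """Dot product of the context vectors of x and y, divided by N squared."""
    vectors, n = context_vectors(words)
    cx, cy = vectors.get(x, Counter()), vectors.get(y, Counter())
    dot = sum(cx[ctx] * cy[ctx] for ctx in cx)
    return dot / n ** 2

words = ["pa", "ta"]                        # toy language with 4 phoneme tokens
fl_pt = functional_load(words, "p", "t")    # shared context (#, a): 1 / 16
fl_pa = functional_load(words, "p", "a")    # complementary distribution: 0
```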
In Figure 3 of the main paper, we show heat maps where the color encodes the log of the number of sound changes that fall into a given 2-dimensional bin. Each sound change $x > y$ is encoded as a pair of numbers in the unit interval, $(l, m)$, where $l$ is an estimate of the functional load of the pair and $m$ is the posterior fraction of the instances of the phoneme $x$ that undergo a change to $y$. We now describe how $l$ and $m$ were estimated. The posterior fraction $m$ for the merger $x > y$ between languages $\mathrm{pa}(\ell) \to \ell$ is easily computed from the same expected sufficient statistics used for parameter estimation:
\[
m_\ell(x > y) = \frac{\sum_{p \in \Sigma} N(S, x, p, \ell, y)}{\sum_{p' \in \Sigma} \sum_{y' \in \Sigma} N(S, x, p', \ell, y')}.
\]
The estimate of the functional load requires additional statistics, i.e. the expected context vectors $c_{x,\ell}$ and expected phoneme token counts $N(\ell)$, but these can be readily extracted from the output of the MCMC sampler. The estimate is then:
\[
l_\ell(x, y) = \frac{1}{N(\ell)^2} \langle c_{x,\ell}, c_{y,\ell} \rangle.
\]
Finally, the set of points used to construct the heat map is:
\[
\Big\{ \big( l_{\mathrm{pa}(\ell)}(x, y), \, m_\ell(x > y) \big) : \ell \in L - \{\text{root}\}, \, x \in \Sigma, \, y \in \Sigma, \, x \neq y \Big\}.
\]
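The merger fraction is a ratio of expected substitution counts, which can be sketched as follows. The dictionary layout, the language label `"lang1"`, and the count values are hypothetical; keys follow the argument order of N(S, x, p, ℓ, y), with "S" marking the substitution operation.

```python
def merger_fraction(counts, x, y, lang):
    """m_lang(x > y) from counts keyed by (op, x, p, lang, y)."""
    num = sum(n for (op, src, _p, l, out), n in counts.items()
              if op == "S" and src == x and l == lang and out == y)
    den = sum(n for (op, src, _p, l, _out), n in counts.items()
              if op == "S" and src == x and l == lang)
    return num / den if den else 0.0

# Hypothetical expected counts: 'p' mostly stays 'p', sometimes becomes 'f'.
counts = {
    ("S", "p", "a", "lang1", "p"): 6.0,
    ("S", "p", "i", "lang1", "p"): 2.0,
    ("S", "p", "a", "lang1", "f"): 1.5,
    ("S", "p", "i", "lang1", "f"): 0.5,
    ("S", "t", "a", "lang1", "k"): 4.0,
}
m_pf = merger_fraction(counts, "p", "f", "lang1")   # (1.5 + 0.5) / 10.0
```

Note that the denominator includes the cases where the phoneme is left unchanged, so the fraction lies in the unit interval.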
3 Reconstruction lists
In Tables S.2–S.4, we show the lists of consensus reconstructions produced by our system (the ‘Automatic’ column). For comparison, we also include a baseline (randomly picking one modern word), and the edit distances (when they are greater than zero). In Proto-Oceanic, since two manual reconstructions are available, we include the distances of the automatic reconstruction to both manual reconstructions (‘P-A’ and ‘B-A’) and the distance between the two manual reconstructions (‘B-P’).
4 Sound changes
In Figures S.2–S.5, we zoom and rotate each quadrant of the tree shown in Figure 2 of the main paper. For information on the most frequent change of each branch, refer to the row in Table S.5 corresponding to the code in parentheses attached to each branch. By most frequent, we mean the change with the highest expected count in the last EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected count of more than 1.0, we skip the corresponding entry; this can be caused, for example, by languages where cognacy information was too sparse in the dataset. The functional load, normalized to be in the interval [0, 1], is also shown for reference.
5 Frequent errors
In Figure S.6, we analyze the frequent discrepancies between the PAn reconstructions from our system and those from [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencies were obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair-HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extracted from the posterior alignments.
References and Notes[1] Hastie T, Tibshiranim R, Friedman J (2009) The Elements of Statistical Learning (Springer).
[2] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the RoyalStatistical Society. Series B (Methodological) 39:1–38.
8
[3] Liu DC, Nocedal J, Dong C (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming45:503–528.
[4] Lunter GA, Miklos I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J.Comp. Biol. 10:869–889.
[5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728.
[6] Holmes I, Bruno WJ (2001) Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820.
[7] Bouchard-Cote A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information ProcessingSystems 21:177–184.
[8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society - Series B 67:235–252.
[9] Judea P (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. (Morgan Kaufmann).
[10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–376.
[11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer).
[12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10.
[13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. EvolutionaryBioinformatics 4:271–283.
[14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition. (SIL International).
[15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quanti-tative Linguistics 7:233–244.
[16] Nothofer B (1975) The reconstruction of Proto-Malayo-Javanic (M. Nijhoff).
[17] King R (1967) Functional load and sound change. Language 43:831–852.
[18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica1:31–94.
[19] Berg-Kirkpatrick T, Bouchard-Côté A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the North American Conference on Computational Linguistics pp 582–590.
Condition                   Edit dist.
Unsupervised full system    1.87
-FAITHFULNESS               2.02
-MARKEDNESS                 2.18
-Sharing                    1.99
-Topology                   2.06
Semi-supervised system      1.75
Table S.1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharing corresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS and MARKEDNESS that are branch-specific; -Topology corresponds to using a flat topology where the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the unsupervised full system) are statistically significant.
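The mean edit distance reported in Table S.1 can be illustrated with a short sketch. It assumes the standard unit-cost Levenshtein distance [12] and treats every character of a transcribed form as one phoneme symbol; the function names `edit_distance` and `mean_edit_distance` are hypothetical, and the exact tokenization and scoring used to produce the table may differ.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost Levenshtein distance between two symbol sequences.

    Works on plain strings (one character per symbol) or on lists of
    symbols when a phoneme is written with more than one character.
    """
    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def mean_edit_distance(pairs: Iterable[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average edit distance over (reference, reconstruction) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(edit_distance(ref, hyp) for ref, hyp in pairs) / len(pairs)


if __name__ == "__main__":
    # Two forms taken from the Proto-Oceanic table (asterisks dropped).
    print(edit_distance("sapa", "saa"))
    print(mean_edit_distance([("ikan", "ikan"), ("sapa", "sava")]))
```

Passing lists of symbols instead of strings lets a multi-character phoneme count as a single unit, which matters whenever the transcription uses digraphs.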
Table S.2. Proto-Austronesian reconstructions
Gloss Reference Baseline Distance Automatic Distance
to hold (1) *gemgem *higem 3 *gemgem 0
smoke (1) *qebel *q1v1l 3 *qebel 0
to scratch (1) *kaRaw *kagaw 1 *kaRaw 0
and (2) *mah *ma 1 *ma 1
leg/foot (1) *qaqay *ai 4 *qaqay 0
shoulder (1) *qabaRa *vala 4 *qabaRa 0
woman/female (1) *bahi *babinay 4 *vavaian 5
left (1) *kawiRi *kayli 3 *kawiRi 0
day (1) *qalejaw *andew 5 *qalejaw 0
mother (1) *tina *qina 1 *tina 0
to suck (1) *sepsep *sipsip 2 *sepsep 0
small (2) *kedi *kedhi 1 *kedi 0
night (1) *beRNi *beyi 2 *beRNi 0
to squeeze (1) *peReq *perah 3 *pereq 1
to hit (1) *palu *mipalo 3 *palu 0
rat (1) *labaw *lapo 3 *kulavaw 3
you (1) *ikamu *kamo 2 *kamu 1
to plant (1) *mula *h1mula 2 *mula 0
to swell (1) *baReq *baReq 0 *baReq 0
to see (1) *kita *kita 0 *kita 0
one (1) *isa *sa 1 *isa 0
to sleep (1) *tuduR *maturug 4 *tuduR 0
dog (1) *wasu *kahu 2 *vatu 2
to pound, beat (20) *tutuh *nutu 2 *tutu 1
stone (1) *batu *batu 0 *batu 0
green (1) *mataq *mataq 0 *mataq 0
father (1) *tama *qama 1 *tama 0
this (1) *ini *eni 1 *ani 1
tooth (1) *nipen *lipon 2 *nipen 0
to choose (1) *piliq *piliP 1 *piliq 0
star (1) *bituqen *bitun 2 *bituqen 0
to buy (23) *baliw *taiw 2 *taiw 2
to vomit (1) *utaq *mutjaq 2 *utaq 0
to work (1) *qumah *quma 1 *quma 1
wide (1) *malawas *Gawig 5 *malabeR 3
to cut, hack (1) *taRaq *taRaq 0 *taRaq 0
to fear (1) *matakut *takuP 3 *matakut 0
to live, be alive (1) *maqudip *maPuNi 3 *maqudip 0
thunder (3) *deRuN *zuN 3 *deruN 1
to fly (2) *layap *layap 0 *layap 0
to shoot (1) *panaq *fanak 2 *panaq 0
name (1) *Najan *ngaza 4 *Najan 0
to buy (1) *beli *poli 2 *beli 0
and (1) *ka *kae 1 *ka 0
when? (1) *ijan *pirang 3 *pijan 1
to dig (1) *kalih *kali 1 *kali 1
ash (1) *qabu *abu 1 *qabu 0
big (1) *maRaya *maRaya 0 *maRaya 0
to cut, hack (3) *tektek *tutek 2 *tektek 0
road/path (1) *zalan *dalan 1 *zalan 0
to stand (1) *diRi *diri 1 *diRi 0
sand (1) *qenay *one 4 *qenay 0
meat/flesh (31) *isi *ici 1 *isi 0
what? (2) *nanu *anu 1 *anu 1
to swell (26) *Ribawa *malifawa 4 *abeh 5
bird (2) *qayam *ayam 1 *qayam 0
hand (1) *lima *lime 1 *lima 0
in, inside (1) *idalem *dalale 4 *idalem 0
if (2) *nu *no 1 *nu 0
to know, be knowledgeable (2) *bajaq *mafanaP 5 *mafanaP 5
at (20) *di *ri 1 *di 0
wind (2) *bali *feli 2 *beliu 2
new (1) *mabaqeRu *baPru 5 *vaquan 6
blood (1) *daRaq *daRaq 0 *daRaq 0
breast (1) *susu *soso 2 *susu 0
i (1) *iaku *ako 2 *iaku 0
salt (1) *qasiRa *sie 4 *qasiRa 0
to flow (1) *qaluR *ilir 4 *qaluR 0
five (1) *lima *lima 0 *lima 0
at (1) *i *i 0 *i 0
other (1) *duma *duma 0 *duma 0
leaf (2) *biRaq *bela 3 *bela 3
all (1) *amin *kemon 3 *amin 0
rotten (1) *mabuRaq *mavuk 4 *mabuRuk 2
to cook (1) *tanek *tan@k 1 *tanek 0
head (1) *qulu *PuGuh 3 *qulu 0
mouth (2) *Nusu *Nutu 1 *Nuju 1
house (1) *Rumaq *uma 2 *Rumaq 0
if (1) *ka *ke 1 *ka 0
neck (1) *liqeR *liq1g 2 *liqeR 0
needle (1) *zaRum *dag1m 3 *zaRum 0
he/she (1) *siia *ia 2 *siia 0
fruit (1) *buaq *buaq 0 *buaq 0
back (1) *likud *likudeP 2 *likud 0
to chew (2) *qelqel *qmelqel 1 *qmelqel 1
salt (2) *timus *timus 0 *timus 0
we (2) *kami *sikami 2 *kami 0
long (1) *inaduq *nandu 3 *anaduq 1
we (1) *ikita *itam 3 *kita 1
three (1) *telu *t1lu 1 *telu 0
lake (1) *danaw *ranu 3 *danaw 0
to eat (1) *kaen *kman 2 *kman 2
no, not (3) *ini *ini 0 *ini 0
where? (1) *inu *sadinno 5 *ainu 1
how? (1) *kuja *gagua 4 *kua 1
to think (34) *nemnem *n1mn1m 2 *kinemnem 2
who? (2) *siima *cima 2 *tima 2
to bite (1) *kaRat *kagat 1 *kaRat 0
tail (1) *ikuR *ikog 2 *ikuR 0
Table S.3. Proto-Oceanic reconstructions
Reconstructions Pairwise distances
Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A
fish (1) *ikan *ikan *ikan 0 0 0
five (1) *lima *lima *lima 0 0 0
what? (1) *sapa *saa *sava 1 1 1
meat/flesh (1) *pisiko *pisako *kiko 1 3 3
star (1) *pituqun *pituqon *vetuqu 1 4 3
fog (1) *kaput *kaput *kabu 0 2 2
to scratch (44) *karu *kadru *kadru 1 0 1
shoulder (1) *paRa *qapaRa *vara 2 4 2
where? (3) *pai *pea *vea 2 1 3
to climb (2) *sake *sake *cake 0 1 1
to eat (1) *kani *kani *kani 0 0 0
two (1) *rua *rua *rua 0 0 0
dry (11) *maca *masa *mamasa 1 2 3
narrow (1) *kopit *kopit *kapi 0 2 2
to dig (1) *keli *keli *keli 0 0 0
bone (2) *suRi *suRi *sui 0 1 1
stone (1) *patu *patu *patu 0 0 0
left (1) *mawiRi *mawiRi *mawii 0 1 1
they (1) *ira *ira *sira 0 1 1
to lie down (1) *qinop *qeno *eno 2 1 3
to hide (1) *puni *puni *vuni 0 1 1
rope (1) *tali *tali *tali 0 0 0
smoke (2) *qasu *qasu *qasu 0 0 0
when? (1) *Naican *Naijan *Nisa 1 3 3
we (2) *kamami *kami *kami 2 0 2
this (1) *ne *ani *eni 2 1 2
egg (1) *qatoluR *katoluR *tolu 1 3 3
stick/wood (1) *kayu *kayu *kai 0 2 2
to sit (16) *nopo *nopo *nofo 0 1 1
to shoot (1) *panaq *pana *pana 1 0 1
liver (1) *qate *qate *qate 0 0 0
needle (1) *saRum *saRum *sau 0 2 2
feather (1) *pulu *pulu *vulu 0 1 1
to pound, beat (2) *tutuk *tuki *tutuk 3 3 0
near (9) *tata *tata *tata 0 0 0
heavy (1) *mamat *mapat *mamava 1 3 2
year (1) *taqun *taqun *taqu 0 1 1
old (1) *matuqa *matuqa *matuqa 0 0 0
fire (1) *api *api *avi 0 1 1
to choose (1) *piliq *piliq *vili 0 2 2
rain (1) *qusan *qusan *usa 0 2 2
to grow (1) *tubuq *tubuq *tubu 0 1 1
to see (1) *kita *kita *kita 0 0 0
to hear (1) *roNoR *roNoR *roNo 0 1 1
to chew (1) *mamaq *mamaq *mama 0 1 1
louse (1) *kutu *kutu *kutu 0 0 0
wind (1) *aNin *mataNi *mataNi 4 0 4
bad, evil (1) *saqat *saqat *saqa 0 1 1
hand (1) *lima *lima *lima 0 0 0
to flow (1) *tape *tape *tave 0 1 1
night (1) *boNi *boNi *boNi 0 0 0
day (5) *qaco *qaco *qaso 0 1 1
to spit (14) *qanusi *qanusi *aNusu 0 3 3
person/human being (1) *taumataq *tamwata *tamata 3 1 2
to vomit (1) *mumutaq *mumuta *muta 1 2 3
name (1) *Najan *qajan *qasa 1 2 3
snake (12) *mwata *mwata *mwata 0 0 0
man/male (1) *mwaRuqane *taumwaqane *mwane 5 5 4
to breathe (1) *manawa *manawa *manawa 0 0 0
far (1) *sauq *sauq *sau 0 1 1
to buy (1) *poli *poli *voli 0 1 1
to vomit (8) *luaq *luaq *lua 0 1 1
to cook (9) *tunu *tunu *tunu 0 0 0
thick (3) *matolu *matolu *matolu 0 0 0
leg/foot (1) *waqe *waqe *waqe 0 0 0
to bite (1) *kaRat *kaRati *karat 1 2 1
leaf (1) *raun *rau *dau 1 1 2
sky (1) *laNit *laNit *laNi 0 1 1
to drink (1) *inum *inum *inum 0 0 0
to stand (2) *tuqur *taqur *tuqu 1 2 1
i (1) *au *au *yau 0 1 1
warm (1) *mapanas *mapanas *mavana 0 2 2
moon (1) *pulan *pulan *vula 0 2 2
how? (1) *kua *kuya *kua 1 1 0
three (1) *tolu *tolu *tolu 0 0 0
to plant (2) *tanum *tanom *tan@m 1 1 1
mosquito (1) *namuk *namuk *namu 0 1 1
bird (1) *manuk *manuk *manu 0 1 1
four (1) *pani *pat *vati 2 2 2
water (2) *waiR *waiR *wai 0 1 1
one (1) *sakai *tasa *sa 3 2 3
skin (1) *kulit *kulit *kulit 0 0 0
to yawn (1) *mawap *mawap *mawa 0 1 1
he/she (1) *ia *ia *ia 0 0 0
nose (1) *isuN *ijuN *isu 1 2 1
thatch/roof (1) *qatop *qatop *qato 0 1 1
to walk (2) *pano *pano *vano 0 1 1
flower (1) *puNa *puNa *vuNa 0 1 1
dust (1) *qapuk *qapuk *avu 0 3 3
neck (18) *Ruqa *Ruqa *ua 0 2 2
eye (1) *mata *mata *mata 0 0 0
father (1) *tama *tamana *tama 2 2 0
to fear (1) *matakut *matakut *matakut 0 0 0
root (2) *wakaRa *wakaR *waka 1 1 2
to stab, pierce (8) *soka *soka *soka 0 0 0
breast (1) *susu *susu *susu 0 0 0
to live, be alive (1) *maqurip *maqurip *maquri 0 1 1
head (1) *qulu *qulu *qulu 0 0 0
thou (1) *ko *iko *kou 1 2 1
fruit (1) *puaq *puaq *vua 0 2 2
Table S.4. Proto-Polynesian reconstructions
Gloss Reference Baseline Distance Automatic Distance
rope (9) *taula *taura 1 *taula 0
short (11) *pukupuku *puPupuPu 2 *pukupuku 0
strike with fist *moto *moko 1 *moto 0
coral *puNa *puNa 0 *puNa 0
cook *tafu *tahu 1 *tafu 0
painful, sick (1) *masaki *mahaki 1 *masaki 0
downwards *hifo *ifo 1 *hifo 0
dive *ruku *uku 1 *ruku 0
to face *haNa *haNa 0 *haNa 0
grasp *kapo *Papo 1 *kapo 0
bay *faNa *hana 2 *faNa 0
tail *siku *siPu 1 *siku 0
to blow (6) *pupusi *pupuhi 1 *pupusi 0
channel *awa *ava 1 *awa 0
pandanus *fara *fala 1 *fara 0
urinate *mimi *mimi 0 *mimi 0
navel *pito *pito 0 *pito 0
gall *Pahu *au 2 *Pahu 0
wave *Nalu *Nalu 0 *Nalu 0
sleep *mohe *moe 1 *mohe 0
warm (1) *mafanafana *mafanafana 0 *mafanafana 0
beak *Nutu *Nutu 0 *Nutu 0
tooth *nifo *niho 1 *nifo 0
dance *saka *haka 1 *saka 0
leg *waPe *wae 1 *waPe 0
drown *lemo *lemo 0 *lemo 0
water *wai *vai 1 *wai 0
nose *isu *isu 0 *isu 0
taro *talo *kalo 1 *talo 0
to hide (1) *funi *huna 2 *suna 2
upwards *hake *aPe 2 *hake 0
overripe *pePe *pee 1 *pePe 0
to sit (16) *nofo *noho 1 *nofo 0
red *kula *Pula 1 *kula 0
to chew (1) *mama *mama 0 *mama 0
three (1) *tolu *tolu 0 *tolu 0
to sleep (10) *mohe *moe 1 *mohe 0
nine *hiwa *iwa 1 *hiwa 0
dawn *ata *ata 0 *ata 0
feather (1) *fulu *fulu 0 *fulu 0
canoe *waka *vaPa 2 *waka 0
left (11) *sema *hema 1 *sema 0
ashes *refu *lehu 2 *refu 0
dew *sau *sau 0 *sau 0
small *riki *riki 0 *riki 0
house *fale *hale 1 *fale 0
voice *lePo *leo 1 *lePo 0
octopus *feke *hePe 2 *feke 0
to climb (2) *kake *aPe 2 *ake 1
sea *tahi *kai 2 *tahi 0
day *Paho *ao 2 *Paho 0
branch *maNa *maNa 0 *maNa 0
love pity *PaloPofa *aroha 5 *PaloPofa 0
fly run *lele *rere 2 *lele 0
thick (3) *matolutolu *matolutolu 0 *matolutoru 1
to strip peel *hisi *hihi 1 *hisi 0
sit dwell *nofo *noho 1 *nofo 0
black *kele *Pele 1 *kele 0
torch *rama *ama 1 *rama 0
Table S.5. Sound changes
Code Parent Child Functional load Number of occurrences
(1) v (ProtoOcean) > p (AdmiraltyIslands) 0.09884 5.67800
(2) a (Northern Dumagat) > 2 (Agta) 5.132e-05 110.26986
(3) q (Bisayan) > P (AklanonBis) 0.03289 46.31015
(4) a (Schouten) > @ (Ali) 7.158e-05 5.45255
(5) e (UlatInai) > a (Alune) 1.00000 1.31832
(6) e (SeramStraits) > o (Amahai) 0.05802 6.91108
(7) a (SouthwestNewBritain) > e (Amara) 0.19090 15.52212
(8) a (CentralWestern) > e (AmbaiYapen) 0.21003 5.74372
(9) i (SeramStraits) > a (Ambon) 0.27139 3.04082
(10) a (EastVanuatu) > e (AmbrymSout) 0.09085 14.42495
(11) s (BimaSumba) > h (Anakalang) 0.03743 10.01598
(12) a (Vanuatu) > e (AnejomAnei) 0.08559 14.43646
(13) f (Futunic) > p (Anuta) 0.09282 19.96156
(14) n (Wetar) > N (Aputai) 0.04788 7.10435
(15) t (WestSanto) > r (ArakiSouth) 0.21996 20.80080
(16) Not enough data available for reliable sound change estimates
(17) d (NorthPapuanMainlandDEntrecasteaux) > t (AreTaupota) 0.06738 2.92590
(18) O (Southern Malaita) > o (AreareMaas) 0.01151 1.99913
(19) O (Southern Malaita) > o (AreareWaia) 0.01151 1.99971
(20) a (Bibling) > e (Aria) 0.29446 4.99706
(21) Not enough data available for reliable sound change estimates
(22) O (SanCristobal) > o (ArosiOneib) 0.00562 1.99971
(23) Not enough data available for reliable sound change estimates
(24) b (ProtoCentr) > p (Aru) 0.13711 9.35892
(25) a (RajaAmpat) > E (As) 0.05156 4.74243
(26) o (Utupua) > u (Asumboa) 0.07562 3.70741
(27) a (Central Manobo) > o (AtaTigwa) 0.00291 7.29063
(28) s (Formosan) > h (Atayalic) 0.04098 4.31225
(29) l (West NuclearTimor) > n (Atoni) 0.23175 7.02772
(30) t (Ibanagic) > q (AttaPamplo) 0.47297 7.87084
(31) a (Suauic) > e (Auhelawa) 0.18692 7.98362
(32) N (MalekulaCentral) > n (Avava) 0.10747 15.49566
(33) a (ProtoCentr) > e (Babar) 0.04891 5.61192
(34) O (Choiseul) > o (Babatana) 0.01953 8.94227
(35) O (Choiseul) > o (BabatanaAv) 0.01953 1.99952
(36) O (Choiseul) > o (BabatanaKa) 0.01953 2.99952
(37) Not enough data available for reliable sound change estimates
(38) O (Choiseul) > o (BabatanaTu) 0.01953 2.99961
(39) v (Ivatan) > b (Babuyan) 0.01574 16.97554
(40) a (BorneoCoastBajaw) > e (Bajo) 0.04774 21.59109
(41) a (NuclearCordilleran) > 2 (Balangaw) 0.00387 37.89097
(42) N (BaliSasak) > n (Bali) 0.19166 13.91349
(43) e (ProtoMalay) > @ (BaliSasak) 6.064e-04 26.22332
(44) h (BimaSumba) > s (Baliledo) 0.03743 7.64100
(45) p (East CentralMaluku) > f (BandaGeser) 0.05559 3.52479
(46) a (Sulawesi) > o (BanggaiWdi) 0.06991 8.61615
(47) a (Palawano) > o (Banggi) 6.735e-04 7.69053
(48) e (LocalMalay) > a (BanjareseM) 0.10667 36.24790
(49) a (SouthNewIrelandNorthwestSolomonic) > o (Banoni) 0.16106 5.69268
(50) r (Sangiric) > d (Bantik) 0.01606 6.99267
(51) b (Sulawesi) > w (Baree) 0.05030 10.71565
(52) q (ProtoMalay) > P (Barito) 0.04087 15.54090
(53) e (Madak) > o (Barok) 0.04576 1.91240
(54) u (Northern EastFormosan) > o (Basai) 0.00130 5.88492
(55) u (BashiicCentralLuzonNorthernMindoro) > o (Bashiic) 0.02423 63.00884
(56) r (NorthernPhilippine) > y (BashiicCentralLuzonNorthernMindoro) 0.01665 3.54936
(57) N (Palawano) > n (BatakPalaw) 0.04752 4.94491
(58) O (SanCristobal) > o (BauroBaroo) 0.00562 1.99981
(59) Not enough data available for reliable sound change estimates
(60) Not enough data available for reliable sound change estimates
(61) p (Vitiaz) > f (Bel) 0.01771 1.06548
(62) u (BerawanLowerBaram) > o (Belait) 0.03125 12.34195
(63) f (Futunic) > h (Bellona) 0.01009 15.97130
(64) t (Pangasinic) > s (Benguet) 0.09759 1.35377
(65) u (BerawanLowerBaram) > o (BerawanLon) 0.03125 13.55252
(66) k (NorthSarawakan) > P (BerawanLowerBaram) 0.15797 10.19426
(67) e (SouthwestNewBritain) > i (Bibling) 0.03719 2.29657
(68) O (RajaAmpat) > o (BigaMisool) 0.03843 5.00642
(69) k (CentralPhilippine) > c (BikolNagaC) 9.303e-05 12.87548
(70) a (Blaan) > O (BilaanKoro) 0.01132 23.88863
(71) N (Blaan) > n (BilaanSara) 0.08320 18.08416
(72) Not enough data available for reliable sound change estimates
(73) u (Northern NuclearBel) > o (Bilibil) 0.03687 2.95970
(74) a (ProtoMalay) > 1 (Bilic) 0.00777 19.43024
(75) O (SouthNewIrelandNorthwestSolomonic) > o (Bilur) 0.00242 1.97890
(76) t (BimaSumba) > d (Bima) 0.08028 9.97797
(77) b (ProtoCentr) > w (BimaSumba) 0.08312 11.67021
(78) a (NorthSarawakan) > e (Bintulu) 0.07029 9.74568
(79) 1 (Manobo) > u (Binukid) 0.03862 2.90829
(80) 1 (CentralPhilippine) > u (Bisayan) 0.01651 18.53014
(81) 1 (Bilic) > a (Blaan) 0.08598 17.06499
(82) Not enough data available for reliable sound change estimates
(83) Not enough data available for reliable sound change estimates
(84) t (GorontaloMongondow) > s (BolaangMon) 0.03159 4.79938
(85) o (TukangbesiBonerate) > O (Bonerate) 0.02521 18.76480
(86) u (Seram) > i (Bonfia) 0.14937 11.37310
(87) u (Bontok) > o (BontocGuin) 0.04335 58.88829
(88) a (BontokKankanay) > 1 (Bontok) 0.10466 3.41602
(89) q (Bontok) > P (BontokGuin) 8.294e-04 49.64130
(90) u (NuclearCordilleran) > o (BontokKankanay) 0.03806 9.39059
(91) u (SuluBorneo) > o (BorneoCoastBajaw) 0.05043 3.53556
(92) v (GelaGuadalcanal) > p (Bughotu) 0.09530 1.99015
(93) a (Bugis) > @ (BugineseSo) 0.00654 17.32980
(94) N (SouthSulawesi) > k (Bugis) 0.10991 2.98167
(95) s (Bughotu) > h (Bugotu) 0.03395 8.55895
(96) e (KayanMurik) > a (Bukat) 0.06056 4.02644
(97) a (SouthHalmahera) > e (Buli) 0.05583 3.50307
(98) o (Vanikoro) > O (Buma) 0.13019 23.27248
(99) a (Sabahan) > o (BunduDusun) 0.05640 1.58162
(100) u (Formosan) > o (Bunun) 2.254e-04 11.98634
(101) u (CentralMaluku) > o (BuruNamrol) 0.01570 19.92112
(102) o (South Bisayan) > u (ButuanTausug) 0.03947 3.20776
(103) u (ButuanTausug) > o (Butuanon) 0.03983 30.32254
(104) r (NorthPapuanMainlandDEntrecasteaux) > l (Bwaidoga) 0.10631 8.92246
(105) a (NewCaledonian) > E (Canala) 0.05974 4.64394
(106) N (ProtoChuuk) > n (Carolinian) 0.04042 23.82083
(107) u (Bisayan) > o (Cebuano) 0.03656 8.68282
(108) l (SouthHalmaheraWestNewGuinea) > r (CenderawasihBay) 0.11907 11.14761
(109) u (EastFormosan) > o (CentralAmi) 4.190e-04 12.82795
(110) u (SouthCentralCordilleran) > o (CentralCordilleran) 0.02941 3.43974
(111) e (ProtoMalay) > @ (CentralEastern) 6.064e-04 32.49321
(112) j (ProtoOcean) > s (CentralEasternOceanic) 0.00410 2.00058
(113) e (BashiicCentralLuzonNorthernMindoro) > i (CentralLuzon) 0.01063 2.14101
(114) R (ProtoCentr) > r (CentralMaluku) 0.02692 12.80932
(115) u (MaselaSouthBabar) > E (CentralMas) 0.04503 4.21108
(116) N (RemoteOceanic) > g (CentralPacific) 0.00869 11.59225
(117) k (Peripheral PapuanTip) > g (CentralPapuan) 0.11515 5.29705
(118) u (MesoPhilippine) > o (CentralPhilippine) 0.01374 38.98219
(119) x (NortheastVanuatuBanksIslands) > k (CentralVanuatu) 0.02455 3.46857
(120) i (CenderawasihBay) > u (CentralWestern) 0.12684 1.99006
(121) u (Bisayan) > o (Central Bisayan) 0.03656 7.64073
(122) l (East Nuclear Polynesian) > r (Central East Nuclear Polynesian) 0.02709 2.35911
(123) a (Manobo) > 1 (Central Manobo) 0.09378 18.50241
(124) a (SantaIsabel) > u (Central SantaIsabel) 0.27278 1.22175
(125) t (Chamic) > k (ChamChru) 0.34269 7.73602
(126) y (Malayic) > i (Chamic) 0.00237 1.73963
(127) b (ProtoMalay) > p (Chamorro) 0.15451 10.03113
(128) 7 (East SantaIsabel) > g (ChekeHolo) 0.03879 10.10942
(129) o (SouthNewIrelandNorthwestSolomonic) > O (Choiseul) 0.00242 8.56855
(130) a (ChamChru) > @ (Chru) 0.02139 31.97181
(131) a (ProtoChuuk) > e (Chuukese) 0.11108 36.12356
(132) a (ProtoChuuk) > e (ChuukeseAK) 0.11108 59.55135
(133) @ (Atayalic) > a (CiuliAtaya) 0.02870 6.00504
(134) e (North Babar) > o (Dai) 0.02409 5.70633
(135) n (North Babar) > l (DaweraDawe) 0.16784 23.27929
(136) N (LandDayak) > n (DayakBakat) 0.26785 7.13031
(137) N (South West Barito) > n (DayakNgaju) 0.16918 29.67016
(138) a (NorthSarawakan) > e (Dayic) 0.07029 5.15706
(139) a (LoyaltyIslands) > e (Dehu) 0.20861 4.46889
(140) g (Bwaidoga) > y (Diodio) 0.08008 12.32717
(141) l (NorthPapuanMainlandDEntrecasteaux) > r (Dobuan) 0.10631 12.30865
(142) p (Southern Malaita) > b (Dorio) 8.555e-04 1.99973
(143) l (Nuclear WestCentralPapuan) > r (Doura) 0.13489 5.63596
(144) u (Northern Dumagat) > o (DumagatCas) 0.01544 18.63984
(145) s (SouthwestMaluku) > h (EastDamar) 0.02479 6.75499
(146) a (CentralPacific) > o (EastFijianPolynesian) 0.23158 1.54042
(147) c (Formosan) > t (EastFormosan) 0.04224 6.87583
(148) b (MaselaSouthBabar) > v (EastMasela) 0.00417 5.01447
(149) i (BimaSumba) > u (EastSumban) 0.17438 28.36485
(150) b (NortheastVanuatuBanksIslands) > v (EastVanuatu) 0.16115 2.78838
(151) e (Barito) > i (East Barito) 0.02495 4.21938
(152) p (CentralMaluku) > f (East CentralMaluku) 0.09093 7.47879
(153) u (Manus) > o (East Manus) 0.01231 4.11061
(154) g (NewGeorgia) > h (East NewGeorgia) 0.02220 5.62094
(155) a (NuclearTimor) > e (East NuclearTimor) 0.14170 3.51835
(156) l (Nuclear Polynesian) > r (East Nuclear Polynesian) 0.04761 79.14382
(157) O (SantaIsabel) > o (East SantaIsabel) 0.02383 4.65059
(158) @ (CentralEastern) > o (EasternMalayoPolynesian) 0.00303 18.76994
(159) a (RemoteOceanic) > e (EasternOuterIslands) 0.14981 13.43554
(160) a (AdmiraltyIslands) > e (Eastern AdmiraltyIslands) 0.16811 7.58136
(161) a (BandaGeser) > o (ElatKeiBes) 0.05446 15.50349
(162) r (SamoicOutlier) > l (Ellicean) 0.03331 5.08687
(163) a (Futunic) > e (Emae) 0.20786 3.97652
(164) e (SouthwestBabar) > E (Emplawas) 0.18331 11.05576
(165) Not enough data available for reliable sound change estimates
(166) i (BimaSumba) > e (EndeLio) 0.04706 5.22061
(167) a (Sumatra) > @ (Enggano) 0.00149 1.61296
(168) Not enough data available for reliable sound change estimates
(169) Not enough data available for reliable sound change estimates
(170) f (Wetar) > h (Erai) 0.03346 7.93007
(171) a (Vanuatu) > e (Erromanga) 0.08559 16.54778
(172) O (SanCristobal) > o (Fagani) 0.00562 1.99961
(173) O (SanCristobal) > o (FaganiAguf) 0.00562 1.99952
(174) O (SanCristobal) > o (FaganiRihu) 0.00562 1.99690
(175) Not enough data available for reliable sound change estimates
(176) u (WesternPlains) > o (Favorlang) 0.01234 20.09559
(177) s (EastFijianPolynesian) > c (FijianBau) 0.04333 9.93879
(178) f (Timor) > w (FloresLembata) 0.01025 7.38989
(179) l (ProtoAustr) > r (Formosan) 0.01978 16.25315
(180) r (Futunic) > l (FutunaAniw) 0.05835 2.94909
(181) r (Futunic) > l (FutunaEast) 0.05835 48.57376
(182) l (SamoicOutlier) > r (Futunic) 0.03331 74.90293
(183) l (WestCentralPapuan) > r (Gabadi) 0.13055 5.24097
(184) u (Ibanagic) > o (Gaddang) 0.01362 18.75014
(185) e (Are) > i (Gapapaiwa) 0.06186 4.23744
(186) a (BimaSumba) > o (GauraNggau) 0.13543 29.26141
(187) a (ProtoMalay) > @ (Gayo) 0.00259 24.49089
(188) r (Northern NuclearBel) > z (Gedaged) 0.02021 7.64916
(189) c (GelaGuadalcanal) > s (Gela) 0.01050 2.08324
(190) k (SoutheastSolomonic) > g (GelaGuadalcanal) 0.01749 19.76693
(191) b (GeserGorom) > w (Geser) 0.06536 3.91631
(192) f (BandaGeser) > w (GeserGorom) 0.05946 4.12917
(193) Not enough data available for reliable sound change estimates
(194) g (Guadalcanal) > G (Ghari) 4.995e-04 9.42542
(195) Not enough data available for reliable sound change estimates
(196) h (Guadalcanal) > G (GhariNggae) 5.670e-05 2.00000
(197) O (Guadalcanal) > o (GhariNgger) 0.00601 2.99863
(198) Not enough data available for reliable sound change estimates
(199) Not enough data available for reliable sound change estimates
(200) a (SouthHalmahera) > o (Giman) 0.11966 11.83686
(201) a (GorontaloMongondow) > o (Gorontalic) 0.12679 8.78982
(202) k (Gorontalic) > P (GorontaloH) 0.00973 18.65349
(203) a (Sulawesi) > o (GorontaloMongondow) 0.06991 26.41042
(204) s (GelaGuadalcanal) > c (Guadalcanal) 0.01050 3.50693
(205) Not enough data available for reliable sound change estimates
(206) Not enough data available for reliable sound change estimates
(207) N (NehanNorthBougainville) > n (Haku) 0.10208 6.26107
(208) 1 (MesoPhilippine) > u (Hanunoo) 0.02599 24.61094
(209) t (Marquesic) > k (Hawaiian) 0.31942 56.48048
(210) u (Peripheral Central Bisayan) > o (Hiligaynon) 0.02028 1.95625
(211) r (Ambon) > l (HituAmbon) 0.03275 16.21990
(212) g (West NewGeorgia) > G (Hoava) 0.00152 7.69893
(213) a (NorthNewGuinea) > e (HuonGulf) 0.13866 4.84173
(214) a (LoyaltyIslands) > e (Iaai) 0.20861 8.25767
(215) u (Malayic) > o (Iban) 0.00621 12.93856
(216) k (NorthernCordilleran) > q (Ibanagic) 0.40890 11.09862
(217) N (Sabahan) > n (Idaan) 0.13834 5.84917
(218) e (Futunic) > a (IfiraMeleM) 0.20786 4.95522
(219) 1 (NuclearCordilleran) > o (Ifugao) 0.01254 19.55025
(220) k (Ifugao) > q (IfugaoAmga) 0.34057 4.76075
(221) k (Ifugao) > q (IfugaoBata) 0.34057 17.99754
(222) 1 (Kallahan) > o (IfugaoBayn) 0.01101 17.75466
(223) n (Wetar) > N (Iliun) 0.04788 10.14036
(224) u (NorthernLuzon) > o (Ilokano) 0.02660 29.66945
(225) a (Peripheral Central Bisayan) > u (Ilonggo) 0.12642 4.07172
(226) a (SouthernCordilleran) > 1 (Ilongot) 0.07296 10.35510
(227) N (Ilongot) > n (IlongotKak) 0.06631 9.13131
(228) a (Ivatan) > e (Imorod) 0.08734 6.03794
(229) n (SouthwestBabar) > m (Imroing) 0.23281 10.35889
(230) u (SamaBajaw) > o (Inabaknon) 0.02973 19.36631
(231) a (LocalMalay) > e (Indonesian) 0.10667 3.03924
(232) u (Benguet) > o (Inibaloi) 0.03849 64.30170
(233) y (MaranaoIranon) > i (Iranun) 0.00959 4.31410
(234) a (Ivatan) > e (Iraralay) 0.08734 11.08683
(235) l (Ivatan) > d (Isamorong) 0.01340 14.95905
(236) t (Ibanagic) > s (IsnegDibag) 0.05600 2.91874
(237) h (Ivatan) > x (Itbayat) 0.00445 31.26009
(238) o (Ivatan) > u (Itbayaten) 5.820e-04 67.10087
(239) u (KalingaItneg) > o (ItnegBinon) 0.02352 60.52923
(240) l (Ivatan) > d (Ivasay) 0.01340 14.96335
(241) g (Bashiic) > y (Ivatan) 0.02574 5.02873
(242) e (Ivatan) > 1 (IvatanBasc) 0.00723 32.77267
(243) q (ProtoMalay) > h (Javanese) 0.07129 15.44980
(244) a (Northern NewCaledonian) > e (Jawe) 0.13215 8.47484
(245) e (West Barito) > o (Kadorih) 7.673e-05 3.80522
(246) O (SanCristobal) > o (Kahua) 0.00562 1.99971
(247) O (SanCristobal) > o (KahuaMami) 0.00562 1.99952
(248) l (Gorontalic) > r (Kaidipang) 0.01053 21.87631
(249) k (KairiruManam) > q (Kairiru) 0.00897 10.55125
(250) u (Schouten) > i (KairiruManam) 0.17315 1.37804
(251) n (Ilongot) > N (KakidugenI) 0.06631 9.38577
(252) r (Mansakan) > l (Kalagan) 0.04611 3.61546
(253) i (MesoPhilippine) > 1 (Kalamian) 0.02195 3.10459
(254) 1 (KalingaItneg) > o (KalingaGui) 0.01076 22.89422
(255) a (CentralCordilleran) > i (KalingaItneg) 0.13957 2.25237
(256) s (Benguet) > h (Kallahan) 0.02667 15.50239
(257) s (Kallahan) > h (KallahanKa) 0.00566 1.92678
(258) 1 (Kallahan) > e (KallahanKe) 0.00403 38.49811
(259) n (BimaSumba) > N (Kambera) 0.00600 25.28750
(260) b (Tsouic) > v (Kanakanabu) 0.01715 4.07361
(261) v (PatpatarTolai) > w (Kandas) 0.03857 3.91962
(262) u (BontokKankanay) > o (KankanayNo) 0.04380 49.14539
(263) n (CentralLuzon) > N (Kapampanga) 0.01456 6.07915
(264) t (Ellicean) > d (Kapingamar) 3.974e-05 35.94977
(265) v (LavongaiNalik) > f (KaraWest) 0.02487 4.21517
(266) a (SouthHalmahera) > e (KasiraIrah) 0.05583 5.86174
(267) e (South West Barito) > E (Katingan) 0.00426 27.03064
(268) I (Pasismanua) > i (KaulongAuV) 0.01894 6.17672
(269) a (Northern EastFormosan) > i (Kavalan) 0.08596 4.99960
(270) b (ProtoMalay) > v (KayanMurik) 1.449e-07 6.64852
(271) a (KayanMurik) > e (KayanUmaJu) 0.06056 5.93656
(272) a (SarmiJayapuraBay) > e (KayupulauK) 0.32087 3.93441
(273) a (FloresLembata) > e (Kedang) 0.10748 14.89342
(274) b (KeiTanimbar) > B (KeiTanimba) 4.198e-05 4.13306
(275) a (SoutheastMaluku) > e (KeiTanimbar) 0.17157 1.74318
(276) a (Dayic) > e (KelabitBar) 0.07691 10.95832
(277) e (East NuclearTimor) > E (Kemak) 2.255e-04 5.94197
(278) i (NorthSarawakan) > e (KenyahLong) 0.03423 6.39511
(279) a (LocalMalay) > o (Kerinci) 0.01029 37.28182
(280) a (KilivilaLouisiades) > e (Kilivila) 0.14778 5.83519
(281) o (Peripheral PapuanTip) > a (KilivilaLouisiades) 0.17286 2.23572
(282) o (Central SantaIsabel) > O (KilokakaYs) 0.02707 5.60509
(283) N (MicronesianProper) > n (Kiribati) 0.04469 10.23912
(284) P (Manam) > k (Kis) 0.07162 3.84299
(285) t (KisarRoma) > k (Kisar) 0.05336 24.56857
(286) f (SouthwestMaluku) > w (KisarRoma) 0.03321 7.75331
(287) a (BimaSumba) > o (Kodi) 0.13543 22.04728
(288) l (ProtoCentr) > r (KoiwaiIria) 0.06793 11.95360
(289) O (Central SantaIsabel) > o (Kokota) 0.02707 31.07342
(290) e (Pesisir) > o (Komering) 0.00325 8.58447
(291) a (Blaan) > O (KoronadalB) 0.01132 30.81323
(292) s (NgeroVitiaz) > r (Kove) 0.06561 7.13972
(293) u (PatpatarTolai) > a (Kuanua) 0.16803 1.98361
(294) O (West NewGeorgia) > o (Kubokota) 0.00573 2.00145
(295) v (Nuclear WestCentralPapuan) > b (Kuni) 0.11533 6.21112
(296) Not enough data available for reliable sound change estimates
(297) a (MicronesianProper) > e (Kusaie) 0.12251 14.33579
(298) O (Northern Malaita) > o (Kwai) 0.00348 1.99952
(299) a (Northern Malaita) > o (Kwaio) 0.18875 8.31055
(300) l (Tanna) > r (Kwamera) 0.05929 25.71039
(301) N (Northern Malaita) > n (KwaraaeSol) 0.03728 9.80996
(302) Not enough data available for reliable sound change estimates
(303) e (MelanauKajang) > @ (Lahanan) 0.00231 14.95868
(304) i (Willaumez) > e (Lakalai) 0.11589 1.41234
(305) r (Nuclear WestCentralPapuan) > l (Lala) 0.13489 4.70309
(306) u (FloresLembata) > o (LamaholotI) 0.02895 10.53359
(307) Not enough data available for reliable sound change estimates
(308) s (BimaSumba) > h (Lamboya) 0.03743 9.36744
(309) o (Bibling) > u (LamogaiMul) 0.12203 5.67124
(310) a (Pesisir) > @ (Lampung) 0.00929 12.80317
(311) e (ProtoMalay) > a (LandDayak) 0.04499 11.18206
(312) Not enough data available for reliable sound change estimates
(313) k (Northern Malaita) > g (Lau) 0.07399 10.37078
(314) Not enough data available for reliable sound change estimates
(315) Not enough data available for reliable sound change estimates
(316) r (NewIreland) > l (LavongaiNalik) 0.13377 4.12434
(317) a (East Manus) > e (Leipon) 0.13133 15.77294
(318) a (Tanna) > e (Lenakel) 0.05393 9.10010
(319) Not enough data available for reliable sound change estimates
(320) h (Gela) > G (LengoGhaim) 5.895e-05 1.98182
(321) Not enough data available for reliable sound change estimates
(322) f (SouthwestMaluku) > w (Letinese) 0.03321 8.50360
(323) n (West Manus) > N (Levei) 0.18931 43.18724
(324) a (LavongaiNalik) > e (LihirSungl) 0.07781 10.60106
(325) i (West Manus) > e (Likum) 0.08687 4.08584
(326) e (EndeLio) > @ (LioFloresT) 0.00290 10.17585
(327) a (Malayan) > e (LocalMalay) 0.09530 18.43396
(328) Not enough data available for reliable sound change estimates
(329) r (Manus) > P (Loniu) 0.00619 8.93647
(330) a (SoutheastIslands) > e (Lou) 0.09902 12.46015
(331) i (LocalMalay) > e (LowMalay) 0.03770 2.99404
(332) a (RemoteOceanic) > e (LoyaltyIslands) 0.14981 15.79651
(333) t (Ellicean) > k (Luangiua) 0.47404 39.73227
(334) e (MonoUruava) > a (LungaLunga) 0.13896 3.85273
(335) O (West NewGeorgia) > o (Lungga) 0.00573 2.00155
(336) O (West NewGeorgia) > o (Luqa) 0.00573 2.00145
(337) b (East Barito) > w (Maanyan) 0.07537 6.13215
(338) a (NewIreland) > e (Madak) 0.16211 8.90472
(339) k (Madak) > g (MadakLamas) 0.05218 9.51948
(340) l (LavongaiNalik) > r (Madara) 0.10006 10.95189
(341) u (ProtoMalay) > o (Madurese) 0.01585 39.16274
(342) l (CentralPapuan) > r (MagoriSout) 0.12605 3.95178
(343) a (Nuclear PapuanTip) > i (Maisin) 0.22933 2.17692
(344) n (SouthSulawesi) > N (Makassar) 0.18370 10.70075
(345) k (MalaitaSanCristobal) > P (Malaita) 0.09152 5.02310
(346) v (SoutheastSolomonic) > f (MalaitaSanCristobal) 0.06368 25.62607
(347) Not enough data available for reliable sound change estimates
(348) N (LocalMalay) > n (MalayBahas) 0.17169 34.31949
(349) p (Malayic) > m (Malayan) 0.17398 3.33605
(350) q (ProtoMalay) > h (Malayic) 0.07129 31.25704
(351) a (Vanuatu) > e (MalekulaCentral) 0.08559 25.93669
(352) a (NortheastVanuatuBanksIslands) > e (MalekulaCoastal) 0.08702 31.59187
(353) a (Vitiaz) > o (Maleu) 0.11223 11.90975
(354) e (Bugis) > a (Maloh) 0.07175 4.70755
(355) u (CentralPhilippine) > o (Mamanwa) 0.01883 38.86565
(356) u (East NuclearTimor) > a (Mambai) 0.34369 4.99285
(357) e (BimaSumba) > a (Mamboru) 0.13262 7.46411
(358) k (KairiruManam) > P (Manam) 0.01548 9.58872
(359) a (Marquesic) > e (Mangareva) 0.26245 3.72483
(360) s (BimaSumba) > c (Manggarai) 2.420e-05 8.94154
(361) r (Tahitic) > l (Manihiki) 9.361e-04 13.67126
(362) N (SouthernPhilippine) > n (Manobo) 0.07976 13.17836
(363) 1 (AtaTigwa) > o (ManoboAtad) 0.03723 66.54667
(364) 1 (AtaTigwa) > o (ManoboAtau) 0.03723 66.57068
(365) 1 (Central Manobo) > a (ManoboDiba) 0.12386 3.76411
(366) g (West Central Manobo) > h (ManoboIlia) 0.03983 13.28241
(367) a (South Manobo) > 1 (ManoboKala) 0.12673 7.79055
(368) 1 (South Manobo) > 2 (ManoboSara) 2.271e-04 58.20252
(369) o (AtaTigwa) > 1 (ManoboTigw) 0.03723 11.93633
(370) a (West Central Manobo) > 1 (ManoboWest) 0.13743 4.54113
(371) l (Mansakan) > r (Mansaka) 0.04611 18.44830
(372) o (CentralPhilippine) > u (Mansakan) 0.01883 21.46241
(373) a (Eastern AdmiraltyIslands) > e (Manus) 0.13652 4.75105
(374) v (Tahitic) > w (Maori) 0.00255 12.29480
(375) l (BorneoCoastBajaw) > w (Mapun) 0.03973 7.82813
(376) u (MaranaoIranon) > o (Maranao) 0.00377 54.21909
(377) 1 (SouthernPhilippine) > e (MaranaoIranon) 0.00422 7.67944
(378) Not enough data available for reliable sound change estimates
(379) Not enough data available for reliable sound change estimates
(380) Not enough data available for reliable sound change estimates
(381) Not enough data available for reliable sound change estimates
(382) s (East NewGeorgia) > c (Marovo) 0.00144 4.87889
(383) a (Marquesic) > e (Marquesan) 0.26245 10.48977
(384) f (Central East Nuclear Polynesian) > h (Marquesic) 0.05235 2.06222
(385) a (MicronesianProper) > e (Marshalles) 0.12251 42.07442
(386) i (South Babar) > E (MaselaSouthBabar) 0.04148 3.02753
(387) e (Northern NuclearBel) > i (Matukar) 0.04103 3.86380
(388) t (Willaumez) > f (Maututu) 0.00143 5.97137
(389) Not enough data available for reliable sound change estimates
(390) Not enough data available for reliable sound change estimates
(391) Not enough data available for reliable sound change estimates
(392) Not enough data available for reliable sound change estimates
(393) Not enough data available for reliable sound change estimates
(394) e (Northern NuclearBel) > i (Megiar) 0.04103 3.96950
(395) b (Nuclear WestCentralPapuan) > p (Mekeo) 0.01057 10.53926
(396) e (Northwest) > a (MelanauKajang) 0.05269 4.10012
(397) a (MelanauKajang) > e (MelanauMuk) 0.05542 8.30760
(398) N (LocalMalay) > n (Melayu) 0.17169 15.57376
(399) e (LocalMalay) > a (MelayuBrun) 0.10667 57.81798
(400) Not enough data available for reliable sound change estimates
(401) r (Vitiaz) > l (Mengen) 0.09236 7.76084
(402) a (WestSanto) > e (Merei) 0.12570 4.40581
(403) u (East Barito) > o (MerinaMala) 0.00538 45.42570
(404) p (WesternOceanic) > b (MesoMelanesian) 0.06671 3.25436
(405) e (ProtoMalay) > 1 (MesoPhilippine) 0.00192 27.69856
(406) s (ProtoMicro) > t (MicronesianProper) 0.14436 10.32542
(407) e (Malayan) > a (Minangkaba) 0.09530 36.65853
(408) b (RajaAmpat) > p (Minyaifuin) 0.08190 5.44290
(409) r (KilivilaLouisiades) > l (Misima) 0.09569 2.85853
(410) u (Malayic) > o (Moken) 0.00621 18.67524
(411) a (Ponapeic) > O (Mokilese) 0.00863 20.21510
(412) k (Bwaidoga) > P (Molima) 0.02226 6.39306
(413) r (MonoUruava) > l (Mono) 0.18018 7.90990
(414) Not enough data available for reliable sound change estimates
(415) Not enough data available for reliable sound change estimates
(416) N (SouthNewIrelandNorthwestSolomonic) > n (MonoUruava) 0.07198 4.50934
(417) t (CenderawasihBay) > P (Mor) 0.00191 7.90698
(418) a (Sulawesi) > o (Mori) 0.06991 15.10951
(419) d (ProtoChuuk) > t (Mortlockes) 0.10821 17.95820
(420) v (EastVanuatu) > w (Mota) 0.04447 4.78867
(421) v (SinagoroKeapara) > h (Motu) 0.01870 13.39891
(422) r (Bibling) > x (Mouk) 3.176e-05 16.89672
(423) u (Sulawesi) > o (MunaButon) 0.04575 3.73293
(424) N (Western Munic) > n (MunaKatobu) 2.731e-04 6.02727
(425) d (UlatInai) > r (MurnatenAl) 0.00181 5.95393
(426) r (ProtoOcean) > l (Mussau) 0.16171 3.95147
(427) a (EastVanuatu) > E (Mwotlap) 0.01111 27.83527
(428) r (EndeLio) > z (Nage) 0.02129 1.96583
(429) i (MalekulaCoastal) > e (Nahavaq) 0.05885 9.63004
(430) n (Willaumez) > l (NakanaiBil) 0.00412 1.02690
(431) a (LavongaiNalik) > @ (Nalik) 0.00211 12.96772
(432) u (CentralVanuatu) > i (Namakir) 0.18127 21.52549
(433) a (MalekulaCentral) > e (Naman) 0.13582 33.25250
(434) b (MalekulaCoastal) > p (Nati) 0.01049 19.54611
(435) r (SoutheastIslands) > l (Nauna) 0.10938 5.60587
(436) a (ProtoMicro) > e (Nauru) 0.13661 5.67814
(437) Not enough data available for reliable sound change estimates
(438) s (NehanNorthBougainville) > h (Nehan) 0.05608 13.16549
(439) N (Nehan) > n (NehanHape) 0.11803 18.93949
(440) a (SouthNewIrelandNorthwestSolomonic) > o (NehanNorthBougainville) 0.16106 4.34745
(441) e (Northern NewCaledonian) > a (Nelemwa) 0.13215 10.43158
(442) o (Utupua) > œ (Nembao) 0.01621 4.98208
(443) a (LoyaltyIslands) > e (Nengone) 0.20861 5.30862
(444) m (Vanuatu) > n (Nese) 0.35703 18.91500
(445) a (MalekulaCentral) > e (Neveei) 0.13582 34.11813
(446) e (LoyaltyIslands) > a (NewCaledonian) 0.20861 1.71479
(447) k (SouthNewIrelandNorthwestSolomonic) > g (NewGeorgia) 0.08381 5.00348
(448) e (MesoMelanesian) > i (NewIreland) 0.06281 3.31247
(449) w (BimaSumba) > v (Ngadha) 4.053e-05 14.09575
(450) i (Aru) > e (NgaiborSAr) 0.02299 6.77784
(451) f (NorthNewGuinea) > w (NgeroVitiaz) 0.01667 3.53470
(452) Not enough data available for reliable sound change estimates
(453) s (Gela) > h (Nggela) 0.05008 17.07009
(454) b (CentralVanuatu) > p (Nguna) 0.06113 6.26520
(455) p (Sumatra) > f (Nias) 0.00205 8.90278
(456) f (NilaSerua) > h (Nila) 0.01125 4.39065
(457) t (TeunNilaSerua) > l (NilaSerua) 0.13648 1.88492
(458) u (Nehan) > w (Nissan) 0.02937 8.95963
(459) a (Tongic) > e (Niue) 0.18204 4.31952
(460) l (North Babar) > n (NorthBabar) 0.16784 6.84401
(461) R (ProtoCentr) > r (NorthBomberai) 0.02692 10.12083
(462) v (WesternOceanic) > p (NorthNewGuinea) 0.12924 8.71798
(463) a (Nuclear PapuanTip) > e (NorthPapuanMainlandDEntrecasteaux) 0.27192 3.13375
(464) a (Northwest) > e (NorthSarawakan) 0.05269 5.44048
(465) e (Babar) > E (North Babar) 0.03978 1.27682
(466) b (Sulawesi) > w (North Minahasan) 0.05030 3.03447
(467) u (Vanuatu) > o (NortheastVanuatuBanksIslands) 0.06513 4.83269
(468) r (NorthernLuzon) > g (NorthernCordilleran) 0.01688 4.17784
(469) m (NorthernPhilippine) > n (NorthernLuzon) 0.15955 5.87812
(470) N (ProtoMalay) > n (NorthernPhilippine) 0.10751 18.96525
(471) o (NorthernCordilleran) > u (Northern Dumagat) 0.01648 1.25949
(472) l (EastFormosan) > n (Northern EastFormosan) 0.14981 5.83985
(473) p (Malaita) > b (Northern Malaita) 0.00727 4.99257
(474) a (NewCaledonian) > o (Northern NewCaledonian) 0.17212 3.00312
(475) b (Bel) > p (Northern NuclearBel) 0.08564 2.04429
(476) N (Sangiric) > n (Northern Sangiric) 0.10504 18.92173
(477) q (ProtoMalay) > P (Northwest) 0.04087 30.80231
(478) l (CentralCordilleran) > k (NuclearCordilleran) 0.14865 1.01516
(479) b (Timor) > f (NuclearTimor) 0.08459 2.80832
(480) l (PapuanTip) > n (Nuclear PapuanTip) 0.10099 4.22343
(481) N (Polynesian) > n (Nuclear Polynesian) 0.01742 5.34206
(482) r (WestCentralPapuan) > l (Nuclear WestCentralPapuan) 0.13055 1.52585
(483) t (Ellicean) > d (Nukuoro) 3.974e-05 48.84748
(484) f (HuonGulf) > w (NumbamiSib) 0.03605 5.99986
(485) t (CenderawasihBay) > k (Numfor) 0.18043 10.67064
(486) f (Seram) > h (Nunusaku) 0.02050 9.64666
(487) a (LocalMalay) > @ (Ogan) 0.00113 22.90283
(488) N (Javanese) > n (OldJavanes) 0.12968 22.81888
(489) t (CentralVanuatu) > r (Orkon) 0.18595 20.59925
(490) O (Southern Malaita) > o (Oroha) 0.01151 1.99952
(491) r (EastVanuatu) > l (PaameseSou) 0.17094 18.23342
(492) s (Formosan) > t (Paiwan) 0.10317 11.06529
(493) a (ProtoMalay) > e (Palauan) 0.04499 11.70163
(494) 1 (Palawano) > @ (PalawanBat) 4.146e-04 36.88428
(495) 1 (MesoPhilippine) > u (Palawano) 0.02599 5.66511
(496) Not enough data available for reliable sound change estimates
(497) u (Pangasinic) > o (Pangasinan) 0.03899 25.70031
(498) Not enough data available for reliable sound change estimates
(499) a (WesternOceanic) > o (PapuanTip) 0.15356 3.56136
(500) N (SouthwestNewBritain) > n (Pasismanua) 0.09977 3.11319
(501) v (PatpatarTolai) > h (Patpatar) 0.00182 8.75307
(502) a (SouthNewIrelandNorthwestSolomonic) > e (PatpatarTolai) 0.15474 6.20403
(503) e (SeramStraits) > i (Paulohi) 0.18832 3.98409
(504) u (Formosan) > o (Pazeh) 2.254e-04 6.81642
(505) f (Tahitic) > h (Penrhyn) 0.02693 5.16875
(506) n (Wetar) > N (Perai) 0.04788 4.13934
(507) u (Central Bisayan) > o (Peripheral Central Bisayan) 0.02827 1.93754
(508) e (PapuanTip) > a (Peripheral PapuanTip) 0.19334 3.54732
(509) q (ProtoMalay) > h (Pesisir) 0.07129 14.80076
(510) v (EastVanuatu) > f (PeteraraMa) 0.01242 17.18743
(511) a (ChamChru) > 1 (PhanRangCh) 0.00636 12.19166
(512) Not enough data available for reliable sound change estimates
(513) v (EastFijianPolynesian) > f (Polynesian) 0.05871 17.52595
(514) a (Ponapeic) > e (Ponapean) 0.16942 10.52294
(515) a (PonapeicTrukic) > e (Ponapeic) 0.13099 19.78210
(516) t (MicronesianProper) > d (PonapeicTrukic) 0.04291 17.55398
(517) e (BimaSumba) > a (Pondok) 0.13262 6.53158
(518) B (TukangbesiBonerate) > F (Popalia) 0.00989 10.98695
(519) e (CentralEastern) > @ (ProtoCentr) 0.00248 3.97165
(520) o (PonapeicTrukic) > a (ProtoChuuk) 0.09341 2.83945
(521) N (ProtoAustr) > n (ProtoMalay) 0.13485 12.99231
(522) v (RemoteOceanic) > f (ProtoMicro) 0.04980 13.02087
(523) a (EasternMalayoPolynesian) > o (ProtoOcean) 0.09981 7.99270
(524) f (SamoicOutlier) > w (Pukapuka) 0.00204 8.90305
(525) a (NorthBomberai) > e (PulauArgun) 0.13318 12.93777
(526) u (ProtoChuuk) > U (PuloAnna) 8.595e-06 28.37331
(527) d (ProtoChuuk) > t (PuloAnnan) 0.10821 26.92081
(528) d (ProtoChuuk) > t (Puluwatese) 0.10821 32.33646
(529) u (KayanMurik) > o (PunanKelai) 0.00695 11.47865
(530) b (Formosan) > v (Puyuma) 0.02080 5.97573
(531) s (EastVanuatu) > h (Raga) 0.03550 19.03806
(532) r (CenderawasihBay) > l (RajaAmpat) 0.10632 10.89494
(533) f (East Nuclear Polynesian) > h (RapanuiEas) 0.05254 6.51440
(534) a (Tahitic) > e (Rarotongan) 0.23947 2.99725
(535) a (ProtoMalay) > @ (RejangReja) 0.00259 33.94479
(536) i (CentralEasternOceanic) > a (RemoteOceanic) 0.30671 1.87121
(537) r (Futunic) > g (Rennellese) 0.01560 46.78448
(538) O (Choiseul) > o (Ririo) 0.01953 8.10758
(539) r (NorthNewGuinea) > z (Riwo) 0.01094 1.23968
(540) r (KisarRoma) > R (Roma) 0.00324 9.86839
(541) k (Nuclear WestCentralPapuan) > h (Roro) 0.04267 7.79458
(542) f (West NuclearTimor) > b (RotiTerman) 0.05510 4.89359
(543) t (WestFijianRotuman) > f (Rotuman) 0.07237 18.85436
(544) O (West NewGeorgia) > o (Roviana) 0.00573 6.86288
(545) u (Formosan) > o (Rukai) 2.254e-04 9.26163
(546) k (Tahitic) > P (Rurutuan) 0.00737 41.17228
(547) u (Palawano) > o (SWPalawano) 7.014e-04 10.93716
(548) f (Southern Malaita) > h (Saa) 0.00833 23.36029
(549) Not enough data available for reliable sound change estimates
(550) O (Southern Malaita) > o (SaaSaaVill) 0.01151 2.00000
(551) O (Southern Malaita) > o (SaaUkiNiMa) 0.01151 1.99961
(552) Not enough data available for reliable sound change estimates
(553) u (Tsouic) > o (Saaroa) 0.00593 29.89765
(554) a (Northwest) > o (Sabahan) 0.01254 7.65235
(555) d (ProtoChuuk) > t (SaipanCaro) 0.10821 30.24687
(556) u (Formosan) > o (Saisiat) 2.254e-04 32.49754
(557) a (Vanuatu) > E (SakaoPortO) 1.415e-05 7.37912
(558) v (Suauic) > h (Saliba) 0.01548 5.59439
(559) e (ProtoMalay) > a (SamaBajaw) 0.04499 11.37100
(560) a (SuluBorneo) > 1 (SamalSiasi) 0.03121 4.55207
(561) u (CentralLuzon) > o (SambalBoto) 0.03570 51.44942
(562) k (SamoicOutlier) > P (Samoan) 7.913e-05 40.57180
(563) r (Nuclear Polynesian) > l (SamoicOutlier) 0.04761 2.52276
(564) Not enough data available for reliable sound change estimates
(565) h (Northern Sangiric) > r (SangilSara) 0.01859 10.57112
(566) q (Northern Sangiric) > P (Sangir) 0.17663 33.66747
(567) P (Northern Sangiric) > q (SangirTabu) 0.17663 5.11841
(568) a (Sulawesi) > e (Sangiric) 0.06360 9.41251
(569) l (SanCristobal) > r (SantaAna) 0.20803 20.89020
(570) O (SanCristobal) > o (SantaCatal) 0.00562 1.99971
(571) s (SouthNewIrelandNorthwestSolomonic) > h (SantaIsabel) 0.01828 6.35726
(572) l (NehanNorthBougainville) > n (SaposaTinputz) 0.13005 8.23827
(573) a (Blaan) > u (SaranganiB) 0.10239 1.65720
(574) o (SarmiJayapuraBay) > a (Sarmi) 0.30507 3.74591
(575) l (NorthNewGuinea) > r (SarmiJayapuraBay) 0.09033 9.81639
(576) a (BaliSasak) > @ (Sasak) 0.07763 7.22450
(577) a (ProtoChuuk) > e (Satawalese) 0.11108 19.48892
(578) a (BimaSumba) > e (Savu) 0.13262 10.66529
(579) n (NorthNewGuinea) > N (Schouten) 0.12461 3.55253
(580) a (Atayalic) > u (Sediq) 0.15623 4.94540
(581) f (Western AdmiraltyIslands) > h (Seimat) 0.03838 5.52135
(582) u (NorthBomberai) > i (Sekar) 0.07251 14.57543
(583) b (SoutheastMaluku) > h (Selaru) 0.00154 5.27609
(584) Not enough data available for reliable sound change estimates
(585) E (Pasismanua) > e (Sengseng) 0.00635 6.51904
(586) r (East CentralMaluku) > l (Seram) 0.17834 9.26050
(587) l (Nunusaku) > r (SeramStraits) 0.08036 17.65607
(588) e (MaselaSouthBabar) > E (Serili) 0.03391 27.97054
(589) f (NilaSerua) > w (Serua) 0.09929 7.71214
(590) N (PatpatarTolai) > n (Siar) 0.11079 7.91352
(591) n (FloresLembata) > N (Sika) 0.12345 10.33270
(592) f (Ellicean) > h (Sikaiana) 0.01500 18.73830
(593) O (West NewGeorgia) > o (Simbo) 0.00573 3.68466
(594) a (CentralPapuan) > o (SinagoroKeapara) 0.25340 1.00322
(595) a (LandDayak) > o (Singhi) 0.01654 17.32791
(596) u (EastFormosan) > o (Siraya) 4.190e-04 8.28489
(597) O (Choiseul) > o (Sisingga) 0.01953 13.70360
(598) Not enough data available for reliable sound change estimates
(599) r (BimaSumba) > z (Soa) 0.00401 6.96601
(600) a (Sarmi) > e (Sobei) 0.23073 10.06460
(601) k (CentralMaluku) > P (Soboyo) 0.00549 7.46123
(602) a (NehanNorthBougainville) > e (Solos) 0.10574 4.14147
(603) Not enough data available for reliable sound change estimates
(604) l (Ambon) > r (SouAmanaTe) 0.03275 3.58352
(605) r (NorthernLuzon) > l (SouthCentralCordilleran) 0.03231 7.74272
(606) n (MaselaSouthBabar) > l (SouthEastB) 0.25859 6.79647
(607) a (CentralVanuatu) > e (SouthEfate) 0.06242 9.08441
(608) v (SouthHalmaheraWestNewGuinea) > p (SouthHalmahera) 0.04413 6.86021
(609) u (EasternMalayoPolynesian) > i (SouthHalmaheraWestNewGuinea) 0.15907 2.91215
(610) i (NewIreland) > u (SouthNewIrelandNorthwestSolomonic) 0.17058 3.74428
(611) u (Sulawesi) > o (SouthSulawesi) 0.04575 8.90368
(612) t (Babar) > k (South Babar) 0.08233 5.96986
(613) l (Bisayan) > y (South Bisayan) 0.06356 5.93967
(614) a (Manobo) > 1 (South Manobo) 0.09378 14.76928
(615) y (West Barito) > i (South West Barito) 0.00602 4.16231
(616) o (Eastern AdmiraltyIslands) > u (SoutheastIslands) 0.01939 2.52762
(617) R (ProtoCentr) > r (SoutheastMaluku) 0.02692 16.51716
(618) r (CentralEasternOceanic) > l (SoutheastSolomonic) 0.18698 6.93482
(619) t (SouthCentralCordilleran) > b (SouthernCordilleran) 0.17112 1.97141
(620) e (ProtoMalay) > 1 (SouthernPhilippine) 0.00192 17.69275
(621) l (Malaita) > n (Southern Malaita) 0.09412 3.02851
(622) o (South Babar) > u (SouthwestBabar) 0.05720 5.18676
(623) b (Timor) > w (SouthwestMaluku) 0.05085 6.39151
(624) i (Vitiaz) > u (SouthwestNewBritain) 0.23570 3.84871
(625) u (Atayalic) > o (SquliqAtay) 0.00681 13.92923
(626) k (Suauic) > P (Suau) 0.02509 20.06806
(627) r (Nuclear PapuanTip) > l (Suauic) 0.04206 7.42557
(628) 1 (Subanun) > o (SubanonSio) 0.03138 42.58704
(629) a (SouthernPhilippine) > 1 (Subanun) 0.05979 13.92604
(630) g (Subanun) > d (SubanunSin) 0.15626 6.97087
(631) e (ProtoMalay) > a (Sulawesi) 0.04499 11.63780
(632) u (SamaBajaw) > o (SuluBorneo) 0.02973 3.89781
(633) e (ProtoMalay) > o (Sumatra) 0.00512 23.81901
(634) N (ProtoMalay) > n (Sunda) 0.10751 21.82474
(635) u (South Bisayan) > o (Surigaonon) 0.03947 24.45484
(636) a (Erromanga) > e (SyeErroman) 0.11628 12.15031
(637) t (SouthSulawesi) > P (TaeSToraja) 0.06897 3.10286
(638) N (Tboli) > n (Tagabili) 0.10214 23.38929
(639) u (CentralPhilippine) > o (TagalogAnt) 0.01883 10.16369
(640) k (Kalamian) > q (TagbanwaAb) 0.42879 5.41452
(641) q (Kalamian) > k (TagbanwaKa) 0.42879 17.73591
(642) u (Tahitic) > o (Tahiti) 0.19238 6.98869
(643) e (Tahitic) > a (TahitianMo) 0.23947 3.68949
(644) a (Tahitic) > e (Tahitianth) 0.23947 2.71372
(645) f (Central East Nuclear Polynesian) > h (Tahitic) 0.05235 6.11830
(646) v (SaposaTinputz) > f (Taiof) 0.04497 6.79178
(647) l (Ellicean) > r (Takuu) 0.01827 30.88407
(648) Not enough data available for reliable sound change estimates
(649) Not enough data available for reliable sound change estimates
(650) Not enough data available for reliable sound change estimates
(651) Not enough data available for reliable sound change estimates
(652) h (Guadalcanal) > G (TalisePole) 5.670e-05 1.97576
(653) f (Wetar) > h (Talur) 0.03346 6.74401
(654) e (Vanikoro) > a (Tanema) 0.28248 10.81075
(655) a (PatpatarTolai) > e (Tanga) 0.10990 12.52910
(656) e (Utupua) > u (Tanimbili) 0.10880 6.71919
(657) a (Vanuatu) > @ (Tanna) 0.01056 13.80984
(658) r (Tanna) > l (TannaSouth) 0.05929 26.23479
(659) a (Vanuatu) > e (Tape) 0.08559 36.62961
(660) Not enough data available for reliable sound change estimates
(661) f (Sarmi) > p (Tarpia) 0.09285 4.76042
(662) o (ButuanTausug) > u (TausugJolo) 0.03983 38.80244
(663) Not enough data available for reliable sound change estimates
(664) a (Bilic) > O (Tboli) 0.01846 18.79298
(665) O (Tboli) > o (TboliTagab) 0.02446 2.78683
(666) O (Vanikoro) > o (Teanu) 0.13019 24.94671
(667) l (SouthwestBabar) > n (TelaMasbua) 0.17076 14.35748
(668) s (SaposaTinputz) > h (Teop) 0.06456 11.64051
(669) N (East NuclearTimor) > n (TetunTerik) 0.02632 3.88184
(670) t (TeunNilaSerua) > P (Teun) 0.01477 8.79107
(671) e (SouthwestMaluku) > E (TeunNilaSerua) 0.00128 6.36734
(672) b (WesternPlains) > f (Thao) 0.01723 9.66201
(673) u (LavongaiNalik) > a (Tiang) 0.23417 7.41175
(674) a (LavongaiNalik) > o (Tigak) 0.10194 2.85222
(675) r (Futunic) > l (Tikopia) 0.05835 3.75175
(676) R (ProtoCentr) > r (Timor) 0.02692 17.70656
(677) e (Dayic) > o (TimugonMur) 0.00467 20.83722
(678) s (Northern Malaita) > 8 (Toambaita) 0.01236 5.00367
(679) k (Sumatra) > h (TobaBatak) 0.01209 10.18636
(680) s (SamoicOutlier) > h (Tokelau) 0.02620 9.71242
(681) g (Guadalcanal) > h (Tolo) 0.04461 9.43634
(682) a (Tongic) > o (Tongan) 0.28401 6.49634
(683) s (Polynesian) > h (Tongic) 0.04938 24.85227
(684) l (North Minahasan) > d (Tonsea) 0.05604 2.89546
(685) b (North Minahasan) > w (Tontemboan) 0.12446 9.61430
(686) n (MonoUruava) > l (Torau) 0.17242 4.80739
(687) a (Tsouic) > o (Tsou) 0.00357 26.52983
(688) d (Formosan) > c (Tsouic) 0.02777 7.43830
(689) n (Tahitic) > N (Tuamotu) 0.00257 17.01166
(690) a (Wetar) > i (Tugun) 0.34504 3.34154
(691) a (MunaButon) > o (TukangbesiBonerate) 0.19256 6.52913
(692) a (LavongaiNalik) > e (TungagTung) 0.07781 8.41577
(693) u (Barito) > o (Tunjung) 0.00125 6.26453
(694) s (Ellicean) > h (Tuvalu) 0.01029 12.42513
(695) p (Are) > f (Ubir) 0.00167 1.99937
(696) Not enough data available for reliable sound change estimates
(697) p (Aru) > f (UjirNAru) 0.01141 10.59153
(698) h (Nunusaku) > b (UlatInai) 0.07007 6.57134
(699) o (Erromanga) > e (Ura) 0.05761 9.45115
(700) l (MonoUruava) > r (Uruava) 0.18018 8.71608
(701) a (EasternOuterIslands) > o (Utupua) 0.19429 6.30089
(702) s (SamoicOutlier) > h (UveaEast) 0.02620 28.16233
(703) r (Futunic) > l (UveaWest) 0.05835 27.69390
(704) r (Futunic) > l (VaeakauTau) 0.05835 41.42212
(705) u (Choiseul) > @ (Vaghua) 3.654e-05 19.13794
(706) Not enough data available for reliable sound change estimates
(707) a (EasternOuterIslands) > e (Vanikoro) 0.33088 4.43031
(708) o (Vanikoro) > e (Vano) 0.10677 4.13392
(709) o (RemoteOceanic) > u (Vanuatu) 0.11711 14.34763
(710) o (Choiseul) > O (Varisi) 0.01953 6.43868
(711) Not enough data available for reliable sound change estimates
(712) r (SinagoroKeapara) > l (Vilirupu) 0.17037 8.15153
(713) a (NgeroVitiaz) > e (Vitiaz) 0.13267 4.47748
(714) s (MesoMelanesian) > d (Vitu) 0.02619 6.27238
(715) t (HuonGulf) > r (Wampar) 0.06174 5.81284
(716) s (BimaSumba) > h (Wanukaka) 0.03743 14.17816
(717) Not enough data available for reliable sound change estimates
(718) v (CenderawasihBay) > w (Waropen) 0.08865 4.74200
(719) r (GeserGorom) > l (Watubela) 0.17521 13.68799
(720) l (AreTaupota) > r (Wedau) 0.04316 3.56623
(721) i (BimaSumba) > e (WejewaTana) 0.04706 11.31057
(722) u (Seram) > v (Werinama) 3.739e-05 7.78342
(723) t (CentralPapuan) > k (WestCentralPapuan) 0.13745 14.10036
(724) a (ProtoCentr) > o (WestDamar) 0.02891 9.89554
(725) f (CentralPacific) > h (WestFijianRotuman) 0.00402 4.95883
(726) k (NortheastVanuatuBanksIslands) > h (WestSanto) 0.01801 2.59859
(727) a (Barito) > e (West Barito) 0.06510 2.67923
(728) a (Central Manobo) > 1 (West Central Manobo) 0.12386 21.47034
(729) a (Manus) > e (West Manus) 0.16506 7.10396
(730) O (NewGeorgia) > o (West NewGeorgia) 0.00631 2.83443
(731) r (NuclearTimor) > l (West NuclearTimor) 0.14621 3.15074
(732) Not enough data available for reliable sound change estimates
(733) 1 (West Central Manobo) > e (WesternBuk) 4.647e-04 76.71920
(734) s (WestFijianRotuman) > c (WesternFij) 0.00926 4.99871
(735) Not enough data available for reliable sound change estimates
(736) a (ProtoOcean) > e (WesternOceanic) 0.12139 2.58873
(737) d (Formosan) > s (WesternPlains) 0.04913 4.48311
(738) s (AdmiraltyIslands) > h (Western AdmiraltyIslands) 0.01688 4.34207
(739) a (MunaButon) > o (Western Munic) 0.19256 11.48592
(740) t (SouthwestMaluku) > k (Wetar) 0.09657 3.25493
(741) r (MesoMelanesian) > l (Willaumez) 0.14060 7.57978
(742) b (CentralWestern) > v (WindesiWan) 0.16193 3.04357
(743) P (Manam) > k (Wogeo) 0.07162 8.86144
(744) k (ProtoChuuk) > g (Woleai) 0.00179 35.44518
(745) a (ProtoChuuk) > e (Woleaian) 0.11108 60.39379
(746) a (Sulawesi) > o (Wolio) 0.06991 8.71005
(747) w (Western Munic) > v (Wuna) 0.00508 2.73532
(748) t (Western AdmiraltyIslands) > P (Wuvulu) 0.00203 11.89389
(749) a (HuonGulf) > e (Yabem) 0.13370 6.51629
(750) a (SamaBajaw) > e (Yakan) 0.04299 27.73964
(751) a (KeiTanimbar) > e (Yamdena) 0.18822 17.50706
(752) o (Bashiic) > u (Yami) 0.00368 5.79971
(753) a (ProtoOcean) > i (Yapese) 0.27352 4.64937
(754) a (Javanese) > e (Yogya) 0.08208 4.09979
(755) o (West SantaIsabel) > O (ZabanaKia) 0.02351 13.69682
Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were hand-annotated by linguists, while Automatic Cognates were found by our system.
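As a rough illustration of the quantity plotted in Figure S.1, the sketch below computes the Levenshtein distance between pairs of word forms and tabulates the percentage of pairs at each distance. It assumes the distance is measured between a reconstructed form and a reference form, which the caption does not spell out; the function names and the word pairs are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter


def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def distance_percentages(pairs):
    """Map each observed distance to the percentage of pairs at that distance."""
    counts = Counter(levenshtein(x, y) for x, y in pairs)
    total = sum(counts.values())
    return {d: 100.0 * n / total for d, n in sorted(counts.items())}


# Hypothetical (reconstruction, reference) pairs, for illustration only.
example_pairs = [("mata", "mata"), ("mata", "mate"),
                 ("api", "hapuy"), ("batu", "batu")]
percentages = distance_percentages(example_pairs)
```

With these made-up pairs, half of the forms match exactly, a quarter differ by one edit, and a quarter by three edits.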
[Tree diagram not reproduced: each branch is labeled with a parenthesized code and each leaf with a language name.]
Figure S.2: Most frequent estimated change on each branch. See Table S.5 for more information; the code in parentheses attached to each branch is the cross-reference into that table.
[Tree diagram not reproduced: branches labeled with parenthesized codes, leaves with language names.]
Ta
lise
(347)
Ma
lan
go
(196)
Gh
ariN
gg
ae
(392)
Mb
irao
(199)
Gh
ariT
an
da
(652)
Ta
lise
Po
le (197)
Gh
ariN
gg
er
(649)
Ta
lise
Ko
o (198)
Gh
ariN
gin
i (189)
(319)
Le
ng
o (320)
Le
ng
oG
ha
im (321)
Le
ng
oP
arip
(453)
Ng
ge
la
(346)
(345)
(473)
(678)
To
am
ba
ita (298)
Kw
ai
(312)
La
ng
ala
ng
a
(299)
Kw
aio
(390)
Mb
ae
ng
gu
u
(313)
La
u (3
89)
Mb
ae
lele
a
(301)
Kw
ara
ae
So
l
(175)
Fa
tale
ka
(315)
La
uW
ala
de
(314)
La
uN
orth
(328)
Lo
ng
gu
(621)
(551)
Sa
aU
kiN
iMa
(550)
Sa
aS
aa
Vill
(490)
Oro
ha
(552)
Sa
aU
law
a
(548)
Sa
a
(142)
Do
rio
(18)
Are
are
Ma
as
(19)
Are
are
Waia
(549)
SaaAulu
Vil
(564)
(58)
Bauro
Baro
o
(59)
Bauro
Haunu
(172)
Fagani
(247)
KahuaM
am
i
(570)
Santa
Cata
l
(22)
Aro
siO
neib
(23)
Aro
siT
awat
(21)
Aro
si
(246)
Kahua
(60)
Bauro
PawaV
(173)
FaganiA
guf
(174)
FaganiR
ihu
(569)
Santa
Ana
(663)
Tawaro
ga
(536)
(709)
(657)
(658)
TannaSouth
(300)
Kwam
era
(318)
Lenakel
(171)
(636)
SyeErro
man
(699)
Ura
(467)
(726)
(15)
Ara
kiS
outh
(402)
Mere
i
(150) (10)
Am
bry
mSout
(420)
Mota
(510)
Pete
rara
Ma
(491)
Paam
eseSou
(427)
Mwotla
p
(531)
Raga
(119)
(607)
South
Efa
te
(489)
Ork
on
(454)
Nguna
(432)
Nam
akir
(352)
(434)
Nati
(429)
Nahavaq
(444)
Nese
(659)
Tape
(12)
AnejomAnei
(557)
SakaoPortO
(351)
(32)
Avava
(445)
Neveei
(433)
Naman
(522)
(406)
(516)
(520)
(528)
Puluwatese
(527)
PuloAnnan
(745)
Woleaian
(577)
Satawalese
(132)
ChuukeseAK
(419)
Mortlockes
(106)
Carolinian
(131)
Chuukese
(603)
Sonsoroles
(555)
SaipanCaro
(526)
PuloAnna
(744)
Woleai
(515)
(411)
Mokilese
(512)
Pingilapes
(514)
Ponapean
(283)
Kiribati
(297)
Kusaie
(385)
Marshalles
(436)
Nauru
(332)
(446)
(474)
(441)
Nelemwa
(244)
Jawe
(105)
Canala
(214)
Iaai
(443)
Nengone
(139)
Dehu
(116)
(725)
(543)
Rotuman
(734)
WesternFij
(146) (513)
(481)
(156) (122)
(645)
(374)Maori
(642)Tahiti
(505)Penrhyn
(689)Tuamotu
(361)Manihiki
(546)Rurutuan
(643)TahitianMo
(534)Rarotongan
(644)Tahitianth
(384)
(383)Marquesan
(359)Mangareva
(209)Hawaiian
(533)
RapanuiEas
(563)
(562)
Samoan
(182)
(63)Bellona
(675)Tikopia
(163)Emae
(13)Anuta
(537)Rennellese
(180)FutunaAniw
(218)IfiraMeleM
(703)UveaWest
(181)FutunaEast
(704)VaeakauTau
(162)
(647)Takuu (694)
Tuvalu (592)
Sikaiana (264)
Kapingamar
(483)Nukuoro
(333)Luangiua
(524)Pukapuka (680)Tokelau (702)
UveaEast
(683) (682)
Tongan (459)
Niue (177)FijianBau
(159) (707)
(666)Teanu (708)Vano (654)Tanema (98)
Buma (701) (26)
Asumboa (656)
Tanimbili (442)
Nembao (1)
(738) (581)
Seimat
(748)
Wuvulu (160)
(616) (330)
Lou (435)
Nauna (373)
(153) (317)
Leipon (598)
SivisaTita
(729) (325)
Likum (323)
Levei
(329)
Loniu
(426)
Mussau
(609)
(608)
(266)
KasiraIrah
(200)
Giman
(97)
Buli
(108)
(417)
Mor
(532) (408)
Minyaifuin
(68)
BigaMisool
(25)
As
(718)
Waropen
(485)
Numfor
(120)
(378)
Marau
(8)
AmbaiYapen
(742)
WindesiWan (
519)
(33
)
(465)
(46
0)
NorthBabar
(13
5)
DaweraDawe
(13
4)
Dai
(61
2)
(62
2)
(66
7)
TelaMasbua
(22
9)
Imroing
(16
4)
Emplawas
(38
6)
(14
8)
EastMasela
(58
8)
Serili
(11
5)
CentralM
as
(60
6)
SouthEastB
(72
4)
WestDamar
(24
)
(66
0)
TaranganBa
(69
7)
UjirNAru
(45
0)
NgaiborSAr
(
114)
(
152)
(
45)
(
192)
(19
1)
Geser
(
719)
Watu
bela
(
161)
ElatK
eiBes
(
586)
(
486)
(
587)
(
9)
(
211)
HituAm
bon
(
604)
SouAmanaTe
(
6)
Amahai
(
503)
Paulohi
(
698)
(
5)
Alune
(
425)
Murn
atenAl
(
86)
Bonfia
(
722)
Werin
ama
(
101)
BuruNam
rol
(
601)
Soboyo
(
77)
(
357)
Mam
boru
(
360)
Manggara
i
(
449)
Ngadha
(
517)
Pondok
(
287)
Kodi
(
721)
Weje
waTana
(
76)
Bima
(
599)
Soa
(
578)
Savu
(
716)
Wanukaka
(
186)
Gaura
Nggau
(
149)
East
Sum
ban
(
44)
Balil
edo
(
308)
Lam
boya
(
166)
(
326)
LioF
lore
sT
(
165)
Ende
(
428)
Nage
(
259)
Kam
bera
(
496)
Palu
eNitun
(
11)
Anakala
ng
(
461)
(
582)
Sekar
(
525)
Pula
uArg
un
(
676)
(
479)
(
731)
(
542)
RotiTerm
an
(
29)
Ato
ni
(
155)
(
356)
Mam
bai
(
669)
Tetu
nTerik
(
277)
Kem
ak
(
178)
(
307)
Lam
ale
rale
(
306)
Lam
aholo
tI
(
591)
Sik
a
(
273)
Kedang
(
623)
(
740)
(
170)
Era
i
(
653)
Talu
r
(
14)
Aputa
i
(
506)
Pera
i
(
690)
Tugun
(
223)
Iliu
n
(
671)
(
457)
(
589)
Seru
a
(
456)
Nila
(
670)
Teun
(
286)
(
285)
Kis
ar
(
540)
Rom
a
(
145)
EastD
am
ar
(
322)
Letinese
(
617)
(
275)
(
751)
Yam
dena
(
274)
KeiT
anim
ba
(
583)
Sela
ru
(
288)
Koiw
aiIria
(
127)
Cham
orr
o
(
509)
(
310)
La
mp
un
g
(
290)
Ko
me
rin
g
(
634)
Su
nd
a
(
535)
Re
jan
gR
eja
(
74)
(
664)
(
665)
Tb
oli
Ta
ga
b
(
638)
Ta
ga
bil
i
(
81)
(
291)
Ko
ron
ad
alB
(
573)
Sa
ran
ga
niB
(
70)
Bil
aa
nK
oro
(
71)
Bil
aa
nS
ara
(179)
(
28)
(
133)
Ciu
liA
tay
a
(
625)
Sq
uli
qA
tay
(
580)
Se
diq
(
100)
Bu
nu
n
(737)
(
672)
Th
ao
(176)
Fa
vo
rla
ng
(545)
Ru
ka
i
(492)
Pa
iwa
n
(688)
(260)
Ka
na
ka
na
bu
(553)
Sa
aro
a
(687)
Ts
ou
(530)
Pu
yu
ma
(147)
(472)
(269)
Ka
va
lan
(54)
Ba
sa
i (109)
Ce
ntr
alA
mi
(596)
Sir
ay
a (504)
Pa
ze
h (556)
Sa
isia
t !
"
œ
#
$
% &
'
(
)
*
+
,
-
f
.
gd
e
/
b
a
n
o
m
0
k
h
i
1
v
u
2
t
s
3
ø
qp 4
5
6 7
z
y
8
x
!
"
œ
#
$
% &
'
(
)
*
+
,
-
f
.
gd
e
/
b
a
n
o
m
0
k
h
i
1
v
u
2
t
s
3
ø
qp 4
5
6 7
z
y
8
x
Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parenthesis attached to each branch.
27
Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructions produced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automatic reconstruction but not in the reference, and vice versa in (C). Horizontal axes: (A) fraction of substitution errors; (B) fraction of insertion errors; (C) fraction of deletion errors.
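
The error categories in the caption of Figure S.6 can be illustrated with a small alignment routine. The following is a minimal sketch, assuming a standard unit-cost edit-distance alignment between a reference form and an automatically reconstructed form; the alignment procedure actually used to produce the figure is not described in this section, and the names `classify_errors` and `fractions` are hypothetical.

```python
from collections import Counter


def classify_errors(reference, hypothesis):
    """Align two phoneme sequences with unit-cost edit distance (an assumed
    alignment, not necessarily the one used for the figure) and return
    Counters of substitutions, insertions, and deletions."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    subs, ins, dels = Counter(), Counter(), Counter()
    i, j = n, m
    # Walk back through the table, recording one optimal alignment.
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    # Pair is (reference phoneme, hypothesized phoneme).
                    subs[(reference[i - 1], hypothesis[j - 1])] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            # Phoneme in the reference but missing from the hypothesis.
            dels[reference[i - 1]] += 1
            i -= 1
        else:
            # Phoneme in the hypothesis but absent from the reference.
            ins[hypothesis[j - 1]] += 1
            j -= 1
    return subs, ins, dels


def fractions(counter):
    """Convert raw error counts into fractions of the total."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()} if total else {}
```

Summing the three counters over all reconstructed words and passing each to `fractions` would give per-phoneme proportions of the kind plotted in the three panels. When several optimal alignments exist, this sketch reports only one of them.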