
A Generative Model of Phonotactics

Richard Futrell
Brain and Cognitive Sciences, Massachusetts Institute of Technology
[email protected]

Adam Albright
Department of Linguistics, Massachusetts Institute of Technology
[email protected]

Peter Graff
Intel Corporation
[email protected]

Timothy J. O’Donnell
Department of Linguistics, McGill University
[email protected]

Abstract

We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2006; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models’ ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.

1 Introduction

People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: Although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language—the phonotactics of the language. Phonotactic restrictions mean that each language uses only a subset of the logically, or even articulatorily, possible strings of phonemes. Admissible phoneme combinations, on the other hand, typically recur in multiple morphemes, leading to redundancy.

It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). To account for such graded judgments, a variety of probabilistic (or, more generally, weighted) models have been proposed to handle phonotactic learning and generalization over the last two decades (see Daland et al. (2011) and below for review). However, inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful such models have been constraint-based, formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out *bn-).

In this paper, by contrast, we adopt a generative approach to modeling phonotactic structure. Our approach harkens back to early work on the sound structure of lexical items which made use of morpheme structure rules or conditions (Halle, 1959; Stanley, 1967; Booij, 2011; Rasin and Katzir, 2014). Such approaches explicitly attempted to model the redundancy within the set of allowable lexical forms in a language. We adopt a probabilistic version of this idea, conceiving of the phonotactic system as the component of the linguistic system which generates the phonological form of lexical items such as words and morphemes.1 Our system learns inventories of reusable phonotactically licit structures from existing lexical items, and assembles new lexical items by combining these learned phonotactic patterns using phonologically plausible structure-building operations. Thus, instead of modeling phonotactic generalizations in terms of constraints, we treat the problem as one of learning language-specific inventories of phonological units and language-specific biases on how these phones are likely to be combined.

Although there have been a number of earlier generative models of phonotactic structure (see Section 4), these models have mostly used relatively simplistic or phonologically implausible representations of phones and phonological structure-building. By contrast, our model is built around three representational assumptions inspired by the generative phonology literature. First, we capture sparsity in the space of feature specifications of phonemes by using feature dependency graphs—an idea inspired by work on feature geometries and the contrastive hierarchy (Clements, 1985; Dresher, 2009). Second, our system can represent phonotactic generalizations not only at the level of fully specified segments, but also allows the storage and reuse of subsegments, inspired by the autosegments and class nodes of autosegmental phonology. Finally, also inspired by autosegmental phonology, we make use of a structure-building operation which is sensitive to tier-based contextual structure.

To model phonotactic learning, we make use of tools from Bayesian nonparametric statistics. In particular, we make use of the notion of lexical memoization (Johnson et al., 2006; Goodman et al., 2008; Wood et al., 2009; O’Donnell, 2015)—the idea that language-specific generalizations can be captured by the storage and reuse of frequent patterns from a linguistically universal inventory. In our case, this amounts to the idea that an inventory of segments and subsegments can be acquired by a learner that stores and reuses commonly occurring segments in particular, phonologically relevant contexts. In short, we view the problem of learning the phoneme inventory as one of concentrating probability mass on those segments which have been observed before, and the problem of phonotactic generalization as one of learning which (sub-)segments are likely in particular tier-based phonological contexts.

1 Ultimately, we conceive of phonotactics as the module of phonology which generates the underlying forms of lexical items, which are then subject to phonological transformations (i.e., transductions). In this work, however, we do not attempt to model transformations from underlying to surface forms.

2 Model Motivations

In this section, we give an overview of how our model works and discuss the phenomena and theoretical ideas that motivate it.

2.1 Feature Dependency Graphs

Most formal models of phonology posit that segments are grouped into sets, known as natural classes, that are characterized by shared articulatory and acoustic properties, or phonological features (Trubetzkoy, 1939; Jakobson et al., 1952; Chomsky and Halle, 1968). For example, the segments /n/ and /m/ are classified with a positive value of a nasality feature (i.e., NASALITY:+). Similarly, /m/ and /p/ can be classified using the labial value of a PLACE feature, PLACE:labial. These features allow compact description of many phonotactic generalizations.2

From a probabilistic structure-building perspective, we need to specify a generative procedure which assembles segments out of parts defined in terms of these features. In this section, we will build up such a procedure starting from the simplest possible procedure and progressing towards one which is more phonologically informed. We will clarify the generative process here using an analogy to PCFGs, but this analogy will break down in later sections.

2 For compatibility with the data sources used in evaluation (Section 5.2), the feature system we use here departs in several ways from standard feature sets: (1) We use multivalent rather than binary-valued features. (2) We represent manner with a single feature, which has values such as vocalic, stop, and fricative. This approach allows us to refer to manners more compactly than in systems that employ combinations of features such as sonorant, continuant, and consonantal. For example, rather than referring to vowels as ‘non-syllabic’, we refer to them using feature value vocalic for the feature MANNER.

The simplest procedure for generating a segment from features is to specify each feature independently. For example, consider the set of feature-value pairs for /t/: {NASALITY:-, PLACE:alveolar, ...}. In a naive generative procedure, one could generate an instance of /t/ by independently choosing values for each feature in the set {NASALITY, PLACE, ...}. We express this process using the and-or graph notation below. Box-shaped nodes—called or-nodes—represent features such as NASALITY, while circular nodes represent groups of features whose values are chosen independently and are called and-nodes.

[And-or graph: a single and-node whose children are the or-nodes NASALITY, ..., PLACE.]

This generative procedure is equivalent (ignoring order) to a PCFG with rules:

SEGMENT → NASALITY ... PLACE
NASALITY → +
NASALITY → -
PLACE → bilabial
PLACE → alveolar
...
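As a concrete illustration of this naive independent-feature procedure, here is a minimal Python sketch. The feature names and value sets are illustrative placeholders rather than the feature system used in the paper.

```python
import random

# Illustrative (not the paper's) feature inventory: each feature's value
# is chosen independently, exactly like the flat PCFG expansion above.
FEATURES = {
    "NASALITY": ["+", "-"],
    "PLACE": ["bilabial", "alveolar", "velar"],
    "VOICE": ["+", "-"],
}

def sample_segment_naive():
    """Sample a feature bundle by choosing every feature value independently."""
    return {feature: random.choice(values) for feature, values in FEATURES.items()}

# e.g. {'NASALITY': '-', 'PLACE': 'alveolar', 'VOICE': '+'}
print(sample_segment_naive())
```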

Not all combinations of feature-value pairs correspond to possible phonemes. For example, while /l/ is distinguished from other consonants by the feature LATERAL, it is incoherent to specify vowels as LATERAL. In order to concentrate probability mass on real segments, our process should optimally assign zero probability mass to these incoherent phonemes. We can avoid specifying a LATERAL feature for vowels by structuring the generative process as below, so that the LATERAL or-node is only reached for consonants:

[And-or graph: a VOCALIC or-node at the root; its consonant arc leads to an and-node A with children LATERAL, ..., and its vowel arc leads to an and-node B with children HEIGHT, ....]

Beyond generating well-formed phonemes, a basic requirement of a model of phonotactics is that it concentrates mass only on the segments in a particular language’s segment inventory. For example, the model of English phonotactics should put zero or nominal mass on any sequence containing the segment /x/, although this is a logically possible phoneme. So our generative procedure for a phoneme must be able to learn to generate only the licit segments of a language, given some probability distributions at the and- and or-nodes. For this task, independently sampling values at and-nodes does not give us a way to rule out particular combinations of features such as those forming /x/.

Our approach to this problem uses the idea of stochastic memoization (or adaptation), in which the results of certain computations are stored and may be probabilistically reused “as wholes,” rather than recomputed from scratch (Michie, 1968; Johnson et al., 2006; Goodman et al., 2008). This technique has been applied to the problem of learning lexical items at various levels of linguistic structure (de Marcken, 1996; Johnson et al., 2006; Goldwater, 2006; O’Donnell, 2015). Given our model so far, applying stochastic memoization is equivalent to specifying an adaptor grammar over the PCFGs described so far (Johnson et al., 2006).

Let f be a stochastic function which samples feature values using the and-or graph representation described above. We apply stochastic memoization to each node. Following Johnson et al. (2007) and Goodman et al. (2008), we use a distribution for probabilistic memoization known as the Dirichlet Process (DP) (Ferguson, 1973; Sethuraman, 1994). Let mem{f} be a DP-memoized version of f. The behavior of a DP-memoized function can be described as follows. The first time we invoke mem{f}, the feature specification of a new segment will be sampled using f. On subsequent invocations, we either choose a value from among the set of previously sampled values (a memo draw), or we draw a new value from f (a base draw). The probability of sampling the ith old value in a memo draw is n_i / (N + θ), where N is the number of tokens sampled so far, n_i is the number of times that value i has been used in the past, and θ > 0 is a parameter of the model. A base draw happens with probability θ / (N + θ). This process induces a bias to reuse items from f which have been frequently generated in the past.
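The reuse dynamic just described can be sketched in a few lines of Python. This is a simplification for illustration (the actual model applies this construction at every node of the feature dependency graph, with inference over seating arrangements); the base sampler in the usage example is a made-up toy.

```python
import random

class DPMem:
    """DP-memoized version of a base sampler f (Chinese-restaurant dynamics).

    A previously generated value i is reused with probability n_i / (N + theta);
    a fresh call to f (a base draw) happens with probability theta / (N + theta).
    """
    def __init__(self, f, theta=1.0):
        self.f = f
        self.theta = theta
        self.values = []  # all tokens sampled so far (N = len(self.values))

    def __call__(self):
        n = len(self.values)
        if random.random() < self.theta / (n + self.theta):
            value = self.f()                    # base draw: compute from scratch
        else:
            value = random.choice(self.values)  # memo draw: reuse an old value
        self.values.append(value)
        return value

# Toy usage: memoizing a uniform sampler over a small alphabet quickly
# concentrates probability mass on a few frequently reused symbols.
mem = DPMem(lambda: random.choice(["p", "t", "k", "a", "i", "u"]), theta=1.0)
print([mem() for _ in range(20)])
```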

We apply mem recursively to the sampling procedure for each node in the feature dependency graph. The more times that we use some particular set of features under a node to generate words in a language, the more likely we are to reuse that set of features in the future in a memo draw. This dynamic leads our model to rapidly concentrate probability mass on the subset of segments which occur in the inventory of a language.

2.2 Class Node Structure

Our use of and-or graphs and lexical memoization to model inter-feature dependencies is inspired by work in phonology on distinctiveness and markedness hierarchies (Kean, 1975; Berwick, 1985; Dresher, 2009). In addition to using feature hierarchies to delineate possible segments, the literature has used these structures to designate bundles of features that have privileged status in phonological description, i.e., feature geometries (Clements, 1985; Halle, 1995; McCarthy, 1988). For example, many analyses group features concerning laryngeal states (e.g., VOICE, ASPIRATION, etc.) under a laryngeal node, which is distinct from the node containing oral place-of-articulation features (Clements and Hume, 1995). These nodes are known as class nodes. In these analyses, features grouped together under the laryngeal class node may covary while being independent of features grouped under the oral class node.

The lexical memoization technique discussed above captures this notion of class node directly, because the model learns an inventory of subsegments under each node.

Consider the feature dependency graph below.

[And-or graph: a root and-node A with two children: an and-node B over NASALITY, VOICE, ..., and a VOCALIC or-node whose vowel arc leads to an and-node C over BACKNESS, HEIGHT, ..., and whose consonant arc leads to further consonantal structure.]

In this graph, the and-node A generates fully specified segments. And-node B can be thought of as generating the non-oral properties of a segment, including voicing and nasality. And-node C is a class node bundling together the oral features of vowel segments.

The features under B are outside of the VOCALIC node, so these features are specified for both consonant and vowel segments. This allows combinations such as voiced nasal consonants, and also rarer combinations such as unvoiced nasal vowels. Because all and-nodes are recursively memoized, our model is able to bind together particular non-oral choices (node B), learning for instance that the combination {NASALITY:+, VOICED:+} commonly recurs for both vowels and consonants in a language. That is, {NASALITY:+, VOICED:+} becomes a high-probability memo draw.

Since the model learns an inventory of fully specified segments at node A, the model could learn one-off exceptions to this generalization as well. For example, it could store at a high level a segment with {NASALITY:+, VOICED:-} along with some other features, while maintaining the generalization that {NASALITY:+, VOICED:+} is highly frequent in base draws. Language-specific phoneme inventories abound with such combinations of class-node-based generalizations and idiosyncrasies. By using lexical memoization at multiple different levels, our model can capture both the broader generalizations described in class node terminology and the exceptions to those generalizations.

2.3 Sequential Structure as Memoization in Context

In Section 2.2, we focused on the role that features play in defining a language’s segment inventory. We gave a phonologically-motivated generative process, equivalent to an adaptor grammar, for phonemes in isolation. However, features also play an important role in characterizing licit sequences. We model sequential restrictions as context-dependent segment inventories. Our model learns a distribution over segments and subsegments conditional on each preceding sequence of (sub)segments, using lexical memoization. Introducing context-dependence means that the model can no longer be formulated as an adaptor grammar.

2.4 Tier-based Interaction

One salient property of sequential restrictions in phonotactics is that segments are often required to bear the same feature values as nearby segments. For example, a sequence of a nasal and a following stop must agree in place features at the end of a morpheme in English. Such restrictions may even be non-local. For example, many languages prefer combinations of vowels that agree in features such as HEIGHT, BACKNESS, or ROUNDING, even across arbitrary numbers of intervening consonants (i.e., vowel harmony).

Figure 1: Tiers defined by class nodes A and B for the context sequence /ak/. See text.

One way to describe these sequential feature interactions is to assume that feature values of one segment in a word depend on values for the same or closely related features in other segments. This is accomplished by dividing segments into subsets (such as consonants and vowels), called tiers, and then making a segment’s feature values preferentially dependent on the values of other segments on the same tier.

Such phonological tiers are often identified with class nodes in a feature dependency graph. For example, a requirement that one vowel identically match the vowel in the preceding syllable would be stated as a requirement that the vowel’s HEIGHT, BACKNESS, and ROUNDING features match the values of the preceding vowel’s features. In this case, the vowels themselves need not be adjacent—by assuming that vowel quality features are not present in consonants, it is possible to say that two vowels are adjacent on a tier defined by the nodes HEIGHT, BACKNESS, and ROUNDING.

Our full generative process for a segment following other segments is the following. We follow the example of the generation of a phoneme conditional on a preceding context of /ak/, shown with simplified featural specifications and tiers in Figure 1.

At each node in the feature dependency graph, we can either generate a fully-specified subsegment for that node (memo draw), or assemble a novel subsegment for that node out of parts defined by the feature dependency graph (base draw). Starting at the root node of the feature dependency graph, we decide whether to do a memo draw or base draw conditional on the previous n subsegments at that node.

So in order to generate the next segment following /ak/ in the example, we start at node A in the next draw from the feature geometry; with some probability we do a memo draw conditioned on /ak/, defined by the red tier. If we decide to do a base draw instead, we then repeat the procedure conditional on the previous n − 1 segments, recursively until we are conditioning on the empty context. That is, we do a memo draw conditional on /k/, or conditional on the empty context. This process of conditioning on successively smaller contexts is a standard technique in Bayesian nonparametric language modeling (Teh, 2006; Goldwater et al., 2006).

At the empty context, if we decide to do a basedraw, then we generate a novel segment by repeat-ing the whole process at each child node, to gen-erate several subsegments. In the example, wewould assemble a phoneme by independently sam-pling subsegments at the nasal/laryngeal node B andthe MANNER node, and then combining them. Cru-cially, the conditioning context consists only of thevalues at the current node in the previous phonemes.So when we sample a subsegment from node B, it isconditional on the previous two values at node B,{ VOICE:+, NASAL:-} and { VOICE:-, NASAL:-},defined by the blue tier in the figure. The processcontinues down the feature dependency graph recur-sively. At the point where the model decides onvowel place features such as height and backness,these will be conditioned only on the vowel placesfeatures of the preceding /a/, with /k/ skipped en-tirely as it does not have values at vowel place nodes.
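The way tier-based contexts skip segments can be made concrete with a short sketch. The helper below is hypothetical (not code from the paper), and the feature names are illustrative: given a class node's feature set, it keeps only the preceding segments that have values under that node, so /k/ contributes nothing to the vowel-place tier in the /ak/ example.

```python
def tier_context(previous_segments, node_features, n=2):
    """Return the rightmost n subsegments of context on the tier defined by
    node_features: keep, for each preceding segment, only the features under
    the given class node, and skip segments that have none of them."""
    tier = []
    for segment in previous_segments:
        subsegment = {f: v for f, v in segment.items() if f in node_features}
        if subsegment:                          # segment participates on this tier
            tier.append(tuple(sorted(subsegment.items())))
    return tuple(tier[-n:])                     # rightmost n tier elements

# Context /ak/: /k/ has no vowel-place features, so only /a/ is visible
# on the vowel-place tier (feature names here are illustrative).
a = {"MANNER": "vocalic", "HEIGHT": "low", "BACKNESS": "back", "VOICE": "+"}
k = {"MANNER": "stop", "PLACE": "velar", "VOICE": "-"}
print(tier_context([a, k], {"HEIGHT", "BACKNESS", "ROUNDING"}))
```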

This section has provided motivations and a walkthrough of our proposed generative procedure for sequences of segments. In the next section, we give the formalization of the model.

3 Formalization of the Models

Here we give a full formal description of our proposed model in three steps. First, in Section 3.1, we formalize the generative process for a segment in isolation. Second, in Section 3.2, we give a formulation of Bayesian nonparametric N-gram models with backoff. Third, in Section 3.3, we show how to drop the generative process for a phoneme into the N-gram model such that tier-based interactions emerge naturally.

3.1 Generative Process for a Segment

A feature dependency graph G is a fully connected, singly rooted, directed, acyclic graph given by the tuple ⟨V, A, t, r⟩, where V is a set of vertices or nodes, A is a set of directed arcs, t is a total function t : V → {and, or}, and r is a distinguished root node in V. A directed arc is a pair ⟨p, c⟩ where the parent p and child c are both elements of V. The function t(n) identifies whether n is an and- or or-node. Define ch(n) to be the function that returns all children of node n, that is, all n′ ∈ V such that ⟨n, n′⟩ ∈ A.

A subgraph G^s of the feature dependency graph G is the graph obtained by starting from node s and retaining only the nodes and arcs reachable by traversing arcs from s. A subsegment p^s is a subgraph rooted in node s for which each or-node contains exactly one outgoing arc. Subsegments represent sampled phone constituents. A segment is a subsegment rooted in r—that is, a fully specified phoneme.

The distribution associated with a subgraph G^s is given by G^s below. G^s is a distribution over subsegments; the distribution for the full graph G^r is a distribution over fully specified segments. We occasionally overload the notation such that G^s(p^s) will refer to the probability mass function associated with distribution G^s evaluated at the subsegment p^s.

$$H^s \sim \mathrm{DP}(\theta^s, G^s) \qquad (1)$$

$$G^s(p^s) = \begin{cases} \prod_{s' \in \mathrm{ch}(s)} H^{s'}(p^{s'}) & t(s) = \textsc{and} \\[4pt] \sum_{s' \in \mathrm{ch}(s)} \psi_{ss'} H^{s'}(p^{s'}) & t(s) = \textsc{or} \end{cases}$$

The first case of the definition covers and-nodes. We assume that the leaves of our feature dependency graph—which represent atomic feature values such as the laryngeal value of a PLACE feature—are childless and-nodes.

The second case of the definition covers or-nodes in the graph, where ψ_{ss′} is the probability associated with choosing outgoing arc ⟨s, s′⟩ from parent or-node s to child node s′. Thus, or-nodes define mixture distributions over outgoing arcs. The mixture weights are drawn from a Dirichlet process. In particular, for or-node s in the underlying graph G, the vector of probabilities over outgoing edges is distributed as follows.

$$\vec{\psi}^{\,s} \sim \mathrm{DP}(\theta^s, \mathrm{UNIFORM}(|\mathrm{ch}(s)|))$$

Note that in both cases the distribution over child subgraphs is drawn from a Dirichlet process, as below, capturing the notion of subsegmental storage discussed above.
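To make the recursion concrete, the sketch below samples from a toy and-or graph with a DP-memoized distribution H^s at every node, restating the DPMem sketch from Section 2.1 so the example is self-contained. The graph, feature values, and the uniform choice of or-arcs in base draws are simplifications (the full model also places a DP prior over the arc probabilities ψ^s).

```python
import random

class DPMem:
    """Reuse previously sampled values with CRP probabilities (see Section 2.1)."""
    def __init__(self, base, theta=1.0):
        self.base, self.theta, self.values = base, theta, []

    def __call__(self):
        n = len(self.values)
        value = (self.base() if random.random() < self.theta / (n + self.theta)
                 else random.choice(self.values))
        self.values.append(value)
        return value

AND, OR = "and", "or"

# Toy feature dependency graph (illustrative, not the paper's Figure 2).
# and-nodes map to a list of children; or-nodes map arc labels to a child
# node, or to None for atomic leaf values.
GRAPH = {
    "SEGMENT": (AND, ["VOICE", "VOCALIC"]),
    "VOICE":   (OR,  {"+": None, "-": None}),
    "VOCALIC": (OR,  {"consonant": "CPLACE", "vowel": "HEIGHT"}),
    "CPLACE":  (OR,  {"labial": None, "alveolar": None, "velar": None}),
    "HEIGHT":  (OR,  {"high": None, "mid": None, "low": None}),
}

MEMS = {}  # one memoized distribution H^s per node s

def sampler(node):
    """Return H^node: a memoized sampler whose base distribution G^node
    recurses over the node's children."""
    if node not in MEMS:
        kind, children = GRAPH[node]
        def base(kind=kind, children=children):
            if kind == AND:
                # and-node: independently sample a subsegment from each child
                return tuple((child, sampler(child)()) for child in children)
            # or-node: pick an outgoing arc (uniform here for simplicity)
            label = random.choice(list(children))
            child = children[label]
            return (label, sampler(child)() if child else None)
        MEMS[node] = DPMem(base)
    return MEMS[node]

# Repeated sampling concentrates mass on a language-specific inventory.
print([sampler("SEGMENT")() for _ in range(3)])
```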

3.2 N-Gram Models with DP-Backoff

Let T be a set of discrete objects (e.g., atomic symbols or structured segments as defined in the preceding sections). Let T* be the set of all finite-length strings which can be generated by combining elements of T under concatenation, ·, including the empty string ε. A context u is any finite string beginning with a special distinguished start symbol and ending with some sequence in T*, that is, u ∈ {start · T*}.

For any string α, define hd(α) to be the function that returns the first symbol in the string, tl(α) to be the function that returns the suffix of α minus the first symbol, and |α| to be the length of α, with hd(ε) = tl(ε) = ε and |ε| = 0. Write the concatenation of two strings α and α′ as α · α′.

Let H_u be a distribution on next symbols—that is, objects in T ∪ {stop}—conditioned on a given context u. For an N-gram model of order N, the probability of a string β in T* is given by K^N_start(β · stop), where K^N_u(α) is defined as:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \varepsilon \\ H_{f_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise,} \end{cases} \qquad (2)$$

where f_n(·) is a context-management function which determines which parts of the left context should be used to determine the probability of the current symbol. In the case of the N-gram models used in this paper, f_n(·) takes a sequence u and returns only the rightmost n − 1 elements from the sequence, or the entire sequence if it has length less than n.
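Equation 2 translates directly into a short recursive function. The conditional distribution H is assumed to be given (e.g., estimated as in the hierarchical backoff scheme below); the start and stop symbols are placeholders.

```python
START, STOP = "<start>", "<stop>"

def string_prob(beta, H, N):
    """Probability of the string beta (an iterable of symbols) under an
    N-gram model, following Equation 2: K^N_start(beta . stop).
    H(context, symbol) is an assumed function returning the conditional
    probability of the next symbol given a context tuple."""
    def K(u, alpha):
        if not alpha:                                   # alpha = empty string
            return 1.0
        head, tail = alpha[0], alpha[1:]
        context = tuple(u[-(N - 1):]) if N > 1 else ()  # f_N(u): rightmost N-1 symbols
        return H(context, head) * K(u + [head], tail)
    return K([START], list(beta) + [STOP])

# Sanity check with a dummy H that is uniform over four outcomes:
print(string_prob("abc", lambda context, symbol: 0.25, N=3))   # 0.25 ** 4
```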

Note two aspects of this formulation of N-gram models. First, H_u is a family of distributions over next symbols or more general objects. Later, we will drop in phonological-feature-based generative processes for these distributions. Second, the function f_n is a parameter of the above definitions. In what follows, we will use a variant of this function which is sensitive to tier-based structure, returning the previous n − 1 elements only on the appropriate tier.

MacKay and Peto (1994) introduced a hierarchical Dirichlet process-based backoff scheme for N-gram models, with generalizations in Teh (2006) and Goldwater et al. (2006). In this setup, the distribution over next symbols given a context u is drawn hierarchically from a Dirichlet process whose base measure is another Dirichlet process associated with context tl(u), and so on, with all draws ultimately backing off into some unconditioned distribution over all possible next symbols. That is, in a hierarchical Dirichlet process N-gram model, H_{f_n(u)} is given as follows.

$$H_{f_n(u)} \sim \begin{cases} \mathrm{DP}(\theta_{f_n(u)}, H_{f_{n-1}(u)}) & n \geq 1 \\ \mathrm{DP}(\theta_{f_n(u)}, \mathrm{UNIFORM}(T \cup \{\textsc{stop}\})) & n = 0 \end{cases}$$
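The backoff scheme can be approximated with a simple count-based sketch: the predictive probability in a context interpolates the counts observed in that context with the prediction of the next-shorter context, bottoming out in a uniform distribution. This omits the Chinese-restaurant seating bookkeeping of the full hierarchical DP (it treats every token as if it opened a table at every level), so it illustrates the idea rather than the exact model.

```python
from collections import defaultdict

class BackoffNgramLM:
    """Count-based approximation of a hierarchical-DP N-gram model:
    P(w | u) = (c(u, w) + theta * P(w | shorter u)) / (c(u) + theta)."""

    def __init__(self, alphabet, order=3, theta=1.0):
        self.symbols = list(alphabet) + ["<stop>"]
        self.order, self.theta = order, theta
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count

    def observe(self, word):
        history = ["<start>"]
        for w in list(word) + ["<stop>"]:
            for n in range(self.order):          # record the token at every backoff level
                context = tuple(history[-n:]) if n else ()
                self.counts[context][w] += 1
            history.append(w)

    def prob(self, history, w, n=None):
        if n is None:
            n = self.order - 1
        if n == 0:
            return 1.0 / len(self.symbols)       # unconditioned uniform base distribution
        context = tuple(history[-n:])
        c = self.counts[context]
        backoff = self.prob(history, w, n - 1)
        return (c[w] + self.theta * backoff) / (sum(c.values()) + self.theta)

lm = BackoffNgramLM("abcdefghijklmnopqrstuvwxyz", order=3)
for w in ["blick", "black", "block"]:
    lm.observe(w)
print(lm.prob(["b", "l"], "i"))   # higher than for an unattested continuation
```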

3.3 Tier-Based Interactions

To make the N-gram model defined in the last section capture tier-based interactions, we make two changes. First, we generalize the generative process H^s from Equation 1 to H^s_u, which generates subsegments conditional on a sequence u. Second, we define a context-truncating function f^s_n(u) which takes a context of segments u and returns the rightmost n − 1 non-empty subsegments whose root node is s. Then we substitute the generative process H^s_{f^s_n(u)} (which applies the context-management function f^s_n(·) to the context u) for H_{f_n(u)} in Equation 2. The resulting probability distribution is:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \varepsilon \\ H^r_{f^r_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise.} \end{cases}$$

K^N_u(α) is the distribution over continuations given a context of segments. Its definition depends on H^s_{f^s_n(u)}, which is the generalization of the generative process for segments H^s to be conditional on some tier-based N-gram context f^s_n(u). H^s_{f^s_n(u)} is:

$$H^s_{f^s_n(u)} \sim \begin{cases} \mathrm{DP}\bigl(\theta^s_{f^s_n(u)},\; H^s_{f^s_{n-1}(u)}\bigr) & n \geq 1 \\ \mathrm{DP}\bigl(\theta^s_{f^s_n(u)},\; G^s_{f^s_N(u)}\bigr) & n = 0 \end{cases}$$

$$G^s_{f^s_n(u)}(p^s) = \begin{cases} \prod_{s' \in \mathrm{ch}(s)} H^{s'}_{f^{s'}_n(u)}(p^{s'}) & t(s) = \textsc{and} \\[4pt] \sum_{s' \in \mathrm{ch}(s)} \psi_{ss'} H^{s'}_{f^{s'}_n(u)}(p^{s'}) & t(s) = \textsc{or.} \end{cases}$$

H^s_{f^s_n(u)} and G^s_{f^s_n(u)} above are mutually recursive functions. H^s_{f^s_n(u)} implements backoff in the tier-based context of previous subsegments; G^s_{f^s_n(u)} implements backoff by going down into the probability distributions defined by the feature dependency graph.

Note that the function H^s_{f^s_n(u)} recursively backs off to the empty context, but its ultimate base distribution is indexed by f^s_N(u), using the global maximum N-gram order N. So when samples are drawn from the feature dependency graph, they are conditioned on non-empty tier-based contexts. In this way, subsegments are generated based on tier-based context and based on featural backoff in an interleaved fashion.

3.4 Inference

We use the Chinese Restaurant Process representation for sampling. Inference in the model is over seating arrangements for observations of subsegments and over the hyperparameters θ for each restaurant. We perform Gibbs sampling on seating arrangements in the Dirichlet N-gram models by removing and re-adding observations in each restaurant. These Gibbs sweeps had negligible impact on model behavior. For the concentration parameter θ, we set a prior Gamma(10, .1). We draw posterior samples using the slice sampler described in Johnson and Goldwater (2009). We draw one posterior sample of the hyperparameters for each Gibbs sweep. In contrast to the Gibbs sweeps, we found resampling hyperparameters to be crucial for achieving the performance described below (Section 5.3).

4 Related Work

Phonotactics has proven a fruitful problem domain for computational models. Most such work has adopted a constraint-based approach, attempting to design a scoring function based on phonological features to separate acceptable forms from unacceptable ones, typically by formulating restrictions or constraints to rule out less-good structures.

This concept has led naturally to the use of undirected (maximum-entropy, log-linear) models. In this class of models, a form is scored by evaluation against a number of predicates, called factors3—for example, whether two adjacent segments have the phonological features VOICE:+ VOICE:-. Each factor is associated with a weight, and the score for a form is the sum of the weights of the factors which are true for the form. The well-known model of Hayes and Wilson (2008) adopts this framework, pairing it with a heuristic procedure for finding explanatory factors while preventing overfitting. Similarly, Albright (2009) assigns a score to forms based on factors defined over natural classes of adjacent segments. Constraint-based models have the advantage of flexibility: it is possible to score forms using arbitrarily complex and overlapping sets of factors. For example, one can state a constraint against adjacent phonemes having features VOICE:+ and LATERAL:+, or any combination of feature values.

3 Factors are also commonly called “features”—a term we avoid to prevent confusion with phonological features.

In contrast, we have presented a model where forms are built out of parts by structure-building operations. From this perspective, the goal of a model is not to rule out bad forms, but rather to discover repeating structures in good forms, such that new forms with those structures can be generated.

In this setting there is less flexibility in how phonological features can affect well-formedness. For a structure-building model to assign “scores” to arbitrary pairs of co-occurring features, there must be a point in the generative process where those features are considered in isolation. Coming up with such a process has been challenging. As a result of this limitation, structure-building models of phonotactics have not generally included rich featural interactions. For example, Coleman and Pierrehumbert (1997) give a probabilistic model for phonotactics where words are generated using a grammar over units such as syllables, onsets, and rhymes. This model does not incorporate fine-grained phonological features such as voicing and place.

In fact, it has been argued that a constraint-based approach is required in order to capture rich feature-based interactions. For example, Goldsmith and Riggle (2012) develop a tier-based structure-building model of Finnish phonotactics which captures nonlocal vowel harmony interactions, but argue that this model is inadequate because it does not assign higher probabilities to forms than an N-gram model, a common baseline model for phonotactics (Daland et al., 2011). They argue that this deficiency is because the model cannot simultaneously model nonlocal vowel-vowel interactions and local consonant-vowel interactions. Because of our tier-based conditioning mechanism (Sections 2.4 and 3.3), our model can simultaneously produce local and nonlocal interactions between features using structure-building operations, and does assign higher probabilities to held-out forms than an N-gram model (Section 5.3). From this perspective, our model can be seen as a proof of concept that it is possible to have rich feature-based conditioning without adopting a constraint-based approach.

While our model can capture featural interactions, it is less flexible than a constraint-based model in that the allowable interactions are specified by the feature dependency graph. For example, there is no way to encode a direct constraint against adjacent phonemes having features VOICE:+ and LATERAL:+. We consider this a strength of the approach: A particular feature dependency graph is a parameter of our model, and a specific scientific hypothesis about the space of likely featural interactions between phonemes, similar to feature geometries from classical generative phonology (Clements, 1985; McCarthy, 1988; Halle, 1995).4

While probabilistic approaches have mostly taken a constraint-based approach, recent formal language theoretic approaches to phonology have investigated what basic parts and structure-building operations are needed to capture realistic feature-based interactions (Heinz et al., 2011; Jardine and Heinz, 2015). We see probabilistic structure-building approaches such as this work as a way to unify the recent formal language theoretic advances in computational phonology with computational phonotactic modeling.

Our model joins other NLP work attempting to do sequence generation where each symbol is generated based on a rich featural representation of previous symbols (Bilmes and Kirchhoff, 2003; Duh and Kirchhoff, 2004), though we focus more on phonology-specific representations. Our and-or graphs are similar to those used in computer vision to represent possible objects (Jin and Geman, 2006).

5 Model Evaluation and Experiments

Here we evaluate some of the design decisions of our model and compare it to a baseline N-gram model and to a widely-used constraint-based model, BLICK. In order to probe model behavior, we also present evaluations on artificial data, and a sampling of “representative forms” preferred by one model as compared to another.

4 We do however note that it may be possible to learn feature hierarchies on a language-by-language basis from universal articulatory and acoustic biases, as suggested by Dresher (2009).

Our model consists of structure-building operations over a learned inventory of subsegments. If our model can exploit more repeated structure in phonological forms than the N-gram model or constraint-based models, then it should assign higher probabilities to forms. The log probability of a form under a model corresponds to the description length of that form under the model; if a model assigns a higher log probability to a form, that means the model is capable of compressing the form more than other models. Therefore, we compare models on their ability to assign high probabilities to phonological forms, as in Goldsmith and Riggle (2012).

5.1 Evaluation of Model Components

We are interested in discovering the extent to which each model component described above—feature dependency graphs (Section 2.1), class node structure (Section 2.2), and tier-based conditioning (Section 2.4)—contributes to the ability of the model to explain wordforms.

To evaluate the contribution of feature dependency graphs, we compare our models with a baseline N-gram model, which represents phonemes as atomic units. For this N-gram model, we use a Hierarchical Dirichlet Process with n = 3.

To evaluate feature dependency graphs with and without articulated class node structure, we compare models using the graph shown in Figure 3 (the minimal structure required to produce well-formed phonemes) to models with the graph shown in Figure 2, which includes phonologically motivated “class nodes”.5

To evaluate tier-based conditioning, we compare models with the conditioning described in Sections 2.4 and 3.3 to models where all decisions are conditioned on the full featural specification of the previous n − 1 phonemes. This allows us to isolate improvements due to tier-based conditioning beyond improvements from the feature hierarchy.

5 These feature dependency graphs differ from those in the exposition in Section 2 in that they do not include a VOCALIC feature, but rather treat vowel as a possible value of MANNER.

[Figure 2 shows a feature dependency graph whose features include duration, laryngeal, nasal, manner, suprasegmental, backness, height, rounding, C place, 2nd art., and lateral, with arcs labelled vowel and otherwise under manner.]

Figure 2: Feature dependency graph with class node structure used in our experiments. Plain text nodes are OR-nodes with no child distributions. The arc marked otherwise represents several arcs, each labelled with a consonant manner such as stop, fricative, etc.

[Figure 3 shows the same features arranged in a flat graph.]

Figure 3: “Flat” feature dependency graph.

5.2 Lexicon Data

The WOLEX corpus provides transcriptions for words in dictionaries of 60 diverse languages, represented in terms of phonological features (Graff, 2012). In addition to words, the dictionaries include some short set phrases, such as of course. We use the featural representation of WOLEX, and design our feature dependency graphs to generate only well-formed phonemes according to this feature system. For space reasons, we present the evaluation of our model on 14 of these languages, chosen based on the quality of their transcribed lexicons and the authors’ knowledge of their phonological systems.

5.3 Held-Out Evaluation

Here we test whether the different model configurations described above assign high probability to held-out forms. This tests the models’ ability to generalize beyond their training data. We train each model on 2500 randomly selected wordforms from a WOLEX dictionary, and compute posterior predictive probabilities for the remaining wordforms from the final state of the model.

Page 10: A Generative Model of Phonotacticspeople.linguistics.mcgill.ca/~timothy.odonnell/papers/futrell.r.2017.pdfwe take a fully generative approach, model-ing a process where forms are built

Language | ngram | flat | cl. node | flat/no tiers | cl. node/no tiers
English | -22.20 | -22.15 | -21.73** | -22.15 | -22.14
French | -18.30 | -18.28 | -17.93** | -18.29 | -18.28
Georgian | -20.21 | -20.17 | -19.64* | -20.18 | -20.18
German | -24.77 | -24.72 | -24.07** | -24.73 | -24.74
Greek | -22.48 | -22.45 | -21.65** | -22.45 | -22.45
Haitian Creole | -16.09 | -16.04 | -15.82** | -16.05 | -16.04
Lithuanian | -19.03 | -18.99 | -18.58* | -18.99 | -18.99
Mandarin | -13.95 | -13.83* | -13.78** | -13.82* | -13.82*
Mor. Arabic | -16.15 | -16.10 | -16.00* | -16.13 | -16.12
Polish | -20.12 | -20.08 | -19.76** | -20.08 | -20.07
Quechua | -14.35 | -14.30 | -13.87* | -14.30 | -14.31
Romanian | -18.71 | -18.68 | -18.32** | -18.69 | -18.68
Tatar | -16.21 | -16.18 | -15.65** | -16.19 | -16.19
Turkish | -18.88 | -18.85 | -18.55** | -18.85 | -18.84

Table 1: Average log posterior predictive probability of a held-out form. “ngram” is the DP Backoff 3-gram model. “flat” models use the feature dependency graph in Figure 3. “cl. node” models use the graph in Figure 2. See text for motivations of these graphs. “no tiers” models condition each decision on the previous phoneme, rather than on tiers of previous features. Asterisks indicate statistical significance according to a t-test comparing with the scores under the N-gram model. * = p < .05; ** = p < .001.

Table 1 shows the average probability of a held-out word under our models and under the N-gram model for one model run.6 For all languages, we get a statistically significant increase in probabilities by adopting the autosegmental model with class nodes and tier-based conditioning. Model variants without either component do not significantly outperform the N-gram model except in Mandarin. The combination of class nodes and tier-based conditioning results in model improvements beyond the contributions of the individual features.

5.4 Evaluation on Artificial Data

Our model outperforms the N-gram model in predicting held-out forms, but it remains to be shown that this performance is due to capturing the kinds of linguistic intuitions discussed in Section 2. An alternative possibility is that the Autosegmental N-gram model, which has many more parameters than a plain N-gram model, can simply learn a more accurate model of any sequence, even if that sequence has none of the structure discussed above. To evaluate this possibility, we compare the performance of our model in predicting held-out linguistic forms to its performance in predicting held-out forms from artificial lexicons which expressly do not have the linguistic structure we are interested in.

If the autosegmental model outperforms the N-gram model even on artificial data with no phonological structure, then its performance on the real linguistic data in Section 5.3 might be overfitting. On the other hand, if the autosegmental model does better on real data but not artificial data, then we can conclude that it is picking up on some real distinctive structure of that data.

6 The mean standard deviation per form of log probabilities over 50 runs of the full model ranged from .09 for Amharic to .23 for Dutch.

For each real lexicon L_r, we generate an artificial lexicon L_a by training a DP 3-gram model on L_r and forward-sampling |L_r| forms. Additionally, the forms in L_a are constrained to have the same distribution over lengths as the forms in L_r. The resulting lexicons have no tier-based or featural interactions except insofar as they arise by chance from the N-gram model trained on these lexica. For each L_a we then train our models on the first 2500 forms and score the probabilities of the held-out forms, following the same procedure as in Section 5.3.
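A sketch of this control procedure, under the assumption that `sample_form` forward-samples one form from an N-gram model already trained on the real lexicon; the length constraint is implemented here by simple rejection, which is one possible reading of the matching step.

```python
from collections import Counter

def artificial_lexicon(real_lexicon, sample_form):
    """Build an artificial lexicon with the same size and length distribution
    as real_lexicon, by forward-sampling forms and keeping one sampled form
    per needed length (rejection on length)."""
    needed = Counter(len(form) for form in real_lexicon)   # target length counts
    artificial = []
    while len(artificial) < len(real_lexicon):
        form = sample_form()                               # assumed trained sampler
        if needed[len(form)] > 0:
            artificial.append(form)
            needed[len(form)] -= 1
    return artificial
```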

We ran this procedure for all the lexicons shown in Table 1. For all but one lexicon, we find that the autosegmental models do not significantly outperform the N-gram models on artificial data. The exception is Mandarin Chinese, where the average log probability of an artificial form is −13.81 under the N-gram model and −13.71 under the full autosegmental model. The result suggests that the anomalous behavior of Mandarin Chinese in Section 5.3 may be due to overfitting.

When exposed to data that explicitly does not have autosegmental structure, the model is not more accurate than a plain sequence model for almost all languages. But when exposed to real linguistic data, the model is more accurate. This result provides evidence that the generative model developed in Section 2 captures true distributional properties of lexicons that are absent in N-gram distributions, such as featural and tier-based interactions.

5.5 Comparison with a Constraint-Based Model

Here we provide a comparison with Hayes and Wilson (2008)’s Phonotactic Learner, which outputs a phonotactic grammar in the form of a set of weighted constraints on feature co-occurrences. This grammar is optimized to match the constraint violation profile in a training lexicon, and so can be seen as a probabilistic model of that lexicon. The authors have distributed one such grammar, BLICK, as a “reference point for phonotactic probability in experimentation” (Hayes, 2012). Here we compare our model against BLICK on its ability to assign probabilities to forms, as in Section 5.3.

Ideally, we would simply compute the probability of forms as we did in our earlier model comparisons. BLICK returns scores for each form. However, since the probabilistic model underlying BLICK is undirected, these scores are in fact unnormalized log probabilities, so they cannot be compared directly to the normalized probabilities assigned by the other models. Furthermore, because the probabilistic model underlying BLICK does not penalize forms for length, the normalizing constant over all forms is in fact infinite, making straightforward comparison of predictive probabilities impossible. Nevertheless, we can turn BLICK scores into probabilities by conditioning on further constraints, such as the length k of the form. We enumerate all possible forms of length k to compute the normalizing constant for the distribution over forms of that length. The same procedure can also be used to compute the probabilities of each form, conditioned on the length of the form k, under the N-gram and Autosegmental models.
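A sketch of the length-conditioned normalization, assuming `score` returns an unnormalized log score (a BLICK score or a model log probability) for a string; this brute-force enumeration is only feasible for small k, which is why the comparison stops at length 5.

```python
import itertools
import math

def length_conditioned_logprob(form, score, alphabet):
    """Log probability of `form` given its length k, obtained by normalizing
    exp(score) over all |alphabet|**k strings of that length."""
    k = len(form)
    scores = [score("".join(chars)) for chars in itertools.product(alphabet, repeat=k)]
    m = max(scores)                                  # log-sum-exp for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return score(form) - log_z
```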

To compare our models against BLICK, we calculate conditional probabilities for forms of length 2 through 5 from the English lexicon.7 The forms are those in the WOLEX corpus; we include them for this evaluation if they are k symbols long in the WOLEX representation. For our N-gram and Autosegmental models, we use the same models as in Section 5.3. The average probabilities of forms under the three models are shown in Table 2. For length 3–5, the autosegmental model assigns the highest probabilities, followed by the N-gram model and BLICK. For length 2, BLICK outperforms the DP N-gram model but not the autosegmental model.

Our model assigns higher probabilities to short forms than BLICK. That is, our models have identified more redundant structure in the forms than BLICK, allowing them to compress the data more. However, the comparison is imperfect in several ways.

7 Enumerating and scoring the 22,164,361,129 possible forms of length 6 was computationally impractical.

Length | BLICK | ngram | cl. node
2 | -6.50 | -6.81 | -5.18
3 | -9.38 | -8.76 | -7.95
4 | -14.1 | -11.7 | -11.4
5 | -18.1 | -14.2 | -13.9

Table 2: Average log posterior predictive probability of an English form of fixed length under BLICK and our models.

English N-gram | English Full Model
collaborationist | mistrustful
a posteriori | inharmoniousness
sacristy | absentmindedness
matter of course | blamelessness
earnest money | phlegmatically

Table 3: Most representative forms for the N-gram model and for our full model (“cl. node” in Table 1) in English. Forms are presented in native orthography, but were scored based on their phonetic form.

First, BLICK and our models were trained on different data; it is possible that our training data are more representative of our test data than BLICK’s training data were. Second, BLICK uses a different underlying featural decomposition than our models; it is possible that our feature system is more accurate. Nevertheless, these results show that our model concentrates more probability mass on (short) forms attested in a language, whereas BLICK likely spreads its probability mass more evenly over the space of all possible (short) strings.

5.6 Representative Forms

In order to get a sense of the differences between models, we investigate which phonological forms are preferred by different kinds of models. These forms might be informative about the phonotactic patterns that our model is capturing which are not well-represented in simpler models. We calculate the representativeness of a form f with respect to model m1 as opposed to m2 as p(f | m1)/p(f | m2) (Good, 1965; Tenenbaum and Griffiths, 2001). The forms that are most “representative” of model m1 are not the forms to which m1 assigns the highest probability, but rather the forms that m1 ranks highest relative to m2.
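In log space this is just a difference of scores; the two log-probability functions below are assumed to be given by trained models.

```python
def representativeness(form, logprob_m1, logprob_m2):
    """log [ p(form | m1) / p(form | m2) ]: the forms ranked highest are those
    that m1 prefers most relative to m2, not those m1 scores highest outright."""
    return logprob_m1(form) - logprob_m2(form)

# Most representative forms of m1 relative to m2 in a lexicon:
# sorted(lexicon, key=lambda f: representativeness(f, lp_full, lp_ngram), reverse=True)[:5]
```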

Tables 3 and 4 show forms from the lexicon that are most representative of our full model and of the N-gram model for English and Turkish.

Page 12: A Generative Model of Phonotacticspeople.linguistics.mcgill.ca/~timothy.odonnell/papers/futrell.r.2017.pdfwe take a fully generative approach, model-ing a process where forms are built

Turkish N-gram | Turkish Full Model
üstfamilya | büyükkarapınar
dekstrin | kızılcapınar
mnemotekni | altınpınar
ekskavatör | sarımehmetler
foksterye | karaelliler

Table 4: Most representative forms for the N-gram and Autosegmental models in Turkish.

The most uniquely representative forms for our full model are morphologically complex forms consisting of many productive, frequently reused morphemes such as -ness. On the other hand, the representative forms for the N-gram model include foreign forms such as a posteriori (for English) and ekskavatör (for Turkish), which are not built out of parts that frequently repeat in those languages. The representative forms suggest that the full model places more probability mass on words which are built out of highly productive, phonotactically well-formed parts.

6 Discussion

We find that our models succeed in assigning high probabilities to unseen forms, that they do so specifically for linguistic forms and not random sequences, that they tend to favor forms with many productive parts, and that they perform comparably to a state-of-the-art constraint-based model in assigning probabilities to short forms.

The improvement for our models over the N-gram baseline is consistent but not large. We attribute this to the way in which phonological generalizations are used in the present model: in particular, phonological generalizations function primarily as a form of backoff for a sequence model. Our models have lexical memoization at each node in a feature dependency graph; as such, the top node in the graph ends up representing transitional probabilities for whole phonemes conditioned on previous phonemes, and the rest of the feature dependency graph functions as a backoff distribution. When a model has been exposed to many training forms, its behavior will be largely dominated by the N-gram-like behavior of the top node. In future work it might be effective to learn an optimal backoff procedure which gives more influence to the base distribution (Duh and Kirchhoff, 2004; Wood and Teh, 2009).

While the tier-based conditioning in our model would seem to be capable of modeling nonlocal interactions such as vowel harmony, we have not found that the models do well at reproducing these nonlocal interactions. We believe this is because the model’s behavior is dominated by nodes high in the feature dependency graph. In any case, a simple Markov model defined over tiers, as we have presented here, might not be enough to fully model vowel harmony. Rather, a model of phonological processes, transducing underlying forms to surface forms, seems like a more natural way to capture these phenomena.

We stress that this model is not tied to a particular feature dependency graph. In fact, we believe our model provides a novel way of testing different hypotheses about feature structures, and could form the basis for learning the optimal feature hierarchy for a given data set. The choice of feature dependency graph has a large effect on what featural interactions the model can represent directly. For example, neither feature dependency graph has shared place features for consonants and vowels, so the model has limited ability to represent place-based restrictions on consonant-vowel sequences, such as requirements for labialized or palatalized consonants in the context of /u/ or /i/. These interactions can be treated in our framework if vowels and consonants share place features, as in Padgett (2011).

7 Conclusion

We have presented a probabilistic generative model for sequences of phonemes defined in terms of phonological features, based on representational ideas from generative phonology and tools from Bayesian nonparametric modeling. We consider our model a proof of concept that probabilistic structure-building models can include rich featural interactions. Our model robustly outperforms an N-gram model on simple metrics, and learns to generate forms consisting of highly productive parts. We also view this work as a test of the scientific hypotheses that phonological features can be organized in a hierarchy and that they interact along tiers: in our model evaluation, we found that both concepts were necessary to get an improvement over a baseline N-gram model.

Page 13: A Generative Model of Phonotacticspeople.linguistics.mcgill.ca/~timothy.odonnell/papers/futrell.r.2017.pdfwe take a fully generative approach, model-ing a process where forms are built

Acknowledgments

We would like to thank Tal Linzen, Leon Bergen, Edward Flemming, Edward Gibson, Bob Berwick, Jim Glass, and the audiences at MIT’s Phonology Circle, SIGMORPHON, and the LSA 2016 Annual Meeting for helpful comments. This work was supported in part by NSF DDRIG Grant #1551543 to R.F.

References

Adam Albright. 2009. Feature-based generalization as a source of gradient acceptability. Phonology, 26:9–41.

Robert C. Berwick. 1985. The Acquisition of Syntactic Knowledge. MIT Press, Cambridge, MA.

Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–Short Papers–Volume 2, pages 4–6. Association for Computational Linguistics.

Geert Booij. 2011. Morpheme structure constraints. In The Blackwell Companion to Phonology. Blackwell.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1:97–138.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York, NY.

George N. Clements and Elizabeth V. Hume. 1995. The internal organization of speech sounds. In The Handbook of Phonological Theory, pages 245–306. Blackwell, Oxford.

George N. Clements. 1985. The geometry of phonological features. Phonology Yearbook, 2:225–252.

John Coleman and Janet B. Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In John Coleman, editor, Proceedings of the 3rd Meeting of the ACL Special Interest Group in Computational Phonology, pages 49–56, Somerset, NJ. Association for Computational Linguistics.

Robert Daland, Bruce Hayes, James White, Marc Garellek, Andreas Davis, and Ingrid Normann. 2011. Explaining sonority projection effects. Phonology, 28:197–234.

Carl de Marcken. 1996. Linguistic structure as composition and perturbation. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 335–341. Association for Computational Linguistics.

B. Elan Dresher. 2009. The Contrastive Hierarchy in Phonology. Cambridge University Press.

Kevin Duh and Katrin Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING 2004, Geneva, Switzerland.

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30(3):859–896.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18, pages 459–466, Cambridge, MA. MIT Press.

Sharon Goldwater. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.

Irving John Good. 1965. The Estimation of Probabilities. MIT Press, Cambridge, MA.

Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: A language for generative models. In Uncertainty in Artificial Intelligence, Helsinki, Finland. AUAI Press.

Peter Graff. 2012. Communicative Efficiency in the Lexicon. Ph.D. thesis, Massachusetts Institute of Technology.

Morris Halle. 1959. The Sound Pattern of Russian: A Linguistic and Acoustical Investigation. Mouton, The Hague, The Netherlands.

Morris Halle. 1995. Feature geometry and feature spreading. Linguistic Inquiry, 26(1):1–46.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Bruce Hayes. 2012. BLICK - a phonotactic probability calculator.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based strictly local constraints for phonology. In The 49th Annual Meeting of the Association for Computational Linguistics.

Roman Jakobson, C. Gunnar M. Fant, and Morris Halle. 1952. Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. The MIT Press, Cambridge, Massachusetts and London, England.

Adam Jardine and Jeffrey Heinz. 2015. A concatenation operation to derive autosegmental graphs. In Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015), pages 139–151, Chicago, USA, July.

Ya Jin and Stuart Geman. 2006. Context and hierarchy in a probabilistic image model. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pages 2145–2152.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In NAACL-HLT 2009, pages 317–325. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641–648.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems 19, Cambridge, MA. MIT Press.

Mary-Louise Kean. 1975. The Theory of Markedness in Generative Grammar. Ph.D. thesis, Massachusetts Institute of Technology.

David J. C. MacKay and Linda C. Bauman Peto. 1994. A hierarchical Dirichlet language model. Natural Language Engineering, 1:1–19.

John McCarthy. 1988. Feature geometry and dependency: A review. Phonetica, 43:84–108.

Donald Michie. 1968. “Memo” functions and machine learning. Nature, 218:19–22.

Timothy J. O’Donnell. 2015. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, Cambridge, Massachusetts and London, England.

Jaye Padgett. 2011. Consonant-vowel place feature interactions. In The Blackwell Companion to Phonology, pages 1761–1786. Blackwell Publishing, Malden, MA.

Ezer Rasin and Roni Katzir. 2014. A learnability argument for constraints on underlying representations. In Proceedings of the 45th Annual Meeting of the North East Linguistic Society (NELS 45), Cambridge, Massachusetts.

Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650.

Richard Stanley. 1967. Redundancy rules in phonology. Language, 43(2):393–436.

Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.

Joshua B. Tenenbaum and Thomas L. Griffiths. 2001. The rational basis of representativeness. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, pages 1036–1041.

Nikolai S. Trubetzkoy. 1939. Grundzüge der Phonologie. Number 7 in Travaux du Cercle Linguistique de Prague. Vandenhoeck & Ruprecht, Göttingen.

Frank Wood and Yee Whye Teh. 2009. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, pages 607–614.

Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. 2009. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada.