Minimum Description Length
An Adequate Syntactic Theory?
Mike Dowman
3 June 2005
Chomsky's Conceptualization of Language Acquisition
[Diagram: Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language; Linguistic Theory characterizes the Language Acquisition Device]
Diachronic Theories
Hurford's Diachronic Spiral
[Diagram: Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language → Arena of Language Use → Primary Linguistic Data, forming a cycle]
Learnability
Poverty of the Stimulus
Language is really complex
Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.
Examples of E-language don't give sufficient information to determine these rules
WH-movement
Who_i do you think Lord Emsworth will invite t_i?
Who_i do you think that Lord Emsworth will invite t_i?
Who_i do you think t_i will arrive first?
* Who_i do you think that t_i will arrive first?
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear can we just assume they’re not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Learnability Proofs
Gold (1967): for languages to be learnable in the limit, we must have:
• Negative evidence
• or a priori restrictions on possible languages
But learnable in the limit means being sure that we have determined the correct language
Statistical Learnability
Horning (1969):
• If grammars are statistical, so utterances are produced with frequencies corresponding to the grammar, languages are learnable
• But we can never be sure when the correct grammar has been found
• This just gets more likely as we see more data
Horning’s Proof
• Used Bayes’ rule
• More complex grammars are less probable a priori P(h)
• Statistical grammars can assign probabilities to data P(d | h)
• Search through all possible grammars, starting with the simplest
P(h | d) ∝ P(h) P(d | h)
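To make this concrete, here is a minimal sketch (the grammar names and all numbers are invented for illustration, not Horning's actual procedure) of comparing two candidate grammars by Bayes' rule: each grammar's unnormalized log posterior is its log prior, fixed by its coding length, plus the log probability it assigns to the data.

```python
import math

# Hypothetical grammars: coding length in bits (giving the prior, P(h) = 2^-bits)
# and the probability each assigns to a single observed sentence.
grammars = {
    "memorize": {"bits": 120.0, "sentence_prob": 1 / 3},   # lists the 3 seen sentences
    "general":  {"bits": 40.0,  "sentence_prob": 1 / 50},  # allows 50 sentence types
}
n_sentences = 3  # observed data: three sentences

def log_posterior(g):
    """log2 P(h) + log2 P(d | h), up to a shared normalizing constant."""
    return -g["bits"] + n_sentences * math.log2(g["sentence_prob"])

for name, g in grammars.items():
    print(f"{name}: unnormalized log2 posterior = {log_posterior(g):.1f}")
print("preferred:", max(grammars, key=lambda n: log_posterior(grammars[n])))
```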
MDL
Horning’s evaluation method for grammars can be seen as a form of Minimum Description Length
Simplest is best (Occam's Razor)
Simplest means specifiable with the least amount of information
Information theory (Shannon, 1948) allows us to link probability and information:
Amount of information = -log Probability
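As a quick illustration of the probability-information link, this snippet computes description lengths in bits:

```python
import math

def information_bits(probability: float) -> float:
    """Shannon information: an event with probability p carries -log2(p) bits."""
    return -math.log2(probability)

print(information_bits(1 / 8))  # 3.0 bits
print(information_bits(0.99))   # ~0.014 bits: near-certain events are nearly free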
Encoding Grammars and Data
[Diagram: a single bit string; the first part encodes the grammar, the rest encodes the data in terms of that grammar, and a decoder reads both]
A → B C
B → D E
E → {kangaroo, aeroplane, comedian}
D → {the, a, some}
C → {died, laughed, burped}
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
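As a sanity check on this toy grammar, we can enumerate its language and compute the cost of coding one sentence in terms of it (a sketch; coding each sentence as a uniform choice over the language is my assumption, not part of the slide):

```python
import itertools
import math

D = ["the", "a", "some"]                   # D -> {the, a, some}
E = ["kangaroo", "aeroplane", "comedian"]  # E -> {kangaroo, aeroplane, comedian}
C = ["died", "laughed", "burped"]          # C -> {died, laughed, burped}

# A -> B C and B -> D E, so every sentence is determiner, noun, verb.
sentences = [" ".join(words) for words in itertools.product(D, E, C)]
print(len(sentences), "sentences, e.g.", sentences[0])  # 27 sentences
print("bits per sentence:", math.log2(len(sentences)))  # ~4.75 bits
```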
Complexity and Probability
• More complex grammar → longer coding length, so lower probability
• More restrictive grammar → fewer choices for data, so each possibility has a higher probability
• The most restrictive grammar just lists all possible utterances: only the observed data is grammatical, so it has a high probability
• A simple grammar could be made that allowed any sentences: the grammar would have a high probability, but the data a very low one
MDL finds a middle ground between always generalizing and never generalizing
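A toy calculation (all numbers invented) shows this trade-off: the total description length is grammar bits plus data bits, and neither extreme minimizes the sum.

```python
import math

n_observed = 100  # corpus size (assumed)

# (grammar coding length in bits, number of sentences the grammar allows)
candidates = {
    "list every observed utterance": (100 * 30.0, 100),  # ~30 bits per memorized sentence
    "allow any sentence at all":     (50.0, 10**9),      # tiny grammar, huge language
    "middle-ground grammar":         (400.0, 5000),      # generalizes, but not wildly
}

for name, (grammar_bits, language_size) in candidates.items():
    data_bits = n_observed * math.log2(language_size)  # uniform coding over the language
    print(f"{name}: {grammar_bits + data_bits:.0f} bits total")
```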
Rampant Synonymy?
• Inductive inference (Solomonoff, 1960a)
• Kolmogorov complexity (Kolmogorov, 1965)
• Minimum Message Length (Wallace and Boulton, 1968)
• Algorithmic Information Theory (Chaitin, 1969)
• Minimum Description Length (Rissanen, 1978)
• Minimum Coding Length (Ellison, 1992)
• Bayesian Learning (Stolcke, 1994)
• Minimum Representation Length (Brent, 1996)
Evaluation and Search
• MDL principle gives us an evaluation criterion for grammars (with respect to corpora)
• But it doesn’t solve the problem of how to find the grammars in the first place
Search mechanism needed
Two Learnability Problems
• How to determine which of two or more grammars is best given some data
• How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar
MDL in Linguistics
• Solomonoff (1960b): ‘Mechanization of Linguistic Learning’
• Learning phrase structure grammars for simple ‘toy’ languages: Stolcke (1994), Langley and Stromsten (2000)
• Or real corpora: Chen (1995), Grünwald (1994)
• Or for language modelling in speech recognition systems: Starkie (2001)
Not Just Syntax!
• Phonology: Ellison (1992), Rissanen and Ristad (1994)
• Morphology: Brent (1993), Goldsmith (2001)
• Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)
MDL and Parameter Setting
• Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter-setting learning mechanisms
MDL and Iterated Learning
• Briscoe (1999) used MDL as part of an expression-induction model
• Brighton (2002) investigated the effect of bottlenecks on an MDL learner
• Roberts et al. (2005) modelled lexical exceptions to syntactic rules
An Example: My Model
Learns simple phrase structure grammars
• Binary or non-branching rules:
  A → B C
  D → E
  F → tomato
• All derivations start from the special symbol S
• A null symbol in the third position indicates a non-branching rule
Encoding Grammars
Grammars can be coded as lists of symbols, three per rule
• The first symbol is the rule's left-hand side, the second and third its right-hand side
A, B, C, D, E, null, F, tomato, null
• First we have to encode the frequency of each symbol
• We must also record the frequency of each rule
1  S → NP VP     (3)
2  NP → john     (2)
3  NP → mary     (1)
4  VP → screamed (2)
5  VP → died     (1)
Encoding Data
Total frequency for S = 3
Total frequency for NP = 3
Total frequency for VP = 3
Data is coded as a sequence of rule choices: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3, …
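A sketch of the data-coding computation implied by this example (my reconstruction of the slide's arithmetic, not the model's actual code): each derivation step chooses a rule for the current left-hand side with probability frequency(rule) / frequency(left-hand side).

```python
import math

# Rule left-hand sides and frequencies, from the slide.
rule_lhs  = {1: "S", 2: "NP", 3: "NP", 4: "VP", 5: "VP"}
rule_freq = {1: 3, 2: 2, 3: 1, 4: 2, 5: 1}
lhs_total = {"S": 3, "NP": 3, "VP": 3}

data = [1, 2, 4, 1, 2, 5, 1, 3, 4]  # the derivations, as a sequence of rule choices

bits = 0.0
for rule in data:
    p = rule_freq[rule] / lhs_total[rule_lhs[rule]]  # e.g. rule 1: 3/3, rule 2: 2/3
    bits += -math.log2(p)
print(f"data coding length: {bits:.2f} bits")  # ~5.51 bits
```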
Encoding in My Model
1010100111010100101101010001100111100011010110
Symbol Frequencies
Rule Frequencies
Decoder
1 S NP VP2 NP john 3 NP mary4 VP screamed5 VP died
John screamedJohn diedMary Screamed
Grammar Data
S (1)NP (3)VP (3)john (1)mary (1)screamed (1)died (1)null (4)
Rule 1 3Rule 2 2Rule 3 1Rule 4 2Rule 5 1
Search Strategy
• Start with a simple grammar that allows all sentences
• Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at data coding length
• Second stage: look at overall evaluation
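In outline, the search might look like the following sketch: a generic hill climber over grammars (the model's actual annealing schedule and proposal operators are not reproduced here).

```python
def mdl_search(initial_grammar, data, evaluate, propose_change, steps=10000):
    """Greedy search over grammars: keep any change that lowers the total
    description length. An annealing variant would sometimes accept
    worsening moves early on, to escape local minima."""
    grammar = initial_grammar
    score = evaluate(grammar, data)  # grammar bits + data bits
    for _ in range(steps):
        candidate = propose_change(grammar)  # add/delete a rule, change a symbol...
        candidate_score = evaluate(candidate, data)
        if candidate_score < score:  # smaller description length is better
            grammar, score = candidate, candidate_score
    return grammar, score
```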
Example: English
Training data:
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary
Learned grammar:
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
Evaluations
[Bar chart: evaluation in bits (0 to 450) for the initial grammar and the learned grammar, showing grammar, data, and overall evaluation]
Dative Alternation
• Children learn the distinction between alternating and non-alternating verbs
• Previously unseen verbs are used productively in both constructions
New verbs follow regular pattern
• During learning children use non-alternating verbs in both constructions
U-shaped learning
Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam
Dative Evaluations
[Bar chart: evaluation in bits (0 to 3500) for the initial grammar and the learned grammar, showing grammar, data, and overall evaluation]
Grammar Properties
• Learned grammar distinguishes alternating and non-alternating verbs
• sent appears in alternating class
• With less data, only one class of verbs, so donated can appear in both constructions
• All sentences generated by the grammar are grammatical
• But structures are not right
Learned Structures
John gave a painting to Sam
[Tree diagram: the sentence parsed with preterminals NP, VA, DET, N, P, NP, grouped under non-terminals X, Y, Z and an NP, rooted in S]
• Why does the model place a newly seen verb in the regular class?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent
Regular and Irregular Rules
sent doesn’t alternate
sent alternates
Overall Evaluation (bits)
1703.6 1703.4
Grammar (bits)
322.2 321.0
Data (bits) 1381.4 1382.3
Regular constructions are preferred because the grammar is coded statistically
Why use Statistical Grammars?
• Statistics are a valuable source of information
• They help to infer when absences are due to chance
• The learned grammar predicted that sent should appear in the double object construction
• But in 150 sentences it was only seen in the prepositional dative construction
• With a non-statistical grammar we need an explanation as to why this is
• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
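A back-of-the-envelope check of the "absence due to chance" point (the rate of sent is an assumed figure, not from the talk): if sent is rare, seeing no double-object uses of it in 150 sentences is unsurprising.

```python
# Assumed figures: 'sent' occurs once per 150 sentences, and, if it
# alternated, half of its uses would be double-object datives.
p_double_object_sent = (1 / 150) * 0.5
p_absent = (1 - p_double_object_sent) ** 150  # chance of zero such uses in 150 sentences
print(f"P(no double-object 'sent' in 150 sentences) = {p_absent:.2f}")  # ~0.61
```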
Scaling Up: Onnis, Roberts and Chater (2003)
Causative alternation:
John cut the string
* The string cut
* John arrived the train
The train arrived
John bounced the ball
The ball bounced
Onnis et al’s Data
• Two word classes: N and V
• NV and VN are the only allowable sentences
• 16 verbs alternate: NV + VN
• 10 verbs: NV only
• 10 verbs: VN only
• The coding scheme marks non-alternating verbs as exceptional, at a cost in coding length
Onnis et al’s Results
< 16,000 sentences: all verbs alternate
> 16,000 sentences: non-alternating verbs are classified as exceptional
No search mechanism: they just looked at evaluations with and without exceptions (see the sketch below)
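A toy reconstruction with invented constants (not Onnis et al's actual coding scheme) of why exceptions only pay off with enough data: marking a verb as exceptional costs a fixed number of grammar bits, but saves a little data coding length on each of its uses, so the savings grow with corpus size.

```python
EXCEPTION_COST_BITS = 20.0   # one-off grammar cost per exceptional verb (assumed)
SAVING_PER_USE_BITS = 0.05   # data bits saved per occurrence (assumed)

def occurrences_per_verb(n_sentences: int) -> float:
    return n_sentences / 36  # 36 verbs share the corpus evenly (assumed)

for n in (10_000, 16_000, 25_000):
    saving = SAVING_PER_USE_BITS * occurrences_per_verb(n)
    verdict = "exceptions win" if saving > EXCEPTION_COST_BITS else "no exceptions"
    print(f"{n} sentences: {verdict}")
```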
In an expression-induction model, quasi-regularities appear as a result of chance omissions
MDL and MML Issues
• Numeric parameters: to what accuracy should they be coded?
• Bayes' optimal classification (not MAP learning): Monte Carlo methods
If we see a sentence, work out its probability under each grammar
A weighted sum gives the probability of the sentence (see the sketch below)
• Unseen data: zero probability?
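A sketch of that weighted-sum computation, with hypothetical grammars and probabilities: instead of keeping only the single best (MAP) grammar, each grammar's prediction is weighted by its posterior.

```python
def sentence_probability(sentence, weighted_grammars):
    """weighted_grammars: (posterior, predict) pairs, where predict(sentence)
    returns P(sentence | grammar). Posteriors are assumed normalized; in
    practice they might be estimated by Monte Carlo sampling."""
    return sum(post * predict(sentence) for post, predict in weighted_grammars)

# Hypothetical example: two grammars disagree about an unseen sentence.
grammars = [(0.7, lambda s: 0.01),  # this grammar generates the sentence
            (0.3, lambda s: 0.0)]   # this one rules it out
print(sentence_probability("who do you think will arrive first", grammars))  # 0.007
```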
One and Two Part Codes
[Diagram: two-part code: a single bit string containing the grammar followed by the data coded in terms of that grammar; the decoder reads both parts]
[Diagram: one-part code: the data and grammar are combined in one bit string; the decoder recovers the grammar and the data together]
Coding English Texts
The grammar is a frequency for each letter and for space
• Counts start at one
• We decode a series of letters, and update the count for each letter
• All letters are coded in terms of their probabilities at that point in the decoding
• At the end we have a decoded text and a grammar
Decoding Example
Letter   Count  Count  Count  Count
A        1      2      2      2
B        1      1      2      3
C        1      1      1      1
Space    1      1      1      1

Decoded string: A (P = 1/4), B (P = 1/5), B (P = 2/6)
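The same scheme can be written down directly; this sketch computes the total coding length of a string under the counts-start-at-one scheme above, and reproduces the probabilities in the table.

```python
import math
from collections import Counter

def adaptive_code_length(text: str, alphabet: str) -> float:
    counts = Counter({ch: 1 for ch in alphabet})  # counts start at one
    total = len(alphabet)
    bits = 0.0
    for ch in text:
        bits += -math.log2(counts[ch] / total)  # code at the current probability
        counts[ch] += 1                          # then update, as the decoder would
        total += 1
    return bits

# Matches the table: A costs log2(4), the first B log2(5), the second B log2(6/2).
print(f"{adaptive_code_length('ABB', 'ABC '):.2f} bits")  # ~5.91
```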
One-Part Grammars
Grammars can also be coded using one-part codes
• Start with no grammar, but have a probability associated with adding a new rule
• Each time we decode data we first choose either to add a new rule or to use an existing one
Examples are Dowman (2000) and Venkataraman (1997)
Conclusions
MDL can solve the poverty of the stimulus problem
But it doesn't solve the problem of constraining the search for grammars
Coding schemes create learning biases
Statistical grammars and statistical coding of grammars can help learning