Transcript of slides: Unsupervised Approaches to Sequence Tagging, Morphology Induction, and Lexical Resource Acquisition
Reza Bosaghzadeh & Nathan Schneider
LS2 ~ 1 December 2008
Unsupervised methods
– Sequence labeling (part-of-speech tagging):
  She ran to the station quickly
  pronoun verb preposition det noun adverb
– Morphology induction:
  un-supervise-d learn-ing
– Lexical resource acquisition
Part of Speech (POS) Tagging
• Prototype-driven model (Haghighi and Klein 2006)
  – Cluster words based on a few prototype examples
• Contrastive estimation (Smith and Eisner 2005)
  – Shift probability mass from implicit negative examples to the given positive examples
Prototype-Driven Tagging: Overview
[Diagram: a prototype list (a few example words per target label) + unlabeled data → annotated data]
(Haghighi and Klein 2006)
Prototypes
Information Extraction: Classified Ads

Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.

Prototype list (labels: Features, Location, Terms, Restrict, Size):
  FEATURE: kitchen, laundry
  LOCATION: near, close
  TERMS: paid, utilities
  SIZE: large, feet
  RESTRICT: cat, smoking

(Haghighi and Klein 2006)
Prototypes
English POS

Tag inventory: NN, VBN, CC, JJ, CD, PUNC, IN, NNS, NNP, RB, DET

Prototype list:
  NN: president     IN: of
  VBD: said         NNS: shares
  CC: and           TO: to
  NNP: Mr.          PUNC: .
  JJ: new           CD: million
  DET: the          VBP: are

(Haghighi and Klein 2006)
Where to get Prototypes
• 3 prototypes per tag
• Automatically extracted by frequency
(Haghighi and Klein 2006)
Prototypes
• Tie each word to its most similar prototype
• Features are the same as in (Smith & Eisner 2005): trigram tagger; word type, suffixes up to length 3, contains-hyphen, contains-digit, initial capitalization
• Similarity (Schütze 1993):
  – SVD dimensionality reduction
  – cos() between context vectors
(Haghighi and Klein 2006)
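The similarity computation above can be sketched in a few lines. This is a minimal illustration with a tiny made-up co-occurrence matrix, not Haghighi and Klein's actual feature pipeline:

```python
import numpy as np

def context_similarity(cooc, k=2):
    """Reduce a word-by-context co-occurrence matrix with a rank-k SVD,
    then return the matrix of pairwise cosine similarities between the
    words' reduced context vectors."""
    U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
    reduced = U[:, :k] * S[:k]            # each row: a word's reduced context vector
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    unit = reduced / np.maximum(norms, 1e-12)
    return unit @ unit.T                  # cosine similarity matrix

# toy data: rows are words, columns are context features
cooc = np.array([
    [5., 1., 0.],   # "state"
    [4., 2., 0.],   # "nation" (contexts similar to "state")
    [0., 1., 6.],   # "ran"    (very different contexts)
])
sims = context_similarity(cooc)
```

With such a similarity matrix in hand, each non-prototype word would be tied to the prototype whose context vector it is closest to.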
Prototypes – Language Variation
Pros
• Can get prototypes for popular languages
• Doesn't require a tagging dictionary
Cons
• Needs a tag set
• Needs prototypes for each tag
Contrastive Estimation
• Already discussed in class
• Key idea:
  – Mutating training examples often yields ungrammatical (negative) sentences
  – During training, shift probability mass from the generated negative examples to the given positive examples
(Smith and Eisner 2005)
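As an illustration of the mutation step, one of Smith and Eisner's neighborhood functions, TRANS1, perturbs a sentence by transposing each pair of adjacent words; a minimal sketch:

```python
def transposition_neighborhood(sentence):
    """Generate the TRANS1 neighborhood: every sentence obtained by
    swapping one pair of adjacent words. Most of the perturbed sentences
    are ungrammatical and serve as implicit negative evidence."""
    words = sentence.split()
    neighbors = []
    for i in range(len(words) - 1):
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.append(" ".join(swapped))
    return neighbors

neighbors = transposition_neighborhood("She ran to the station quickly")
```

During training, the model's probability mass over each neighborhood is pushed toward the observed (positive) sentence and away from its mutated neighbors.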
Unsupervised POS Tagging: Current Results
Best supervised results (CRFs): 99.5%!
Unsupervised methods
– Sequence labeling (part-of-speech tagging):
  She ran to the station quickly
  pronoun verb preposition det noun adverb
– Morphology induction:
  un-supervise-d learn-ing
– Lexical resource acquisition
Unsupervised Approaches to Morphology
• Morphology refers to the internal structure of words
  – A morpheme is a minimal meaningful linguistic unit
  – Morpheme segmentation is the process of dividing words into their component morphemes:
    un-supervise-d learn-ing
  – Word segmentation is the process of finding word boundaries in a stream of speech or text:
    unsupervisedlearningofnaturallanguage → unsupervised learning of natural language
ParaMor: Monson et al. (2007, 2008)
• Learns inflectional paradigms from raw text
  – Requires only the vocabulary of a corpus
  – Looks at word counts of substrings, and proposes (stem, suffix) pairings based on type frequency
• 3-stage algorithm
  – Stage 1: candidate paradigms based on frequencies
  – Stages 2–3: refinement of the paradigm set via merging and filtering
• Paradigms can be used for morpheme segmentation or stemming
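The stage-1 idea can be sketched as follows. `candidate_paradigms` is an illustrative helper, not ParaMor's actual search procedure: it enumerates stem+suffix splits of every word and keeps suffix sets supported by multiple stems in a toy Spanish vocabulary.

```python
from collections import defaultdict

def candidate_paradigms(vocabulary, min_stems=2, max_suffix_len=4):
    """For every split of each word into stem + suffix (suffix may be
    empty), collect the suffixes seen with each stem; then group stems
    by the suffix set they support. Suffix sets shared by several stems
    become candidate paradigms."""
    suffixes_of = defaultdict(set)
    for word in vocabulary:
        for cut in range(max(1, len(word) - max_suffix_len), len(word) + 1):
            suffixes_of[word[:cut]].add(word[cut:])
    stems_of = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        stems_of[frozenset(sufs)].add(stem)
    return {sufs: stems for sufs, stems in stems_of.items()
            if len(stems) >= min_stems and len(sufs) > 1}

vocab = ["hablar", "hablo", "hablamos", "hablan",
         "bailar", "bailo", "bailamos", "bailan"]
paradigms = candidate_paradigms(vocab)
```

On this vocabulary the correct paradigm (stems {habl, bail}, suffixes {-ar, -o, -amos, -an}) is proposed, but so are spurious alternatives such as stems {hab, bai} with suffixes {-lar, -lo, -lan}, which is exactly why ParaMor needs its later merging and filtering stages.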
ParaMor: Monson et al. (2007, 2008)
• A sampling of Spanish verb conjugations:

      speak      dance
      hablar     bailar
      hablo      bailo
      hablamos   bailamos
      hablan     bailan
      …          …
ParaMor: Monson et al. (2007, 2008)
• A proposed paradigm with stems {habl, bail} and suffixes {-ar, -o, -amos, -an}
ParaMor: Monson et al. (2007, 2008)
      speak      dance      buy
      hablar     bailar     comprar
      hablo      bailo      compro
      hablamos   bailamos   compramos
      hablan     bailan     compran
      …          …          …

• The same paradigm as on the previous slide, but with stems {habl, bail, compr}
ParaMor: Monson et al. (2007, 2008)
• From just this list, other paradigm analyses (which happen to be incorrect) are possible
ParaMor: Monson et al. (2007, 2008)
• Another possibility: stems {hab, bai}, suffixes {-lar, -lo, -lamos, -lan}
ParaMor: Monson et al. (2007, 2008)
• Spurious segmentations: this paradigm doesn't generalize to comprar (or most verbs)
ParaMor: Monson et al. (2007, 2008)
• What if not all conjugations were in the corpus?

      speak      dance      buy
      hablar     bailar     comprar
      (missing)  bailo      compro
      hablamos   bailamos   compramos
      hablan     (missing)  (missing)
      …          …          …
ParaMor: Monson et al. (2007, 2008)
• We have two similar paradigms that we want to merge
ParaMor: Monson et al. (2007, 2008)
• This amounts to smoothing, or "hallucinating" out-of-vocabulary items
ParaMor: Monson et al. (2007, 2008)
• A heuristic-based, deterministic algorithm can learn inflectional paradigms from raw text
• Paradigms can be used straightforwardly to predict segmentations
  – Combining the outputs of ParaMor and Morfessor (another system) won the segmentation task at MorphoChallenge 2008 for every language: English, Arabic, Turkish, German, and Finnish
• Currently, ParaMor assumes suffix-based morphology
Word Segmentation Results: Comparison
[Chart comparing the Goldwater Unigram DP and Goldwater Bigram HDP models]
• See Narges & Andreas's presentation for more on this model
Goldwater et al. (2006; in submission)
Multilingual Morpheme Segmentation: Snyder & Barzilay (2008)

      speak (ES)   speak (FR)
      hablar       parler
      hablo        parle
      hablamos     parlons
      hablan       parlent
      …            …

• Abstract morphemes cross languages: (ar, er), (o, e), (amos, ons), (an, ent), (habl, parl)
• Considers parallel phrases and tries to find morpheme correspondences
• Stray morphemes don't correspond across languages
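The abstract-morpheme idea can be illustrated crudely: strip each language's shared stem and pair up the leftover suffixes row by row. This sketch (with `suffix_correspondences` as a made-up helper using the longest common prefix as a stand-in for the stem) is far simpler than the paper's Bayesian model:

```python
import os

def suffix_correspondences(column_a, column_b):
    """Given two aligned columns of inflected forms (one per language),
    take each column's longest common prefix as a crude stem and pair
    the leftover suffixes row by row, yielding abstract morphemes."""
    stem_a = os.path.commonprefix(column_a)
    stem_b = os.path.commonprefix(column_b)
    pairs = [(a[len(stem_a):], b[len(stem_b):])
             for a, b in zip(column_a, column_b)]
    return (stem_a, stem_b), pairs

stems, pairs = suffix_correspondences(
    ["hablar", "hablo", "hablamos", "hablan"],
    ["parler", "parle", "parlons", "parlent"])
```

On the slide's example this recovers the abstract morphemes (habl, parl), (ar, er), (o, e), (amos, ons), (an, ent); the real model must instead infer such correspondences jointly from noisy parallel phrases.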
Morphology papers: inputs & outputs
Unsupervised methods
– Sequence labeling (part-of-speech tagging):
  She ran to the station quickly
  pronoun verb preposition det noun adverb
– Morphology induction:
  un-supervise-d learn-ing
– Lexical resource acquisition
Lexical Resource Acquisition
• Learning bilingual lexicons from monolingual corpora (Haghighi et al. 2008)
• Narrative event chain acquisition (Chambers and Jurafsky 2008)
• Labeling semantic roles for verbs (Grenager and Manning 2006)
Learning Bilingual Lexicons from Monolingual Corpora
[Diagram: a matching m between source words s (state, world, name, nation) and target words t (estado, política, mundo, nombre)]
(Haghighi et al. 2008)
Bilingual Lexicons: Data Representation

Source text word: state
  Orthographic features: #st 1.0, tat 1.0, te# 1.0
  Context features: world 5.0, politics 20.0, society 10.0

Target text word: estado
  Orthographic features: #es 1.0, sta 1.0, do# 1.0
  Context features: mundo 10.0, politica 17.0, sociedad 6.0

(Haghighi et al. 2008)
• Uses CCA (Canonical Correlation Analysis), trained with hard EM, to find the best matching
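A toy version of the matching step might look like the following. The CCA projection is omitted, and the vectors and greedy matcher are illustrative stand-ins, not the paper's model: in the real system both sides are first projected into a shared space learned with CCA, and the matching is re-estimated each hard-EM iteration.

```python
import numpy as np

def greedy_match(source_vecs, target_vecs):
    """Greedily pair each source word with its most similar unmatched
    target word by cosine similarity, picking pairs in order of
    decreasing similarity."""
    def unit(m):
        n = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.maximum(n, 1e-12)
    sims = unit(source_vecs) @ unit(target_vecs).T
    matching, used = {}, set()
    for i, j in sorted(np.ndindex(*sims.shape), key=lambda ij: -sims[ij]):
        if i not in matching and j not in used:
            matching[i] = j
            used.add(j)
    return matching

# toy feature vectors, pretending both sides already live in a shared space
src = np.array([[5., 20., 10.], [1., 0., 9.]])   # "state", "world"
tgt = np.array([[2., 1., 8.], [6., 17., 9.]])    # "mundo", "estado"
matching = greedy_match(src, tgt)
```

Here "state" pairs with "estado" and "world" with "mundo", since each pair's context vectors point in nearly the same direction.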
Feature Experiments

Precision on 4k EN-ES Wikipedia articles:
  Edit Dist: 61.1
  Ortho: 80.1
  Context: 80.2
  MCCA: 89.0

• MCCA: orthographic and context features
(Haghighi et al. 2008)
Narrative events: Chambers & Jurafsky (2008)
• Given a corpus, identifies related events that constitute a "narrative" and (when possible) predicts their typical temporal ordering
  – E.g., a CRIMINAL PROSECUTION narrative, with verbs arrest, accuse, plead, testify, acquit/convict
• Key insight: related events tend to share a participant within a document
  – The common participant may fill different syntactic/semantic roles with respect to the verbs: arrest.OBJECT, accuse.OBJECT, plead.SUBJECT
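The shared-participant insight suggests a simple co-occurrence statistic. The sketch below scores pairs of (verb, role) events by pointwise mutual information over shared participants in toy documents; it is a rough stand-in for the paper's counts over a parsed, coreference-resolved corpus, and the document format is invented for illustration.

```python
import math
from collections import Counter

def narrative_pmi(docs):
    """Score pairs of (verb, role) events by PMI over how often the two
    events share a participant in the same document. Each doc maps an
    entity to the set of events it participates in."""
    pair_counts, event_counts = Counter(), Counter()
    for doc in docs:
        for entity, events in doc.items():
            events = sorted(events)
            for e in events:
                event_counts[e] += 1
            for i in range(len(events)):
                for j in range(i + 1, len(events)):
                    pair_counts[(events[i], events[j])] += 1
    total = sum(pair_counts.values())
    n = sum(event_counts.values())
    return {(a, b): math.log((c / total) /
                             ((event_counts[a] / n) * (event_counts[b] / n)))
            for (a, b), c in pair_counts.items()}

docs = [
    {"Smith": {"arrest.OBJ", "accuse.OBJ", "plead.SUBJ"}},
    {"Jones": {"arrest.OBJ", "accuse.OBJ"}, "rain": {"fall.SUBJ"}},
]
scores = narrative_pmi(docs)
```

Event pairs that repeatedly share a participant (here arrest.OBJ and accuse.OBJ) receive high scores, which is how candidate narrative chains can be grown.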
Narrative events: Chambers & Jurafsky (2008)
• A temporal classifier can reconstruct pairwise canonical event orderings, producing a directed graph for each narrative
Semantic Roles: Grenager & Manning (2006)
• From dependency parses, a generative model predicts semantic roles corresponding to each verb's arguments, as well as their syntactic realizations
  – PropBank-style: ARG0, ARG1, etc. per verb (roles do not necessarily correspond across verbs)
  – Learned syntactic patterns of the form: (subj=give.ARG0, verb=give, np#1=give.ARG2, np#2=give.ARG1) or (subj=give.ARG0, verb=give, np#2=give.ARG1, pp_to=give.ARG2)
• Used for semantic role labeling
"Semanticity": Our Proposed Scale of Semantic Richness
• text < POS < syntax/morphology/alignments < coreference/semantic roles/temporal ordering < translations/narrative event sequences
• We score each model's inputs and outputs on this scale, and call the input-to-output increase "semantic gain"
  – Haghighi et al.'s bilingual lexicon induction wins in this respect, going from raw text to lexical translations
Robustness to Language Variation
• About half of the papers we examined had English-only evaluations
• We considered which techniques are most adaptable to other (esp. resource-poor) languages. Two main factors:
  – Reliance on existing tools/resources for preprocessing (parsers, coreference resolvers, …)
  – Any linguistic specificity in the model (e.g., suffix-based morphology)
Summary
We examined three areas of unsupervised NLP:
1. Sequence tagging: How can we predict POS (or topic) tags for words in sequence?
2. Morphology: How are words put together from morphemes (and how can we break them apart)?
3. Lexical resources: How can we identify lexical translations, semantic roles and argument frames, or narrative event sequences from text?
In eight recent papers we found a variety of approaches, including heuristic algorithms, Bayesian methods, and EM-style techniques.
Thanks to Noah and Kevin for their feedback on the paper; Andreas and Narges for their collaboration on the presentations; and all of you for giving us your attention!
Questions?