1
Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation
Master's Thesis: Saif Mohammad    Advisor: Dr. Ted Pedersen
University of Minnesota, Duluth    Date: August 1, 2003
2
Path Map
Introduction Background Data Experiments Conclusions
3
Word Sense Disambiguation
Harry cast a bewitching spell
Humans immediately understand spell to mean a charm or incantation,
not a reading out letter by letter, or a period of time
Words with multiple senses: polysemy, ambiguity
Humans utilize background knowledge and context
Machines lack background knowledge
Automatically identifying the intended sense of a word in written text, based on its context, remains a hard problem
Features are identified from the context
Best accuracies in the latest international evaluation exercise: around 65%
4
Why do we need WSD?
Information Retrieval
Query: cricket bat
Documents pertaining to the insect and the mammal are irrelevant
Machine Translation
Consider English to Hindi translation:
head to sar (upper part of the body) or adhyaksh (leader)
Machine-Human Interaction
Instructions to machines
Interactive home system: turn on the lights
Domestic android: get the door
Applications are widespread and will affect our way of life
5
Terminology
Harry cast a bewitching spell
Target word: the word whose intended sense is to be identified
spell
Context: the sentence housing the target word and possibly 1 or 2 sentences around it
Harry cast a bewitching spell
Instance: the target word along with its context
WSD is a classification problem wherein an occurrence of the target word is assigned to one of its many possible senses
6
Corpus-Based Supervised Machine Learning
"A computer program is said to learn from experience … if its performance at tasks … improves with experience."
- Mitchell
Task: Word Sense Disambiguation of given test instances
Performance: ratio of instances correctly disambiguated to the total test instances - accuracy
Experience: manually created instances in which target words are marked with the intended sense - training instances
Harry cast a bewitching spell / incantation
7
Path Map
Introduction Background Data Experiments Conclusions
8
Decision Trees
A kind of classifier
Assigns a class by asking a series of questions
Questions correspond to features of the instance
Question asked depends on answer to previous question
Inverted tree structure of interconnected nodes
Topmost node is called the root
Each node corresponds to a question / feature
Each possible value of the feature has a corresponding branch
Leaves terminate every path from the root
Each leaf is associated with a class
9
Automating Toy Selection for Max
[Figure: a decision tree for classifying toys. The root node asks "Moving Parts?"; internal nodes ask "Color?", "Size?", and "Car?"; branches carry feature values (Yes / No, Blue / Red / Other, Big / Small); each leaf holds a class: LOVE, SO SO, or HATE. Labels mark the root, nodes, and leaves.]
10
WSD Tree
[Figure: a decision tree for WSD. The root asks "Feature 1?"; internal nodes ask about Features 2, 3, and 4; branches carry the feature values 0 and 1; leaves assign SENSE 1 through SENSE 4.]
11
Issues…
Why use decision trees for WSD?
How are decision trees learnt?
ID3 and C4.5 algorithms
What is bagging and what are its advantages?
Drawbacks of decision trees and of bagging (a small sketch follows)
Pedersen [2002]: choosing the right features is of greater significance than the learning algorithm itself
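As a rough illustration of the learners named above, here is a minimal sketch using scikit-learn; note its trees are CART-based rather than ID3/C4.5, and the feature vectors and sense labels below are toy stand-ins, not the thesis data.

    # Learn a decision tree over binary lexical features, then bag it.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Toy instances: each row marks presence/absence of three features
    # (e.g., particular unigrams); labels are hypothetical sense ids.
    X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    y = ["incantation", "period", "incantation", "letters"]

    tree = DecisionTreeClassifier().fit(X, y)
    # Bagging trains several trees on bootstrap resamples and votes.
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10).fit(X, y)

    print(tree.predict([[1, 0, 0]]), bagged.predict([[1, 0, 0]]))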
12
Lexical Features
Surface form: a word as we observe it in text
Case (n):
1. object of investigation  2. frame or covering  3. a weird person
Surface forms: case, cases, casing
An occurrence of casing suggests sense 2
Unigrams and Bigrams: one-word and two-word sequences in text
The interest rate is low
Unigrams: the, interest, rate, is, low
Bigrams: the interest, interest rate, rate is, is low
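A minimal sketch of extracting these lexical features from a context, matching the example above:

    def lexical_features(text):
        # Unigrams: individual tokens; bigrams: adjacent token pairs.
        tokens = text.lower().split()
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        return tokens, bigrams

    unigrams, bigrams = lexical_features("The interest rate is low")
    print(unigrams)  # ['the', 'interest', 'rate', 'is', 'low']
    print(bigrams)   # ['the interest', 'interest rate', 'rate is', 'is low']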
13
Part of Speech Tagging
A pre-requisite for many Natural Language Tasks:
parsing, WSD, anaphora resolution
Brill Tagger - the most widely used tool
Accuracy around 95%
Source code available
Easily understood rules
Harry/NNP cast/VBD a/DT bewitching/JJ spell/NN
NNP: proper noun, VBD: verb, past tense, DT: determiner, NN: noun
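For comparison, a minimal sketch producing Penn Treebank tags with NLTK's default tagger (an assumed stand-in here; the thesis used the Brill Tagger, and NLTK resource names vary across versions):

    import nltk

    nltk.download("punkt", quiet=True)                       # tokenizer models
    nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

    tokens = nltk.word_tokenize("Harry cast a bewitching spell")
    print(nltk.pos_tag(tokens))
    # e.g. [('Harry', 'NNP'), ('cast', 'VBD'), ('a', 'DT'),
    #       ('bewitching', 'JJ'), ('spell', 'NN')]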
14
Pre-Tagging
Pre-tagging is the act of manually assigning tags to selected words in a text prior to tagging
Mona will sit in the pretty chair//NN this time
chair is the pre-tagged word, NN is its pre-tag
Pre-tags act as reliable anchors or seeds around which tagging is done
The Brill Tagger facilitates pre-tagging, but the pre-tag is not always respected!
Mona/NNP will/MD sit/VB in/IN the/DT pretty/RB chair//VB this/DT time/NN
15
Contextual Rules
Initial state tagger: assigns the most frequent tag for a type based on entries in a lexicon (pre-tag respected)
Final state tagger: may modify the tag of a word based on context (pre-tag not given special treatment); a sketch follows

Relevant lexicon entries:
Type     Most frequent tag   Other possible tags
chair    NN (noun)           VB (verb)
pretty   RB (adverb)         JJ (adjective)

Relevant contextual rules:
Current tag   New tag   When
NN            VB        NEXTTAG DT
RB            JJ        NEXTTAG NN
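A minimal sketch of this two-stage scheme, with a toy lexicon and the two rules above (not the actual Brill Tagger implementation); as in Brill tagging, each contextual rule is applied over the whole sentence in turn:

    LEXICON = {"mona": "NNP", "will": "MD", "sit": "VB", "in": "IN",
               "the": "DT", "pretty": "RB", "chair": "NN", "this": "DT",
               "time": "NN"}                    # most frequent tag per type
    RULES = [("NN", "VB", "DT"),                # NN -> VB when NEXTTAG DT
             ("RB", "JJ", "NN")]                # RB -> JJ when NEXTTAG NN

    def tag(tokens, pretags=None):
        pretags = pretags or {}                 # {position: pre-tag}
        # Initial state tagger: most frequent tag; pre-tags respected.
        tags = [pretags.get(i, LEXICON.get(w.lower(), "NN"))
                for i, w in enumerate(tokens)]
        # Final state tagger: pre-tags get no special treatment.
        for cur, new, nxt in RULES:
            for i in range(len(tags) - 1):
                if tags[i] == cur and tags[i + 1] == nxt:
                    tags[i] = new
        return list(zip(tokens, tags))

    sent = "Mona will sit in the pretty chair this time".split()
    print(tag(sent, pretags={6: "NN"}))
    # chair ends up VB: the NN -> VB rule overrides its pre-tag, and pretty
    # then stays RB, reproducing the output on the Pre-Tagging slide.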
16
Guaranteed Pre-Tagging
A patch to the tagger is provided - BrillPatch
Application of contextual rules to pre-tagged words is bypassed
Application of contextual rules to non pre-tagged words is unchanged
Mona/NNP will/MD sit/VB in/IN the/DT pretty/JJ chair//NN this/DT time/NN
Tag of chair retained as NN: the contextual rule changing chair from NN to VB is not applied
Tag of pretty transformed: the contextual rule changing pretty from RB to JJ is applied
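A sketch of the guaranteed variant, reusing LEXICON and RULES from the previous sketch; the only change is that contextual rules are skipped at pre-tagged positions:

    def tag_guaranteed(tokens, pretags):
        tags = [pretags.get(i, LEXICON.get(w.lower(), "NN"))
                for i, w in enumerate(tokens)]
        for cur, new, nxt in RULES:
            for i in range(len(tags) - 1):
                if i in pretags:        # bypass contextual rules for pre-tags
                    continue
                if tags[i] == cur and tags[i + 1] == nxt:
                    tags[i] = new
        return list(zip(tokens, tags))

    sent = "Mona will sit in the pretty chair this time".split()
    print(tag_guaranteed(sent, {6: "NN"}))
    # chair (position 6) keeps NN, and pretty now becomes JJ via RB -> JJ,
    # matching the output above.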
17
Part of Speech Features
A word in different parts of speech has different senses
A word used in different senses is likely to have different sets of POS around it
Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/VBN at/IN the/DT crossing
Features used:
Individual word POS: P-2, P-1, P0, P1, P2
P2 = JJ implies the word two positions to the right of the target is an adjective
Sequential POS: P-1P0, P-1P0P1, and so on
P-1P0 = NN, VB implies P-1 is a noun and P0 is a verb
A combination of the above
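A minimal sketch of building these features from a POS-tagged sentence (boundary positions padded with a dummy tag):

    def pos_features(tags, target, width=2, pad="NONE"):
        padded = [pad] * width + tags + [pad] * width
        i = target + width
        feats = {f"P{k}": padded[i + k] for k in range(-width, width + 1)}
        feats["P-1P0"] = feats["P-1"] + "," + feats["P0"]  # sequential POS
        return feats

    tags = ["NNP", "VBD", "DT", "JJ", "NN"]   # Harry cast a bewitching spell
    print(pos_features(tags, target=4))        # target word: spell (index 4)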
18
Parse Features
Collins Parser used to parse the data
Source code available
Uses part of speech tagged data as input
Head word of a phrase: the hard work, the hard surface
Phrase itself: noun phrase, verb phrase and so on
Parent: head word of the parent phrase - fasten the line, cross the line
Parent phrase
19
Sample Parse Tree
[Figure: parse tree for "Harry cast a bewitching spell". SENTENCE splits into NOUN PHRASE (Harry/NNP) and VERB PHRASE (cast/VBD followed by the NOUN PHRASE a/DT bewitching/JJ spell/NN).]
20
Path Map
Introduction Background Data Experiments Conclusions
21
Sense-Tagged Data
Senseval-2 data
4,328 test instances and 8,611 training instances, ranging over 73 different nouns, verbs and adjectives
Senseval-1 data
8,512 test instances and 13,276 training instances, ranging over 35 nouns, verbs and adjectives
line, hard, interest, serve data
4,149, 4,337, 4,378 and 2,476 sense-tagged instances with line, hard, serve and interest as the head words
Around 50,000 sense-tagged instances in all!
22
Data Processing
Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats
refine: preprocesses data in Senseval-2 data format to make it suitable for tagging
Restores one sentence per line and one line per sentence, pre-tags the target words, splits long sentences
posSenseval: part of speech tags any data in Senseval-2 data format
Brill Tagger along with Guaranteed Pre-tagging utilized
parseSenseval: parses data in the format output by the Brill Tagger and restores XML tags, creating a parsed file in Senseval-2 data format
Uses the Collins Parser
23
Sample line data instance
Original instance:
art} aphb 01301041: " There's none there . " He hurried outside to see if there were any dry ones on the line .

Senseval-2 data format:
<instance id="line-n.art} aphb 01301041:">
<answer instance="line-n.art} aphb 01301041:" senseid="cord"/>
<context>
<s> " There's none there . " </s> <s> He hurried outside to see if there were any dry ones on the <head>line</head> . </s>
</context>
</instance>
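A minimal sketch of reading one instance in this format with the standard library; it uses the simpler harry instance from the next slide, since the line instance id above contains characters a strict XML parser rejects:

    import xml.etree.ElementTree as ET

    xml = """<instance id="harry"><answer instance="harry"
    senseid="incantation"/><context>Harry cast a bewitching
    <head>spell</head></context></instance>"""

    inst = ET.fromstring(xml)
    sense = inst.find("answer").get("senseid")
    target = inst.find("context/head").text
    print(inst.get("id"), sense, target)   # harry incantation spell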
24
Sample Output from parseSenseval
Before parsing:
<instance id="harry"><answer instance="harry" senseid="incantation"/>
<context>Harry cast a bewitching <head>spell</head></context></instance>

After parsing:
<instance id="harry"><answer instance="harry" senseid="incantation"/>
<context>
<P="TOP~cast~1~1"> <P="S~cast~2~2"> <P="NPB~Potter~2~2">
Harry <p="NNP"/> </P> <P="VP~cast~2~1"> cast <p="VB"/>
<P="NPB~spell~3~3"> a <p="DT"/> bewitching <p="JJ"/> spell <p="NN"/> </P> </P> </P>
</P> </context></instance>
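The exact meaning of every ~-separated field is not spelled out here, but the phrase label and head word can be read off the front of each P tag; a minimal sketch:

    def decode_node(node):               # e.g. "NPB~spell~3~3"
        fields = node.split("~")
        return {"phrase": fields[0], "head": fields[1]}

    print(decode_node("NPB~spell~3~3"))  # {'phrase': 'NPB', 'head': 'spell'}
    print(decode_node("VP~cast~2~1"))    # {'phrase': 'VP', 'head': 'cast'}

For the target spell, this yields the phrase (NPB), the head word (spell), and, one level up, the parent head (cast) used as parse features.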
25
Issues…
How is the target word identified in the line, hard and serve data?
How is the data tokenized for better quality POS tagging and parsing?
How is the data pre-tagged?
How is the parse output of the Collins Parser interpreted?
How is the parsed output XML'ized and brought back to Senseval-2 data format?
Idiosyncrasies of the line, hard, serve, interest, Senseval-1 and Senseval-2 data and how they are handled
26
Path Map
Introduction Background Data Experiments Conclusions
27
Surface Forms: Senseval-1 & Senseval-2

              Senseval-2   Senseval-1
Majority      47.7%        56.3%
Surface Form  49.3%        62.9%
Unigrams      55.3%        66.9%
Bigrams       55.1%        66.9%
28
Individual Word POS (Senseval-1)
          All     Nouns   Verbs   Adj.
Majority  56.3%   57.2%   56.9%   64.3%
P-2       57.5%   58.2%   58.6%   64.0%
P-1       59.2%   62.2%   58.2%   64.3%
P0        60.3%   62.5%   58.2%   64.3%
P1        63.9%   65.4%   64.4%   66.2%
P2        59.9%   60.0%   60.8%   65.2%
29
Individual Word POS (Senseval-2)
          All     Nouns   Verbs   Adj.
Majority  47.7%   51.0%   39.7%   59.0%
P-2       47.1%   51.9%   38.0%   57.9%
P-1       49.6%   55.2%   40.2%   59.0%
P0        49.9%   55.7%   40.6%   58.2%
P1        53.1%   53.8%   49.1%   61.0%
P2        48.9%   50.2%   43.2%   59.4%
30
Combining POS Features
                      Senseval-2   Senseval-1   line
Majority              47.7%        56.3%        54.3%
P0, P1                54.3%        66.7%        54.1%
P-1, P0, P1           54.6%        68.0%        60.4%
P-2, P-1, P0, P1, P2  54.6%        67.8%        62.3%
31
Effect of Guaranteed Pre-tagging on WSD

                      Senseval-1            Senseval-2
                      Guar. P.   Reg. P.    Guar. P.   Reg. P.
P-1, P0               62.2%      62.1%      50.8%      50.9%
P0, P1                66.7%      66.7%      54.3%      53.8%
P-1, P0, P1           68.0%      67.6%      54.6%      54.7%
P-1P0, P0P1           66.7%      66.3%      54.0%      53.7%
P-2, P-1, P0, P1, P2  67.8%      66.1%      54.6%      54.1%
32
Parse Features (Senseval-1)
           All     Nouns   Verbs   Adj.
Majority   56.3%   57.2%   56.9%   64.3%
Head       64.3%   70.9%   59.8%   66.9%
Parent     60.6%   62.6%   60.3%   65.8%
Phrase     58.5%   57.5%   57.2%   66.2%
Par. Phr.  57.9%   58.1%   58.3%   66.2%
33
Parse Features (Senseval-2)
           All     Nouns   Verbs   Adj.
Majority   47.7%   51.0%   39.7%   59.0%
Head       51.7%   58.5%   39.8%   64.0%
Parent     50.0%   56.1%   40.1%   59.3%
Phrase     48.3%   51.7%   40.3%   59.5%
Par. Phr.  48.5%   53.0%   39.1%   60.3%
34
Thoughts…
Both lexical and syntactic features perform comparably
But do they get the same instances right?
How redundant are the individual feature sets?
Are there instances correctly disambiguated by one feature set and not by the other?
How complementary are the individual feature sets?
Is the effort to combine lexical and syntactic features justified?
35
Measures
Baseline Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly only if both individual feature sets do so
Quantifies redundancy amongst feature sets
Optimal Ensemble: accuracy of a hypothetical ensemble which predicts the sense correctly if either of the individual feature sets does so
The difference from the individual accuracies quantifies complementarity
We used a simple ensemble which sums up the probabilities assigned to each sense by the individual feature sets to decide the intended sense
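A minimal sketch of the three measures over hypothetical predictions from two classifiers:

    def baseline_acc(gold, pred1, pred2):      # both feature sets correct
        return sum(g == p and g == q
                   for g, p, q in zip(gold, pred1, pred2)) / len(gold)

    def optimal_acc(gold, pred1, pred2):       # either feature set correct
        return sum(g == p or g == q
                   for g, p, q in zip(gold, pred1, pred2)) / len(gold)

    def sum_ensemble(probs1, probs2):          # dicts: sense -> probability
        senses = set(probs1) | set(probs2)
        return max(senses, key=lambda s: probs1.get(s, 0) + probs2.get(s, 0))

    print(sum_ensemble({"cord": 0.6, "queue": 0.4},
                       {"cord": 0.3, "queue": 0.7}))   # queue (0.9 vs 1.1)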
36
Best Combinations
Data      Set 1 (accuracy)   Set 2 (accuracy)    Base    Maj.    Ens.    Opt.
Sval2     Unigrams 55.3%     P-1,P0,P1 55.3%     43.6%   47.7%   57.0%   67.9%
Sval1     Unigrams 66.9%     P-1,P0,P1 68.0%     57.6%   56.3%   71.1%   78.0%
line      Unigrams 74.5%     P-1,P0,P1 60.4%     55.1%   54.3%   74.2%   82.0%
hard      Bigrams 89.5%      Head, Par 87.7%     86.1%   81.5%   88.9%   91.3%
serve     Unigrams 73.3%     P-1,P0,P1 73.0%     58.4%   42.2%   81.6%   89.9%
interest  Bigrams 79.9%      P-1,P0,P1 78.8%     67.6%   54.9%   83.2%   90.1%
37
Path Map
Introduction Background Data Experiments Conclusions
38
Conclusions
Significant amount of complementarity across lexical and syntactic features; combining the two is justified
Part of speech of the word immediately to the right of the target word found most useful
POS of words immediately to the right of the target word best for verbs and adjectives
Nouns helped by tags on either side
Head word of the phrase particularly useful for adjectives
Nouns helped by both head and parent
39
Other Contributions
Converted line, hard, serve and interest data into Senseval-2 data format
Part of speech tagged and parsed the Senseval-2, Senseval-1, line, hard, serve and interest data
Developed the Guaranteed Pre-tagging mechanism to improve the quality of POS tagging
Showed that guaranteed pre-tagging improves WSD
40
Code, Data, Resources and Publication
posSenseval: part of speech tags any data in Senseval-2 data format
parseSenseval: parses data in the format output by the Brill Tagger; output is in Senseval-2 data format with part of speech and parse information as XML tags
Packages to convert line, hard, serve and interest data to Senseval-1 and Senseval-2 data formats
BrillPatch: patch to the Brill Tagger to employ Guaranteed Pre-Tagging
http://www.d.umn.edu/~tpederse/data.html
Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
Collins Parser: http://www.ai.mit.edu/people/mcollins
"Guaranteed Pre-Tagging for the Brill Tagger", Mohammad and Pedersen, Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), February 2003, Mexico
41
Thank You