Discriminating Word Senses Using McQuitty’s Similarity Analysis
Amruta Purandare
University of Minnesota, Duluth
Advisor: Dr. Ted Pedersen
Research supported by National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)

Discriminating “line”
They will begin line formation before ceremony
Connect modem to any jack on your line
Quit printing after the last line of each file
Your line will not get tied while you are connected to net
Stand balanced and comfortable during line up
Lines that do not fit a page are truncated
New line service provides reliable connections
Pages are separated by line feed characters
They stand far right when in line formation

They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Introduction
• What is Word Sense Discrimination?
• Unsupervised learning
[Diagram labels: Training, Test, Features, Feature Vectors, Clusters]

Representing context
• Features (from training)
  – Bigrams
  – Unigrams
  – Second Order Co-occurrences / SOCs (Schütze, 1998)
  – Mixture
• Feature vectors (binary)
• Measuring similarity
  – Cosine
  – Match
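
The two similarity measures can be sketched for binary feature vectors as follows. This is a minimal illustration, not the toolkit's code: the function names are hypothetical, and it assumes that "match" means the count of features present in both contexts and that cosine scales that count by the sizes of the two vectors.

```python
import math

def match(x, y):
    """Count the features that are present (1) in both binary vectors."""
    return sum(a & b for a, b in zip(x, y))

def cosine(x, y):
    """Match count divided by the geometric mean of the two vectors' feature counts."""
    norm = math.sqrt(sum(x) * sum(y))
    return match(x, y) / norm if norm else 0.0

# Two contexts over four features (made-up vectors for illustration).
x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
shared = match(x, y)
scaled = cosine(x, y)
```

Under these assumptions, match grows with the number of features in a context, while cosine stays between 0 and 1 regardless of context length.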

Feature examples
<features> for line

Unigram
<blank> <text> <service> <connection> <modem>
<paragraph> <jack> <reliable> <circuit> <file>

Bigram
<blank, line> <text, line> <text, paragraph> <blank, space>
<line, service> <modem, jack> <phone, service> <connection, line>
<reliable, connection>

SOCs
<space> <paragraph> <phone> <reliable>
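
As a rough sketch of how the unigram features above could become a binary feature vector for one context: the helper name `binary_vector` is hypothetical, and lowercasing plus whitespace tokenization with exact string matching are assumptions made only for this example.

```python
# Unigram features for "line", in the order listed on the slide.
UNIGRAMS = ["blank", "text", "service", "connection", "modem",
            "paragraph", "jack", "reliable", "circuit", "file"]

def binary_vector(context, features):
    """Return 1 for each feature that occurs as a token in the context, else 0."""
    tokens = set(context.lower().split())
    return [1 if feature in tokens else 0 for feature in features]

vec = binary_vector("New line service provides reliable connections", UNIGRAMS)
```

With exact matching, "connections" does not fire the <connection> feature, so only <service> and <reliable> are set for this context.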

McQuitty’s method
• Pedersen & Bruce, 1997
• Agglomerative
• UPGMA / Average Link
• Stopping rules
  – Number of clusters
  – Score cutoff
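
The bullets above can be sketched as a small agglomerative loop over a similarity matrix. This is a minimal sketch, not the authors' PDL implementation: the function name `mcquitty_cluster` and its parameters are hypothetical, and it assumes the update rule usually attributed to McQuitty, in which the similarity between a merged cluster and any other cluster is the plain average of the two merged clusters' similarities to it. A size-weighted average (UPGMA) would change only the averaging step.

```python
def mcquitty_cluster(sim, num_clusters=1, score_cutoff=None):
    """Cluster items given a symmetric similarity matrix (list of lists).

    Stops when num_clusters remain, or earlier if the best available
    merge scores below score_cutoff.
    """
    assert num_clusters >= 1
    # Each active cluster id maps to the original item indices it contains.
    clusters = {i: [i] for i in range(len(sim))}
    # Similarities between active clusters, keyed by (smaller id, larger id).
    s = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(sim)
    while len(clusters) > num_clusters:
        a, b = max(s, key=s.get)  # most similar pair of active clusters
        if score_cutoff is not None and s[(a, b)] < score_cutoff:
            break
        merged = clusters.pop(a) + clusters.pop(b)
        # Similarity of the new cluster to each remaining cluster:
        # plain average of the two merged clusters' similarities.
        new_sims = {}
        for k in clusters:
            s_ak = s[(min(a, k), max(a, k))]
            s_bk = s[(min(b, k), max(b, k))]
            new_sims[(k, next_id)] = (s_ak + s_bk) / 2.0
        # Drop entries that involve the two clusters that no longer exist.
        s = {pair: v for pair, v in s.items() if a not in pair and b not in pair}
        s.update(new_sims)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Toy similarity matrix (made-up values): items 0 and 1 are close, as are 2 and 3.
toy = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
two_clusters = mcquitty_cluster(toy, num_clusters=2)
```

Both stopping rules from the slide appear as parameters: `num_clusters` fixes the number of clusters, and `score_cutoff` halts merging once the best remaining pair is too dissimilar.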

Evaluation

        S1   S2   S3   S4
C1      10    0    3    2
C2       1    1    7    1
C3       2    1    1    6
C4       2   15    1    2

With columns reordered and totals added:

        S1   S3   S4   S2   Total
C1      10    3    2    0      15
C2       1    7    1    1      10
C3       2    1    6    1      10
C4       2    1    2   15      20
Total   15   12   11   17      55
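
The column reordering in the second table can be reproduced by searching for the assignment of senses to clusters that maximizes the diagonal. The sketch below is an illustration only: the function name is hypothetical, it assumes a square cluster-by-sense matrix small enough for brute force, and it assumes accuracy is the diagonal total divided by the number of instances, which the slides do not state explicitly.

```python
from itertools import permutations

def best_sense_mapping(counts):
    """Return (column order, diagonal total) that maximizes the diagonal sum."""
    n = len(counts)
    best = max(permutations(range(n)),
               key=lambda order: sum(counts[r][order[r]] for r in range(n)))
    return best, sum(counts[r][best[r]] for r in range(n))

# Rows C1..C4, columns S1..S4, as in the first evaluation table.
counts = [[10, 0, 3, 2],
          [1, 1, 7, 1],
          [2, 1, 1, 6],
          [2, 15, 1, 2]]
order, correct = best_sense_mapping(counts)
total = sum(map(sum, counts))
accuracy = correct / total
```

Here `order` lists the 0-based sense index placed in each column, which corresponds to S1, S3, S4, S2, and the diagonal holds 38 of the 55 instances.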

Majority Sense Classifier
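
As a hedged illustration of this baseline, assuming it assigns every instance to the most frequent sense: the function name is hypothetical, and the counts are the sense totals from the evaluation example (S1 = 15, S2 = 17, S3 = 12, S4 = 11).

```python
def majority_baseline(sense_counts):
    """Return the most frequent sense and the accuracy of always predicting it."""
    sense = max(sense_counts, key=sense_counts.get)
    return sense, sense_counts[sense] / sum(sense_counts.values())

sense, baseline = majority_baseline({"S1": 15, "S2": 17, "S3": 12, "S4": 11})
```

Under the same assumption, a uniform distribution over 6 senses gives a baseline of 1/6.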

Experimental Data

             Line              Senseval-2
#Senses      6                 Variable (selected top 5)
#Instances   4146 (1200:600)   120/word, 73 words (100-150:50-100)

Scope of the experiments
• 584 experiments (73 * 4 * 2)
  – 73 words: 72 Senseval-2, LINE
  – 4 features: bigrams, unigrams, SOCs, mix
  – 2 similarity measures: match, cosine
• Window = 5
  – for bigrams and SOCs
• Frequency cutoff = 2
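
The experiment count above is a full cross product of words, feature types, and similarity measures. A small sketch of that grid, with placeholder word names standing in for the 72 Senseval-2 words (the names are invented for illustration):

```python
from itertools import product

# Placeholder identifiers for the 72 Senseval-2 words, plus LINE.
words = [f"senseval_word_{i:02d}" for i in range(1, 73)] + ["line"]
features = ["bigrams", "unigrams", "SOCs", "mix"]
measures = ["match", "cosine"]

# One experiment per (word, feature type, similarity measure) combination.
experiments = list(product(words, features, measures))
num_experiments = len(experiments)
```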

Senseval-2 Results POS wise

Number of words of each POS for which the experiment obtained accuracy higher than the majority sense classifier.

N
        COS   MAT
SOC       6     7
BI        5     3
UNI       7     8

ADJ
        COS   MAT
SOC       1     1
BI        0     0
UNI       1     0

V
        COS   MAT
SOC      11     6
BI        5     5
UNI      13     9

Senseval-2 Results Feature wise

SOC
        COS   MAT
N         6     7
V        11     6
ADJ       1     1

UNI
        COS   MAT
N         7     8
V        13     9
ADJ       1     0

BI
        COS   MAT
N         5     3
V         5     5
ADJ       0     0

Senseval-2 Results Measure wise

COS
        SOC    BI   UNI
N         6     5     7
V        11     5    13
ADJ       1     0     1

MAT
        SOC    BI   UNI
N         7     3     8
V         6     5     9
ADJ       1     0     0

Line Results

        COS    MAT
SOC    0.25   0.23
BI     0.19   0.18
UNI    0.21   0.20

On a uniform distribution of 6 senses

Sample Confusion Table (fine.soc.cos)

            S0      S1      S2      S3      S4   Total       %
             2       0       0       1       0       3    5.00
             1       0       4       2       0       7   11.67
             2       5      25       2       4      38   63.33
             1       0       0       9       0      10   16.67
             1       0       1       0       0       2    3.33
Total        7       5      30      14       4      60
%        11.67    8.33   50.00   23.33    6.67

S0 = elegant
S1 = small grained
S2 = superior
S3 = satisfactory
S4 = thin
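
The row and column percentages in the confusion table follow directly from the counts. A short sketch of that arithmetic, with a hypothetical helper name and assuming each percentage is a row or column total divided by the 60 instances:

```python
def marginal_percentages(counts):
    """Return (row percentages, column percentages) of the grand total, rounded to 2 places."""
    total = sum(map(sum, counts))
    row_pct = [round(100 * sum(row) / total, 2) for row in counts]
    col_pct = [round(100 * sum(col) / total, 2) for col in zip(*counts)]
    return row_pct, col_pct

# Counts from the confusion table; columns are S0..S4.
fine_counts = [[2, 0, 0, 1, 0],
               [1, 0, 4, 2, 0],
               [2, 5, 25, 2, 4],
               [1, 0, 0, 9, 0],
               [1, 0, 1, 0, 0]]
row_pct, col_pct = marginal_percentages(fine_counts)
```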

Conclusions
• Small set of SOCs was powerful
  – Half the number of unigrams/bigrams
• Scaling done by Cosine helps!
• Need more training data!
• Need to improve feature…
  – selection (tests of association)
  – extraction (stemming)
  – matching (fuzzy matching)
  …strategies for bigrams
• Explore new features
  – POS
  – Collocations

Recent work
• PDL implementation
• Cluto - Clustering Toolkit
  http://www-users.cs.umn.edu/~karypis/cluto
  – 6 clustering methods, 12 merging criteria
• Plans
  – Comparing clustering in similarity space vs. vector space (Schütze, 1998)
  – Stopping rules

Sense labeling

They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Software Packages
• SenseClusters (Our Discrimination Toolkit)
  http://www.d.umn.edu/~tpederse/senseclusters.html
• PDL (Used to implement clustering algorithms)
  http://pdl.perl.org/
• NSP (Used for extracting features)
  http://www.d.umn.edu/~tpederse/nsp.html
• SenseTools (Used for preprocessing, feature matching)
  http://www.d.umn.edu/~tpederse/sensetools.html
• Cluto (Clustering Toolkit)
  http://www-users.cs.umn.edu/~karypis/cluto