Filter keywords and majority class strategies for company name disambiguation on Twitter
Damiano Spina, Enrique Amigó and Julio Gonzalo
{damiano,enrique,julio}@lsi.uned.es
UNED NLP & IR Group
CLEF 2011 Conference September 19-22, Amsterdam
Goal
• Two signals coming from intuition:
– Filter keywords
– Majority Class
• Do they help characterize and solve the problem?
WePS-3 Online Reputation Management Task
Tweets for query «jaguar»
• related tweets = 8
• unrelated tweets = 2
• Related ratio = 8/(8+2) = 0.8
Tweets for query «orange»
• related tweets = 0
• unrelated tweets = 10
• Related ratio = 0/(0+10) = 0
Tweets for query «apple»
• related tweets = 5
• unrelated tweets = 5
• Related ratio = 5/(5+5) = 0.5
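The related ratio used in the three examples above can be sketched in a few lines. The function name `related_ratio` and the `counts` dictionary are illustrative choices, not something taken from the slides; only the counts themselves come from the examples.

```python
def related_ratio(related: int, unrelated: int) -> float:
    """Fraction of a query's annotated tweets that refer to the company."""
    total = related + unrelated
    if total == 0:
        raise ValueError("no annotated tweets for this query")
    return related / total


# (related, unrelated) counts from the three example queries above.
counts = {"jaguar": (8, 2), "orange": (0, 10), "apple": (5, 5)}
ratios = {query: related_ratio(r, u) for query, (r, u) in counts.items()}
```

A ratio near 0 or 1 means one class dominates for that company name; a ratio near 0.5 means the name is genuinely ambiguous in the sample.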
Fingerprint representation
WePS-3 Task 2 Systems
Filter keywords
Tweets for query «apple»
• positive keyword: store
– 4 tweets annotated as «related»
• negative keyword: eating
– 2 tweets annotated as «unrelated»
• Accuracy = 1.0
• Recall = 60%
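The «apple» example can be reproduced with a small sketch. It assumes that "recall" here means the share of tweets that a filter keyword decides (6 of 10), and that accuracy is measured only on those decided tweets. The tweet texts are invented for illustration; only the keywords and the counts follow the slide.

```python
from typing import List, Optional, Set, Tuple


def classify_by_keywords(text: str, positive: Set[str],
                         negative: Set[str]) -> Optional[bool]:
    """Return True (related), False (unrelated), or None if undecided."""
    tokens = set(text.lower().split())
    has_pos = bool(tokens & positive)
    has_neg = bool(tokens & negative)
    if has_pos and not has_neg:
        return True
    if has_neg and not has_pos:
        return False
    return None  # no keyword, or conflicting keywords


def accuracy_and_recall(tweets: List[Tuple[str, bool]], positive: Set[str],
                        negative: Set[str]) -> Tuple[float, float]:
    """Accuracy over decided tweets; recall as the share of tweets decided."""
    decided = correct = 0
    for text, gold in tweets:
        predicted = classify_by_keywords(text, positive, negative)
        if predicted is None:
            continue
        decided += 1
        correct += predicted == gold
    accuracy = correct / decided if decided else 0.0
    recall = decided / len(tweets) if tweets else 0.0
    return accuracy, recall


# Hypothetical tweets: 5 related, 5 unrelated, as in the «apple» example.
tweets = [
    ("new apple store opening downtown", True),
    ("queue outside the apple store", True),
    ("apple store has the new ipod", True),
    ("bought it at the apple store", True),
    ("apple announces a keynote", True),
    ("eating an apple for lunch", False),
    ("eating apple pie tonight", False),
    ("apple juice is the best", False),
    ("picked an apple from the tree", False),
    ("green apple or red apple", False),
]
accuracy, recall = accuracy_and_recall(tweets, {"store"}, {"eating"})
```

With these inputs the six keyword-bearing tweets are all labelled correctly and the remaining four stay undecided, which matches the accuracy and recall on the slide.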
| Company name | Positive keywords | Negative keywords |
| --- | --- | --- |
| amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian |
| fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing |
| ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci |

Manual keywords (perfect for a Web user)
| Company name | Positive keywords | Negative keywords |
| --- | --- | --- |
| amazon | sale, books, deal, deals, gift | followdaibosyu, pest, plug, brothers, pirotta |
| fox | money, weather, leader, denouncing, viewers | megan, matthew, lazy, valley, michael |
| ford | mustang, focus, hybrid, motor, truck | tom, harrison, rob, bring, coppola |

Oracle keywords (perfect on Twitter)
Upper bound of Filter Keywords
Oracle keywords
– 5 oracle keywords ≈ 30% recall
– 20 oracle keywords ≈ 50% recall
Manual keywords
– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy
Twitter ≠ Web
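The slides do not spell out how oracle keywords are selected. As a rough sketch only, the code below assumes an oracle keyword is a term that occurs exclusively in related tweets (positive) or exclusively in unrelated tweets (negative) of the annotated data, and that the top `n` are those covering the most tweets. The function name, the whitespace tokenisation, and the tie-breaking rule are all assumptions.

```python
from collections import Counter
from typing import List, Set, Tuple


def oracle_keywords(tweets: List[Tuple[str, bool]], n: int,
                    query: str) -> Tuple[Set[str], Set[str]]:
    """Pick the n terms seen in only one class that cover the most tweets."""
    pos_df, neg_df = Counter(), Counter()
    for text, related in tweets:
        terms = set(text.lower().split()) - {query}  # skip the query term
        (pos_df if related else neg_df).update(terms)

    candidates = []
    for term in set(pos_df) | set(neg_df):
        if pos_df[term] and neg_df[term]:
            continue  # occurs in both classes, so not a perfect filter
        is_positive = pos_df[term] > 0
        candidates.append((pos_df[term] + neg_df[term], term, is_positive))

    # Highest coverage first; alphabetical order breaks ties (assumption).
    candidates.sort(key=lambda c: (-c[0], c[1]))
    top = candidates[:n]
    positive = {term for _, term, is_pos in top if is_pos}
    negative = {term for _, term, is_pos in top if not is_pos}
    return positive, negative


# Hypothetical annotated tweets for the query «apple».
sample = [
    ("apple store opening", True),
    ("apple store queue", True),
    ("apple keynote today", True),
    ("eating apple pie", False),
    ("eating apple now", False),
]
positive, negative = oracle_keywords(sample, 2, "apple")
```

Because the selection looks at the gold annotations, keywords chosen this way are perfectly accurate by construction, which is why their recall serves as an upper bound for the filter-keyword approach.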
Majority Class
Tweets for query «jaguar»
• related tweets = 8
• unrelated tweets = 2
• Related ratio = 8/(8+2) = 0.8
• Accuracy = 0.80
• Recall = 100%
Upper bound of Majority Class (winner-takes-all)
• For each test case / company name
– all unrelated or all related
• Optimal decision
– 0.80 accuracy
– ≈ best manual system (0.83)
– > best automatic system (0.75)
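The optimal winner-takes-all decision can be sketched as follows: for each company, every tweet receives whichever class is more frequent in the gold annotations, so per-company accuracy is the larger of the two class shares. Averaging over companies with equal weight is an assumption made here, and the three counts are the toy examples from earlier slides, not the test set behind the 0.80 figure.

```python
from typing import Dict, Tuple


def winner_takes_all_accuracy(related: int, unrelated: int) -> float:
    """Accuracy when all tweets get the company's true majority class."""
    total = related + unrelated
    if total == 0:
        raise ValueError("no annotated tweets for this company")
    return max(related, unrelated) / total


def majority_upper_bound(counts: Dict[str, Tuple[int, int]]) -> float:
    """Mean per-company accuracy of the optimal all-or-nothing decision."""
    per_company = [winner_takes_all_accuracy(r, u) for r, u in counts.values()]
    return sum(per_company) / len(per_company)


# (related, unrelated) counts from the three example queries.
counts = {"jaguar": (8, 2), "orange": (0, 10), "apple": (5, 5)}
bound = majority_upper_bound(counts)  # mean of 0.8, 1.0 and 0.5
```

The bound is high when most companies are skewed like «jaguar» or «orange», and falls toward 0.5 when companies resemble «apple».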
Filter keywords + majority class upper bound
[Diagram: Tweets; Filter keywords (oracle or manual); Majority Class?]
(1) winner-takes-all
[Diagram: Tweets; Filter keywords (oracle or manual); Majority Class]
(2) winner-takes-remainder
[Diagram: Tweets; Majority Class; Filter keywords (oracle or manual)]
(3) bootstrapping
[Diagram: Tweets; Filter keywords (oracle or manual); Machine learning (training, application)]
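The diagrams for the three combination strategies did not survive in this transcript, so the sketch below rests on one reading of their names. It assumes that in winner-takes-all the class that dominates among keyword-covered tweets is assigned to every tweet of the company, and that in winner-takes-remainder the covered tweets keep their keyword labels while only the uncovered ones receive the dominant class. Bootstrapping, which would train a classifier on the covered tweets and apply it to the rest, is left out. All function names and the tie-breaking rule are hypothetical.

```python
from typing import List, Optional


def seed_majority(seed: List[Optional[bool]]) -> bool:
    """Dominant class among keyword-covered tweets (None = uncovered)."""
    decided = [label for label in seed if label is not None]
    # Assumption: ties and empty seeds default to "related".
    return sum(decided) * 2 >= len(decided)


def winner_takes_all(seed: List[Optional[bool]]) -> List[bool]:
    """Every tweet of the company gets the dominant seed class."""
    winner = seed_majority(seed)
    return [winner] * len(seed)


def winner_takes_remainder(seed: List[Optional[bool]]) -> List[bool]:
    """Covered tweets keep their label; uncovered ones get the winner."""
    winner = seed_majority(seed)
    return [winner if label is None else label for label in seed]


# Seed labels for ten tweets: 4 related, 2 unrelated, 4 uncovered.
seed = [True, True, True, True, False, False, None, None, None, None]
all_labels = winner_takes_all(seed)
remainder_labels = winner_takes_remainder(seed)
```

Under this reading, winner-takes-all trades the precise keyword decisions for a uniform label, while winner-takes-remainder preserves them and uses the majority only where no keyword applies.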
Filter keywords + majority class
≈ ‘all related’ baseline
Filter keywords + majority class baseline
• Automatic Discovery of Filter Keywords:
[Diagram: Terms; Keyword Classification; Filter keywords (automatic)]
– 13 term features:
  • 3 collection-based features
  • 6 Web-based features
  • 4 expanded-by-co-occurrence features
– 3 classification methods:
  • Machine learning (neural net + all features)
  • Heuristic (2 features: col_c_specificity + cooc_om_assoc)
  • Hybrid (neural net + heuristic's features)
Automatic Tweets Classification
[Bar chart of accuracy. Values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48. Series: WePS-3 systems (manual); WePS-3 systems (automatic); Filter keywords + Majority Class baseline]
Conclusions
• Fingerprint representation
– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus
• Twitter ≠ Web
– Oracle keywords ≠ Manual keywords
• Filter keywords & majority class strategies
– Useful signals to help solve the problem
– Both signals alone already give competitive performance