Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams
Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams
David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
What is "anema"?
[email protected] | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
For content interpretation and complex filtering tasks we want to know who/what people talk about.
Entity Linking
[Figure: entity linking example. The query q "TANK" in document r is matched against the knowledge base (KB): TANK (VEHICLE)? TANK JOHNSON?]
Named Entity Recognition
Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB
Challenge 1: Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has
Challenge 1: Entity Importance
Can we leverage today's entities to learn to predict tomorrow's entities?
74/140
Challenge 2: Noisy data & changing KB
Unsupervised method for generating pseudo-ground truth (for training NER)
Assumption
A named-entity recognizer trained only on KB entities will learn to recognize KB entities
[Figure, built up over several slides: pipeline for generating pseudo-ground truth and evaluating it]
Training: Tweet corpus -> Tweet -> Entity linker (today's KB, the "small KB") -> linked mentions (m1, e1), (m2, e2) -> Sample corpus -> Training data (m1, c1), (m2, c2) -> NERC -> NERC model
Prediction: Unlabeled tweet -> NERC model -> Predictions (m1, c1), (m2, c2), …
Ground truth: Tweet -> Entity linker (future KB, the "full KB") -> Ground truth (m1, c1), (m2, c2), …
Evaluate: Predictions against Ground truth

Example tweet for the entity linker and the training data:
This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
Class labels shown on the slide: organization, organization, product

Example unlabeled tweet:
Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti-America Rant http://on.cc.com/1htk9Wo
Evaluation
• Mention level (NER style)
• Entity level
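The two evaluation levels can be illustrated with a small sketch. It assumes one plausible reading of the slide: mention level scores every (tweet, mention, class) occurrence, while entity level first collapses occurrences to unique entities. The function names and the toy data are invented for illustration and are not taken from the talk.

```python
from typing import List, Set, Tuple

# (tweet_id, mention_text, entity_class); a hypothetical record layout
Mention = Tuple[str, str, str]


def precision_recall(predicted: Set, truth: Set) -> Tuple[float, float]:
    """Precision and recall of a predicted set against a ground-truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def mention_level(predicted: List[Mention], truth: List[Mention]) -> Tuple[float, float]:
    # NER style: every occurrence in every tweet counts separately.
    return precision_recall(set(predicted), set(truth))


def entity_level(predicted: List[Mention], truth: List[Mention]) -> Tuple[float, float]:
    # Collapse occurrences to unique (mention text, class) pairs, ignoring the tweet.
    pred_entities = {(text.lower(), cls) for _, text, cls in predicted}
    true_entities = {(text.lower(), cls) for _, text, cls in truth}
    return precision_recall(pred_entities, true_entities)


# Toy data, invented for this example.
predicted = [("t1", "IBM", "ORG"), ("t2", "IBM", "ORG"), ("t2", "Anema", "PER")]
truth = [
    ("t1", "IBM", "ORG"),
    ("t2", "IBM", "ORG"),
    ("t3", "IBM", "ORG"),
    ("t2", "Jillert Anema", "PER"),
]

mention_scores = mention_level(predicted, truth)
entity_scores = entity_level(predicted, truth)
```

In this toy case a frequently mentioned entity dominates the mention-level score, while the entity-level score weighs each unique entity once.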
[Figure: the example tweet "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I." shown twice, once annotated as Prediction and once as Ground Truth]
[Figure: tweets and entities, compared between Prediction and Ground Truth]
Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
Components:
EL: Semanticizer
NERC: two-stage approach [1]
Tweet: "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
Entity linker output (truncated):
"links": [
  { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 },
  { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 },
  { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 },
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
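The recognition stage outputs one B, I, or O tag per token, and entity spans are read off those tags. The sketch below shows only that decoding step, not the structured perceptron that predicts the tags. The function name `bio_to_spans` and the tokenization (whitespace split, comma dropped) are assumptions made for the example.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn per-token B/I/O tags into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # B opens a new span; a stray I without a preceding B is treated as B.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            # O closes the span that is currently open.
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        # Close a span that runs to the end of the tweet.
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Example tweet from the slides, with an assumed tokenization.
tokens = ["ASAP", "Rocky", "and", "Kendrick", "Lamar", "that's",
          "when", "I", "started", "listening", "again"]
tags = ["B", "I", "O", "B", "I", "O", "O", "O", "O", "O", "O"]

spans = bio_to_spans(tokens, tags)
```

Each resulting span would then be passed to the second stage, which assigns a class such as PER, LOC, or ORG.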
Example: "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
1. Recognition: ASAP/B Rocky/I and/O Kendrick/B Lamar,/I that's/O when/O I/O started/O listening/O again/O
2. Classification: ASAP Rocky: person, Kendrick Lamar: person
From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
2. Pick "good" samples
• entity linker's confidence score
• textual quality
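Step 1 can be sketched as projecting the linked mentions back onto the tweet's tokens. This is a minimal illustration, assuming whitespace tokens, exact string matching, and a hypothetical "class" field that has already been filled in from the entity class lookup; the combined B-PER / I-PER label format is also a choice made for this example.

```python
from typing import Dict, List


def label_tweet(tokens: List[str], links: List[Dict[str, str]]) -> List[str]:
    """Project linked mentions onto tokens as B-CLASS / I-CLASS / O labels."""
    labels = ["O"] * len(tokens)
    for link in links:
        mention = link["text"].split()
        n = len(mention)
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Only label tokens that no earlier link has claimed.
            free = all(lab == "O" for lab in labels[start:start + n])
            if window == mention and free:
                labels[start] = "B-" + link["class"]
                for j in range(start + 1, start + n):
                    labels[j] = "I-" + link["class"]
                break  # label the first match only
    return labels


# Example tweet from the slides; tokenization and "class" values are assumed.
tokens = ["ASAP", "Rocky", "and", "Kendrick", "Lamar", "that's",
          "when", "I", "started", "listening", "again"]
links = [
    {"text": "ASAP Rocky", "class": "PER"},
    {"text": "Kendrick Lamar", "class": "PER"},
]

labels = label_tweet(tokens, links)
```

Tokens that no link covers stay O, which is why the choice of "good" samples in step 2 matters: missed entities become wrong O labels in the training data.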
Tweet: "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
Entity linker output (truncated):
"links": [
  { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 },
  { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 },
Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?
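The class rules above amount to a lookup table. The sketch below encodes exactly the classes listed on the slide; the function name `nerc_class` and the example type lists are invented, and the lookup from a Wikipedia page to its DBpedia types (step 1) is assumed to have happened already.

```python
from typing import Iterable, Optional

# DBpedia ontology class -> NERC class, as listed on the slide.
NERC_CLASS = {
    "Person": "PER",
    "Organisation": "ORG",
    "Company": "ORG",
    "Non-ProfitOrganisation": "ORG",
    "Place": "LOC",
    "PopulatedPlace": "LOC",
    "City": "LOC",
    "Country": "LOC",
}


def nerc_class(dbpedia_types: Iterable[str]) -> Optional[str]:
    """Return PER/ORG/LOC for the first mapped type, or None if nothing maps."""
    for dbpedia_type in dbpedia_types:
        if dbpedia_type in NERC_CLASS:
            return NERC_CLASS[dbpedia_type]
    return None


# Illustrative type lists (assumed, not retrieved from DBpedia here).
artist = nerc_class(["Agent", "Person", "MusicalArtist"])
city = nerc_class(["City", "Place"])
album = nerc_class(["Work", "Album"])
```

Entities whose types fall outside the table return None here; how the talk handles such entities is left open by the slide.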
Sampling Methods
1. Entity linker confidence score
2. Textual quality
Sampling 1: Confidence Score
• Extract mappings from anchor text (a) to Wikipedia page (W)
• Confidence score combines two signals:
1. How commonly a is used as a link
2. How commonly a is used as a link to W
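The two signals can be computed from anchor-text counts. The sketch below is an illustration with invented counts; the function names are hypothetical, and combining the signals as a product is an assumption, since the slide does not say how they are combined.

```python
from collections import Counter


def link_probability(times_linked: int, times_seen: int) -> float:
    """Signal 1: fraction of occurrences of anchor text a that are links."""
    return times_linked / times_seen if times_seen else 0.0


def commonness(targets: Counter, page: str) -> float:
    """Signal 2: fraction of links with anchor text a that point to page W."""
    total = sum(targets.values())
    return targets[page] / total if total else 0.0


# Invented counts for the anchor text "tank".
targets = Counter({"Tank": 60, "Tank Johnson": 30, "Tank (band)": 10})

lp = link_probability(times_linked=100, times_seen=400)
c = commonness(targets, "Tank Johnson")

# Assumption: combine the two signals by multiplying them.
score = lp * c
```

With these counts, "tank" is a link in a quarter of its occurrences and points to the "Tank Johnson" page in 30 of 100 links, so the assumed combined score is low.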
Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = more entities, more noise
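The effect of the threshold can be seen on the entity linker output for the example tweet "ASAP Rocky and Kendrick Lamar, that's when I started listening again". The sketch assumes that the threshold is applied to the senseProbability field; which field the authors threshold on is not stated on the slide, and the function name `filter_links` is invented.

```python
from typing import Dict, List

# Values copied from the entity linker output shown in the slides.
links = [
    {"text": "ASAP", "title": "ASAP (variety show)",
     "senseProbability": 0.11690647482014388},
    {"text": "ASAP Rocky", "title": "ASAP Rocky",
     "senseProbability": 0.9333333333333333},
    {"text": "Kendrick Lamar", "title": "Kendrick Lamar",
     "senseProbability": 0.9533333333333334},
]


def filter_links(links: List[Dict], threshold: float,
                 key: str = "senseProbability") -> List[Dict]:
    """Keep only links whose confidence score reaches the threshold."""
    return [link for link in links if link[key] >= threshold]


# A high threshold drops the spurious "ASAP (variety show)" link.
kept_high = [link["title"] for link in filter_links(links, 0.5)]

# A low threshold keeps all three links, including the noisy one.
kept_low = [link["title"] for link in filter_links(links, 0.1)]
```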
Sampling 2: Textual quality
Highest scoring tweets
1. Watching the History channel, Hitler's Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore Girls, Part One _URL_
3. this girl better not go off on me rt
Sampling 2: Textual quality
• Compare different sampling strategies:
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)
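The four strategies can be sketched as unions of quality bins. This is only an illustration: the textual quality scores are taken as given, the split into three equal-sized bins is an assumption, and the function names and toy scores are invented.

```python
from typing import Dict, List, Tuple


def split_by_quality(scored: List[Tuple[str, float]]) -> Dict[str, List[str]]:
    """Rank tweets by quality score and cut the ranking into three bins."""
    ranked = [tweet for tweet, _ in sorted(scored, key=lambda pair: pair[1])]
    third = len(ranked) // 3  # assumed: equal-sized bins
    return {
        "low": ranked[:third],
        "medium": ranked[third:2 * third],
        "top": ranked[2 * third:],
    }


def strategies(bins: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build the four training sets compared on the slide."""
    return {
        "top": bins["top"],
        "medium": bins["medium"],
        "medium+top": bins["medium"] + bins["top"],
        "low+medium+top": bins["low"] + bins["medium"] + bins["top"],
    }


# Toy tweets with invented quality scores.
scored = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7), ("e", 0.3), ("f", 0.2)]

bins = split_by_quality(scored)
samples = strategies(bins)
```

The last strategy uses every tweet, so it serves as the no-sampling reference against which the other three are compared.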
Results
RQ1: What is the impact of our sampling methods for generating pseudo-ground truth?
Findings: EL confidence score threshold
1. Higher threshold, higher accuracy
[Figure: precision (solid) and recall (dotted) plotted against the EL confidence score threshold; axis ticks run from 0.1 to 0.9 and from 0 to 35]
2. Higher threshold, more predictions
Findings: Textual Quality Sampling
RQ2: What is the impact of the size of prior knowledge on detecting unknown entities?
Results: RQ2
Sampling 2: KB size (mentions)
[Figure: results by KB size, mention level; axis ticks run from 20% to 90% and from 0 to 50. Blue: our method, red: baseline]
Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows
Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling
Fin
Questions?