Don’t compare Apples to Oranges - Extending GERBIL for a fine-grained NEL Evaluation
Jörg Waitelonis, Henrik Jürges, Harald Sack
Hasso-Plattner-Institute for IT-Systems Engineering, University of Potsdam
Semantics 2016, Leipzig, Germany, September 12-15th, 2016
Agenda
1. NEL and NEL evaluation
2. Dataset properties and evaluation drawbacks
3. Extending GERBIL
● Building conditional datasets
● Measure dataset characteristics
4. Results
5. Demonstration
6. Summary & Future work
Named Entity Linking (NEL), Principle
Example: “Armstrong landed on the moon.”
● Entity mentions with surface forms: “Armstrong”, “moon”
● Candidates for “Armstrong”: dbr:Neil_Armstrong, dbr:Lance_Armstrong, dbr:Louis_Armstrong, …
● Candidates for “moon”: dbr:Moon, dbr:Lunar, …
● Correct entities: dbr:Neil_Armstrong, dbr:Moon
● Scoring approaches: String Distance, Link Analysis, Vector Space, Fuzzy String Matching, Conditional Random Fields, Random Forest, RankSVM, Learning to Rank, Surface Aggregation, Word Embeddings, Context Similarity Matching
1. Tokenize text
2. Find candidates in the KB
3. Score candidates with a “magic algorithm” and select the best one
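A minimal sketch of this three-step pipeline in Python (the candidate index, the descriptions, and the context-overlap scorer are toy stand-ins for a real knowledge base and ranking algorithm, not the approach of any particular annotator):

# Toy NEL pipeline sketch; CANDIDATES and DESCRIPTIONS are hypothetical stand-ins.
CANDIDATES = {
    "armstrong": ["dbr:Neil_Armstrong", "dbr:Lance_Armstrong", "dbr:Louis_Armstrong"],
    "moon": ["dbr:Moon"],
}
DESCRIPTIONS = {
    "dbr:Neil_Armstrong": "astronaut who landed on the moon",
    "dbr:Lance_Armstrong": "road racing cyclist",
    "dbr:Louis_Armstrong": "jazz trumpeter and singer",
    "dbr:Moon": "natural satellite of the earth",
}

def score(candidate, context):
    # Step 3: the "magic algorithm"; here a trivial context-overlap measure.
    return len(set(DESCRIPTIONS[candidate].split()) & set(context))

def link(text):
    tokens = text.lower().rstrip(".").split()              # Step 1: tokenize text
    annotations = {}
    for token in tokens:
        candidates = CANDIDATES.get(token, [])             # Step 2: find candidates in the KB
        if candidates:
            annotations[token] = max(candidates, key=lambda c: score(c, tokens))
    return annotations

print(link("Armstrong landed on the moon."))
# {'armstrong': 'dbr:Neil_Armstrong', 'moon': 'dbr:Moon'}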
Example annotators: KEA, Wikifier
● Algorithms only approximate the correct entities
● Need for verification and testing
● A dataset consists of:
■ Documents (strings/sentences)
■ Annotations (ground truth)
Named Entity Linking, Evaluation
ACE2004
AIDA/CoNLL
DBpedia Spotlight
IITB
KORE50
MSNBC
Micropost2014
N3-RSS-500
WES2015
N3-Reuters-128
● Traditional measures are:
■ Precision: how many of the returned annotations are correct
■ Recall: how many of the ground-truth annotations are found (completeness)
■ F1-measure: harmonic mean of precision and recall
■ And more, cf. Rizzo et al. [1]
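For gold-standard annotations and returned annotations these measures amount to the following (a minimal micro-averaged sketch, not GERBIL’s actual implementation):

def precision_recall_f1(gold, returned):
    # gold, returned: sets of (document_id, mention, entity) triples
    correct = len(gold & returned)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1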
Named Entity Linking, Benchmarking
● GERBIL - a general entity annotation system (AKSW Leipzig), cf. Usbeck et al. [2]
● Used for testing/optimizing/benchmarking annotators
● Neat web interface
● 13 annotators / 20 datasets
● F-measure too rough for a detailed evaluation
● Developers need dataset insights
Properties of Datasets
● Size of a dataset
● Number of annotations/documents/words
● What types of entities are used? E.g. persons, places, events, …
● Are there documents without annotations? E.g. Microposts 2014
● How popular are the entities? E.g. PageRank, in-degree
● How ambiguous are the entities and surface forms?
● How diverse are the entities and surface forms?
● …
Cf. van Erp et al. [3]
Research Questions and Drawbacks
● How do dataset characteristics influence the evaluation results?
● How does the popularity of entities influence the evaluation results?
● How can a general dataset be used for domain-specific NEL tools?
● How can datasets be compared? Is there something like a general difficulty?
● Limited comparability between benchmark results
● Penalization of good annotators with inappropriate datasets
Cf. van Erp et al. [3]
Extending GERBIL
● Approach for a solution:
■ Adjustable filter system for GERBIL
■ Expose dataset characteristics
■ Datasets and annotators added at runtime are also included
■ Visualize the results
Extending GERBIL, Conditional Datasets
[Figure: the documents of a dataset and the corresponding annotator results are split by rdf:type and by a popularity threshold PR(e) > t into type- and popularity-specific sub-datasets; each sub-dataset and its results are then evaluated separately, yielding per-filter benchmark results.]
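A minimal sketch of such a conditional filter (the annotation tuples, the type map and the PageRank scores are hypothetical placeholders, not GERBIL’s actual data model):

# Illustrative conditional dataset filter; data structures are toy stand-ins.
ENTITY_TYPES = {"dbr:Neil_Armstrong": {"dbo:Person"}, "dbr:Moon": {"dbo:CelestialBody"}}
PAGERANK = {"dbr:Neil_Armstrong": 0.8, "dbr:Moon": 0.9}

def filter_annotations(annotations, rdf_type=None, threshold=None):
    """Keep annotations whose entity has the given rdf:type and/or PR(e) > t."""
    kept = []
    for doc_id, surface_form, entity in annotations:
        if rdf_type is not None and rdf_type not in ENTITY_TYPES.get(entity, set()):
            continue
        if threshold is not None and PAGERANK.get(entity, 0.0) <= threshold:
            continue
        kept.append((doc_id, surface_form, entity))
    return kept

gold = [(1, "Armstrong", "dbr:Neil_Armstrong"), (1, "moon", "dbr:Moon")]
# The same filter is applied to the gold standard and to the annotator results,
# so that each type- or popularity-specific sub-dataset is evaluated on its own.
print(filter_annotations(gold, rdf_type="dbo:Person"))   # person sub-dataset
print(filter_annotations(gold, threshold=0.85))          # popular-entity sub-dataset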
Results, Types
Extending GERBIL, Not Annotated Documents
● Not-annotated documents: the relative amount of documents without any gold annotation (“empty” documents) within a dataset
● Only affects the results if annotators search for entity mentions themselves
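A one-function sketch of the measure, assuming each document carries its list of gold annotations (field names are illustrative):

def not_annotated_ratio(documents):
    # Share of documents without any gold annotation ("empty" documents).
    empty = sum(1 for doc in documents if not doc["annotations"])
    return empty / len(documents) if documents else 0.0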
Extending GERBIL, Density
● Density: the ratio between the number of annotations and the number of words in a document
● Only affects the results if annotators search for entity mentions themselves
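Analogously, a sketch of the density measure for a single document (field names are illustrative):

def annotation_density(document):
    # Number of gold annotations relative to the number of words in the document.
    words = document["text"].split()
    return len(document["annotations"]) / len(words) if words else 0.0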
Extending GERBIL, Likelihood of Confusion
● Likelihood of confusion (level of ambiguity)
● True measures are unknown because exhaustive surface-form collections are missing
● Gives a rough overview of how difficult the disambiguation is
[Figure: example entities (Otto Lilienthal, Bruce Lee, Bruce Willis) and surface forms (“Tegel”, “TXL”, “Airport Tegel”, “Bruce”); several synonymous surface forms can refer to the same entity, while an ambiguous surface form such as “Bruce” can refer to more than one entity.]
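A minimal sketch of how such ambiguity counts can be read off a surface-form/entity vocabulary (toy pairs, illustrative only):

from collections import defaultdict

PAIRS = [  # toy (surface form, entity) vocabulary
    ("Bruce", "dbr:Bruce_Lee"),
    ("Bruce", "dbr:Bruce_Willis"),
    ("Bruce Lee", "dbr:Bruce_Lee"),
    ("Bruce Willis", "dbr:Bruce_Willis"),
]

entities_per_surface_form = defaultdict(set)   # ambiguity: candidate entities per surface form
surface_forms_per_entity = defaultdict(set)    # synonymy: known surface forms per entity
for surface_form, entity in PAIRS:
    entities_per_surface_form[surface_form].add(entity)
    surface_forms_per_entity[entity].add(surface_form)

print(len(entities_per_surface_form["Bruce"]))            # 2 candidates -> ambiguous surface form
print(len(surface_forms_per_entity["dbr:Bruce_Willis"]))  # 2 synonymous surface forms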
Results, Likelihood of Confusion
[Charts: Entities, Surface Forms]
● A high red bar indicates that an entity has a high number of homonyms
● A high blue bar indicates that a surface form has a high number of synonyms
Extending GERBIL, Dominance of Entities
dominance(e) = e(t) / e(v), where e(t) counts the distinct surface forms used for entity e in the test data and e(v) counts all surface forms known for e in the vocabulary
● Expresses the relation between the surface forms actually used for an entity and all its known surface forms
● True measures are unknown
● High rates prevent overfitting
● Prevents repetition of surface forms
[Figure: the entity dbr:Bruce_Willis appears in the test data with the surface forms “Bruce”, “Bruci” and “Bruce Willis”, while the vocabulary additionally lists e.g. “Bruce Walter Willis”.]
Extending GERBIL, Dominance of Surface Forms
dominance(s) = s(t) / s(v), where s(t) counts the distinct entities that surface form s refers to in the test data and s(v) counts all entities it can refer to according to the vocabulary
● Expresses the relation between the entities a mention is actually used for and all entities it could refer to
● True measures are unknown
● High rates prevent overfitting
● Indicates how context-dependent a disambiguation is
[Figure: according to the vocabulary the surface form “Angelina” may refer to dbr:Irene_Angelina, dbr:Angelina_Jordan or dbr:Angelina_Jolie, but the test data may use it for only a subset of these entities.]
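Under this reading, both dominance measures can be sketched as follows (a minimal illustration with hypothetical data structures, not GERBIL’s implementation):

def dominance_of_entity(entity, test_annotations, vocabulary):
    """dominance(e) = e(t) / e(v): distinct surface forms used for e in the
    test data relative to all surface forms known for e in the vocabulary."""
    used = {sf for sf, ent in test_annotations if ent == entity}
    known = {sf for sf, ent in vocabulary if ent == entity}
    return len(used) / len(known) if known else 0.0

def dominance_of_surface_form(surface_form, test_annotations, vocabulary):
    """dominance(s) = s(t) / s(v): distinct entities that s refers to in the
    test data relative to all entities it can refer to in the vocabulary."""
    used = {ent for sf, ent in test_annotations if sf == surface_form}
    known = {ent for sf, ent in vocabulary if sf == surface_form}
    return len(used) / len(known) if known else 0.0

vocabulary = [("Bruce", "dbr:Bruce_Lee"), ("Bruce", "dbr:Bruce_Willis"),
              ("Bruce Willis", "dbr:Bruce_Willis")]
test_annotations = [("Bruce", "dbr:Bruce_Willis")]
print(dominance_of_entity("dbr:Bruce_Willis", test_annotations, vocabulary))   # 0.5
print(dominance_of_surface_form("Bruce", test_annotations, vocabulary))        # 0.5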
Results, Dominance
● A blue bar indicates that a variety of surface forms is used for an entity
● A red bar indicates how context-dependent the disambiguation of a surface form is
[Charts: dominance of surface forms, dominance of entities]
Demo
● http://gerbil.s16a.org/
● https://github.com/santifa/gerbil/
Summary & Future work
■ Summary:
□ Implemented a domain-specific filter system
□ Measure dataset characteristics
□ Annotator results are nearly the same on entities of different popularity
□ Enable specific analyses and optimization of annotators
□ Enable users to select the tool that performs best for a specific domain
■ Future work:
□ Keep up with GERBIL development, increase performance
□ More measurements, e.g. max_recall
□ Dataset remixing ≙ assemble new customized datasets
– e.g. unpopular companies
References
[1] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44-53. CEUR-WS.org, 2015.
[2] M. Röder, R. Usbeck, and A.-C. Ngonga Ngomo. GERBIL's New Stunts: Semantic Annotation Benchmarking Improved. Technical report, Leipzig University, 2016.
[3] M. van Erp, P. Mendes, H. Paulheim, F. Ilievski, J. Plu, G. Rizzo, and J. Waitelonis. Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proc. of the 10th edition of the Language Resources and Evaluation Conference (LREC), Portoroz, Slovenia, 2016.