Comprehensive Evaluation
PRIYA RADHAKRISHNAN, 201050035
INTRODUCTION : Motivation
Entity Linking &
Knowledgebase Enhancement
India successfully sends 'MOM' to Mars
COMPREHENSIVE EVALUATION : ENTITY LINKING & KNOWLEDGEBASE ENHANCEMENT
INTRODUCTION : Problem Statement
India successfully sends 'MOM' to Mars
Mention Detection, Disambiguation and Linking
Entity Categorisation and KB Enhancement
Outline
Definition and Background Information
Literature Survey
Some Applications of EL
Future Work
Q & A
Entity Linking and KB Enhancement
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
INPUT TEXT
Disambiguation
Literature Survey
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Literature Survey : Entity Linking
P0. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In EACL (2006)
P1. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge. In CIKM (2007)
P0 : Using encyclopedic knowledge for named entity disambiguation
Author : Bunescu, R., Pasca, M.
Aim : Detect named entities in text and disambiguate them to their denotations in Wikipedia.
Approach : Disambiguation uses an SVM kernel trained on Wikipedia context and category features.
Results : Using a dataset of 1,783K ambiguous NEs in Wikipedia, they report accuracy of 68% (on 21K queries) and 84% (on 31K queries).
Contribution : This work of Bunescu and Pasca is widely accepted as the first work in this area.
P0
1. Mention Detection - detects whether a proper name refers to a named entity included in the dictionary.
2. Disambiguation = context-article similarity + word-category similarity
score(q, e_n) = cos(q.T, e_n.T) = (q.T · e_n.T) / (||q.T|| · ||e_n.T||)
f(q, e_n) ∈ {0, 1}
Disambiguation: ê = arg max_n score(q, e_n)
3. Detect unlinkable or out-of-Wikipedia entities. Disambiguation produces a confidence score; linking succeeds if the score is above a threshold. When it is below the threshold, the entity is declared out of Wikipedia.
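The score-then-threshold scheme above can be sketched as follows. The sparse-vector representation, the candidate names and the threshold value are illustrative assumptions, not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(query_ctx, candidates, threshold=0.1):
    """Pick the arg-max candidate by cosine score; return None
    (out-of-Wikipedia) when the best score falls below the threshold."""
    best, best_score = None, -1.0
    for name, text_vec in candidates.items():
        s = cosine(query_ctx, text_vec)
        if s > best_score:
            best, best_score = name, s
    return best if best_score >= threshold else None
```

For example, a query context mentioning "mars" and "orbiter" would resolve to a Mars Orbiter Mission candidate rather than a sitcom, while a context sharing no terms with any candidate returns None.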
P1 : Wikify!: Linking documents to encyclopedic knowledge
Author : Mihalcea, R., Csomai, A.
Aim : Use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation.
Approach : Assess the keyphraseness of each candidate mention, then disambiguate using context.
Results : Wikipedia can be used to achieve state-of-the-art results on both tasks. The two methods combine into a system that automatically enriches a text with links to Wikipedia: given an input document, it identifies the important concepts in the text and links them to the corresponding Wikipedia pages.
Contribution : Link any entity mention appearing in text, not just named entities, to its denotation in Wikipedia.
In 2008, Medelyan et al. used Wikipedia as a hyperlinked encyclopaedia. They defined commonness as the ratio of the number of links with a specific target and anchor text to the total number of links with that anchor text.
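The commonness statistic can be sketched as below, assuming link statistics are available as a dictionary of (anchor, target) → link count (a hypothetical representation of the Wikipedia link table):

```python
def commonness(anchor, target, link_counts):
    """Commonness = links with this anchor text AND this target page,
    divided by all links with this anchor text.
    link_counts maps (anchor_text, target_page) -> number of links."""
    total = sum(n for (a, _), n in link_counts.items() if a == anchor)
    return link_counts.get((anchor, target), 0) / total if total else 0.0
```

With 80 of 100 "mom" anchors pointing at Mother, commonness("mom", "Mother", ...) would be 0.8.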
P1
1. Keyword extraction (mention detection). They defined a measure called keyphraseness:
Keyphraseness = (number of documents where the term was already selected as a keyword) / (total number of documents where the term appeared)
2. Link generation (link ambiguity resolved using context).
Two approaches to disambiguation were tried. The first was a knowledge-based approach using the overlap of contexts between q and e.
The second was a data-driven algorithm: a naïve Bayes classifier trained on local features (three words to the left and right, POS tags of neighbours) and global features (five keyphrases occurring at least three times in the contexts defining the word sense) to predict the correct sense.
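The keyphraseness estimate from step 1 can be sketched as below; representing each document as a set of terms plus the subset chosen as keywords is an assumed simplification:

```python
def keyphraseness(term, docs):
    """docs: list of (text, keywords) pairs, where text is the set of terms
    in a document and keywords is the subset selected as Wikipedia links.
    Keyphraseness estimates P(term is a keyword | term appears)."""
    appeared = sum(1 for text, _ in docs if term in text)
    linked = sum(1 for text, keys in docs if term in text and term in keys)
    return linked / appeared if appeared else 0.0
```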
Literature Survey : Measuring Semantic Relatedness
P11. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI (2006)
P2. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI (2007)
P3. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In AAAI (2008)
Semantic Relatedness
Semantic relatedness: how closely two words or texts are related in meaning.
word 1 /text 1 ←→ word 2 /text 2
cricket ←→ sport
Domesticated Animals ↔ Pet Mammals
Producers ↔ Actors ↔ Directors
P11 : WikiRelate! Computing semantic relatedness using Wikipedia
Author : Strube, M., Ponzetto, S.P.
Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmark datasets.
Approach : Apply well-established semantic relatedness measures originally developed for WordNet to the open-domain encyclopaedia Wikipedia.
WordNet measures include Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lesk, and Banerjee & Pedersen (2003)
Google-based SR measure = Hits(i AND j) / (Hits(i) + Hits(j) − Hits(i AND j))
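This hit-count measure is a Jaccard coefficient over page counts and is straightforward to sketch (the hit counts themselves would come from a search engine; here they are plain arguments):

```python
def google_sr(hits_i, hits_j, hits_ij):
    """Jaccard-style relatedness from page-hit counts:
    hits(i AND j) / (hits(i) + hits(j) - hits(i AND j))."""
    denom = hits_i + hits_j - hits_ij
    return hits_ij / denom if denom else 0.0
```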
P11
Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts, and Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose. The best results on this dataset are obtained by integrating Google-, WordNet- and Wikipedia-based measures.
Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.
Contribution : Computing SR requires a semantic resource. WordNet was the de-facto resource for calculating semantic relatedness until this 2006 work of Strube and Ponzetto introduced the idea of using Wikipedia as a semantic source.
P2 : Computing semantic relatedness using Wikipedia-based explicit semantic analysis
Author : Gabrilovich, E., Markovitch, S.
Aim : Create a semantic interpretation of words occurring as Wikipedia titles.
Approach : ESA represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia (i.e. Wikipedia titles). Using machine-learning techniques, the meaning of any text is explicitly represented as a weighted vector of Wikipedia-based concepts.
Relatedness of texts in this space is assessed by comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, ESA substantially improves the correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts.
P2
They build an inverted index, which maps each word to the list of concepts in which it appears. Given a text fragment, the semantic interpreter ranks all Wikipedia concepts by their relevance to the fragment, yielding a concept vector.
Semantic Relatedness (SR) of a pair of text fragments is the cosine metric between their vectors.
Results : The proposed SR measure, ESA, achieved the highest correlation with human judgments (0.75).
Contribution : The highest correlation with human judgments (0.75) thus far; however, the method requires processing the whole Wikipedia text.
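A toy sketch of the ESA pipeline described above, using raw term frequencies in place of the paper's TF-IDF weights (an assumed simplification), with the concept texts supplied as an in-memory dictionary:

```python
import math
from collections import defaultdict

def build_interpreter(concept_texts):
    """Invert concept -> words into word -> {concept: weight}.
    Weights here are raw term counts; the paper uses TF-IDF over articles."""
    index = defaultdict(dict)
    for concept, words in concept_texts.items():
        for w in words:
            index[w][concept] = index[w].get(concept, 0) + 1
    return index

def interpret(text_words, index):
    """Represent a text as a weighted vector of Wikipedia concepts."""
    vec = defaultdict(float)
    for w in text_words:
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return vec

def esa_relatedness(a, b, index):
    """Cosine between the concept vectors of two texts."""
    u, v = interpret(a, index), interpret(b, index)
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With toy concepts "Cricket" and "Finance", words from the cricket domain score as related to each other and unrelated to money terms.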
P3 : An effective, low-cost measure of semantic relatedness obtained from Wikipedia links
Author : Milne, D., Witten, I.H.
Aim : Proposed the Wikipedia Link based Measure (WLM) for computing Semantic Relatedness
Approach : Uses only the hyperlink structure of Wikipedia
Results : WLM achieved a correlation of 0.68 with human judgments.
Contribution : The approach uses the hyperlink structure of Wikipedia rather than its category hierarchy (as in P11) or textual content (as in P2). Evaluation against manually defined measures of semantic relatedness shows the approach to be an effective compromise between the ease of computation of the former (P11) and the accuracy of the latter (P2).
In subsequent work this was extended to measure relatedness between entities and used in entity linking.
P3
WLM is cheaper and effective: cheaper, because Wikipedia's extensive textual content can largely be ignored, and effective, because it is more closely tied to the manually defined semantics of the resource.
Candidate articles for a term are identified using anchors - the terms or phrases in Wikipedia article texts to which links are attached.
Two SR measures:
w(a → b) = log(|W| / |T|), where W is the set of all Wikipedia articles and T the set of articles linking to b
sr(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|))), where A and B are the sets of articles that link to a and b
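The in-link measure can be sketched directly. It behaves as a distance (0 when the in-link sets are identical, larger when the articles share fewer in-links); the in-link sets and the article count are supplied by the caller:

```python
import math

def wlm_distance(A, B, W):
    """Milne-Witten link-based measure.
    A, B: sets of articles linking to a and b; W: total number of articles.
    Lower values mean the two articles are more related."""
    inter = len(A & B)
    if inter == 0:
        return float("inf")   # no shared in-links: maximally unrelated
    num = math.log(max(len(A), len(B))) - math.log(inter)
    den = math.log(W) - math.log(min(len(A), len(B)))
    return num / den if den else 0.0
```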
Literature Survey : Entity Linking in documents
P4. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In CIKM (2008)
P5. Cucerzan, S.: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In EMNLP-CoNLL(2007)
P6. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In SIGKDD (2009)
P4 : Learning to link with Wikipedia
Author : Milne, D., Witten, I.H.
Aim : Automatically cross-reference documents with Wikipedia
Approach : Learn a disambiguator using WLM and commonness as features; the disambiguator is then used to detect mentions and link them.
Results : Disambiguator (F = 97.1), link detection (F = 75), accuracy of the detected links = 76.4%.
Contribution : The key difference in approach is that disambiguation informs detection, whereas the conventional approach was detection first, then disambiguation.
P4
Disambiguation followed by mention detection!
Mention Detection : document → n-grams → remove infrequent terms and stopwords
Disambiguation : Disambiguate the cleaned n-grams using two features: relatedness (WLM) and link probability (P1). A classifier (C4.5 algorithm) learns a combination that weighs WLM more when good context is available and link probability (i.e. the most common sense) when little context is available.
Linking : The features WLM, link probability, disambiguation score, generality (minimum depth at which the topic is located in the Wikipedia category tree), and the location and spread of topics (in the Wikipedia page) are used to train a Naïve Bayes classifier to decide whether to link.
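The paper learns this combination with a classifier; the hand-weighted linear blend below is only a sketch of the underlying idea that relatedness should dominate when context is good and commonness when it is not (the weights and candidate scores are illustrative assumptions):

```python
def disambiguate(candidates, context_quality):
    """candidates: {sense: (commonness, relatedness)} for one mention.
    context_quality in [0, 1]: weight relatedness more when context is good,
    otherwise fall back towards the most common sense."""
    def score(cr):
        commonness, relatedness = cr
        return (1 - context_quality) * commonness + context_quality * relatedness
    return max(candidates, key=lambda s: score(candidates[s]))
```

With good context the rarer but highly related sense wins; with poor context the common sense wins.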
P5 : Large-Scale Named Entity Disambiguation Based on Wikipedia Data
Author : Cucerzan, S.
Aim : A large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from Wikipedia data and Web search results.
Approach : Maximize the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities.
Disambiguation score = arg max [ Σ_{n=1..N} ⟨b_n, C(d)⟩ + Σ_{n=1..N} Σ_{m=1..N} ⟨b_n.T, b_m.T⟩ ]
(b_n : extended vector of the candidate entity assigned to mention n; C(d) : context vector of document d; T : category tags)
Result: The implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.
Contribution : The first wikification system to map all named entities in a text simultaneously, exploiting the coherence among entities when disambiguating detected mentions. Evidence from the context of a mention is combined with evidence from its category tags.
P5
Contextual information extracted from Wikipedia includes
1. the known entities (most articles in Wikipedia are associated to an entity/concept),
2. their entity class when available (Person, Location, Organization, and Miscellaneous),
3. their known surface forms (terms that are used to mention the entities in text),
4. contextual evidence (words or other entities that describe or co-occur with an entity), and
5. category tags (which describe the topics to which an entity belongs).
Mention Detection – Sentence and entity boundary identification and entity type (PER, ORG, LOC, OTH)
Disambiguation - a vectorial representation of the processed document is compared with the vectorial representations of the Wikipedia entities.
P5
In mention detection, NEs are identified and the system retrieves all possible entity disambiguations for each NE.
Wikipedia contexts that occur in the document and their category tags are aggregated into a document vector, which is subsequently compared with the Wikipedia entity vector (of categories and contexts) of each possible entity disambiguation.
Choose the assignment of entities to surface forms that maximizes the similarity between the document vector and the entity vectors:
Disambiguation score = arg max [ Σ_{n=1..N} ⟨b_n, C(d)⟩ + Σ_{n=1..N} Σ_{m=1..N} ⟨b_n.T, b_m.T⟩ ]
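A greedy per-mention sketch of this agreement maximization (the paper optimizes all mentions jointly; a plain dot product on sparse vectors stands in for its similarity, and all vectors here are illustrative):

```python
def dot(u, v):
    """Dot product of two sparse vectors given as {term: weight} dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def best_assignment(doc_vec, candidates_per_mention, similarity):
    """For each mention, pick the candidate entity whose (context + category)
    vector agrees most with the aggregated document vector.
    Greedy per-mention choice is a simplification of the joint objective."""
    return {m: max(cands, key=lambda e: similarity(doc_vec, cands[e]))
            for m, cands in candidates_per_mention.items()}
```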
P6 : Collective annotation of Wikipedia entities in web text
Author : Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.
Aim : Link entity mentions on Web pages to entities in Wikipedia.
Approach : This paper proposes a general collective disambiguation approach. On the premise that coherent documents refer to entities from one or a few related topics or domains, the authors propose formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. The proposed solution is based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.
Result: In experiments involving over a hundred manually-annotated Web pages and tens of thousands of entity mentions, the approach significantly outperforms other existing algorithms.
Contribution : They built a manually curated dataset for evaluating EL and achieved F1 = 69.69. Both P4 and P5 avoid direct joint optimization of all spot labels, which is done here. The system achieved higher disambiguation accuracy, though at higher computational cost.
P6
Wikipedia is preprocessed so that each page corresponding to an entity γ is represented by four fields.
• Text from the first descriptive paragraph of γ.
• Text from the whole page for γ.
• Anchor text within Wikipedia for γ.
• Anchor text and five tokens around it.
Each field is turned into a bag (multiset) of words. Three text match scores are computed between a field of γ and s:
• Dot-product between word count vectors.
• Cosine similarity in TFIDF vector space.
• Jaccard similarity between word sets. So in all, we get 4 × 3 = 12 features.
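The 4 × 3 = 12 feature construction can be sketched as follows; the raw-count cosine stands in for the paper's TF-IDF cosine (an assumed simplification):

```python
import math

def text_scores(field_words, spot_words):
    """Three match scores between one entity field and the spot context,
    each given as a list of tokens (a bag of words)."""
    fa, fb = {}, {}
    for w in field_words:
        fa[w] = fa.get(w, 0) + 1
    for w in spot_words:
        fb[w] = fb.get(w, 0) + 1
    dotp = sum(c * fb.get(w, 0) for w, c in fa.items())
    # cosine of raw counts stands in for the paper's TF-IDF cosine
    na = math.sqrt(sum(c * c for c in fa.values()))
    nb = math.sqrt(sum(c * c for c in fb.values()))
    cos = dotp / (na * nb) if na and nb else 0.0
    sa, sb = set(field_words), set(spot_words)
    jac = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return dotp, cos, jac

def feature_vector(fields, spot_words):
    """4 fields x 3 scores = 12 features for one (entity, spot) pair."""
    feats = []
    for f in fields:
        feats.extend(text_scores(f, spot_words))
    return feats
```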
P6
Coherence Score = (1 / |S|C2) · Σ_{s ≠ s′ ∈ S} r(y_s, y_s′) + (1 / |S|) · Σ_{s ∈ S} w^T f_s(y_s)
This optimization is first converted into a 0/1 integer linear program (by also modelling the NA, i.e. no-attachment, label), then relaxed into an LP and solved using rounding.
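The local hill-climbing variant mentioned in the approach can be sketched as below. The objective mirrors the coherence-plus-local-compatibility trade-off; the relatedness and local-score functions are supplied by the caller and are illustrative, not the paper's:

```python
from itertools import combinations

def objective(assign, relatedness, local_score):
    """Average pairwise coherence plus average local spot-entity score."""
    spots = list(assign)
    pairs = [relatedness(assign[s], assign[t]) for s, t in combinations(spots, 2)]
    coh = sum(pairs) / len(pairs) if pairs else 0.0
    loc = sum(local_score(s, assign[s]) for s in spots) / len(spots)
    return coh + loc

def hill_climb(candidates, relatedness, local_score):
    """Start from each spot's best local candidate, then repeatedly apply the
    single label change that improves the joint objective (local optimum only)."""
    assign = {s: max(c, key=lambda e: local_score(s, e)) for s, c in candidates.items()}
    improved = True
    while improved:
        improved = False
        for s, cands in candidates.items():
            for e in cands:
                trial = dict(assign)
                trial[s] = e
                if objective(trial, relatedness, local_score) > \
                        objective(assign, relatedness, local_score) + 1e-12:
                    assign, improved = trial, True
    return assign
```

Starting from the locally best label, a coherent pair of entities can displace it once the pairwise term is taken into account.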
Literature Survey : Entity Linking in short texts
P7. Ferragina, P., Scaiella, U.: TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). In CIKM (2010)
P8. Meij, E., Weerkamp, W., de Rijke, M.: Adding Semantics to Microblog Posts. WSDM (2012)
P7 : TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)
Author : Ferragina, P., Scaiella, U.
Aim : Use Wikipedia's anchor-to-page mapping to cross-reference text fragments with Wikipedia pages, resolving synonymy and polysemy accurately and efficiently.
Approach : Uses Keyphraseness (P1) for mention detection and WLM (P4) for disambiguation.
Result : Good results on both long documents (F = 78.2) and short text fragments (F = 77.9), i.e. web snippets and micro-blog posts (tweets).
Contribution : Handles short texts, where context is sparse; counted as state-of-the-art among wikification systems.
P7
1. Relatedness between pages
2. Disambiguation of a mention a from candidate sense P_a via rel(P_a)
3. Linking
P8 : Adding Semantics to Microblog Posts
Author : Meij, E., Weerkamp, W., de Rijke, M.
Aim : Determine the concepts of a microblog post (tweet) through semantic linking.
Approach : Combine a concept-ranking method (high recall), which generates a ranked list of candidate concepts, with a supervised machine-learning method (high precision) to predict the concepts of a tweet.
Result : Achieved MRR = 0.708 on the published dataset.
Contribution : A reusable dataset.
P8
Approach :
1. Mention detection and link generation : obtain a ranked list of candidate concepts for each n-gram in a tweet.
2. Disambiguation : Determine which of the candidate concepts to keep, comparing methods for the initial concept-ranking step (lexical matching, language modelling) and their effectiveness; a supervised learner makes the final decision.
Features :
N-gram features : IDF(q), WIG(q), SNIL(q), SNCL, Link Probability, Keyphraseness
Concept features : Inlinks, Outlinks, Redirect, WikiCat
Tweet features : TWCT, TWCQ, URL, TAGDEF
Dataset : A manually annotated tweet-to-concept dataset was created.
Literature Survey : Entity Linking Evaluation
P9. Cornolti, M., Ferragina, P., Ciaramita, M.: A framework for benchmarking entity-annotation systems. In WWW (2013)
P10. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating entity linking with Wikipedia. Artif. Intell. (2013)
P9 : A framework for benchmarking entity-annotation systems
Author : Cornolti, M., Ferragina, P., Ciaramita, M.
Aim : Presents a benchmarking framework for fair and exhaustive comparison of entity-annotation systems.
Approach : Definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available data-sets, containing texts of various types such as news, tweets and Web pages. Problems fall into D2W, A2W, Sa2W, C2W, Sc2W or Rc2W
Result : Comparison of publicly available entity-annotation systems, namely AIDA, Illinois Wikifier, TAGME, Wikipedia Miner and DBpedia Spotlight.
Contribution : Classification of entity linking systems into D2W, A2W and the evaluation measures defined here became well accepted as standards.
P10 : Evaluating entity linking with Wikipedia
Author : Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.
Aim : Re-implement three seminal Named Entity Linking (NEL) systems and present a detailed evaluation of mention detection strategies, compared systematically on standard datasets. The results establish that co-reference and acronym handling lead to substantial improvement, and that mention detection strategies account for much of the variation between systems.
Approach : Compare the Bunescu & Pasca (P0), Cucerzan (P5) and Varma et al. (IIIT Hyderabad at TAC 2009) systems
Result: First direct comparison of three systems.
Contribution : Mention detection strategies account for much of the variation between systems compared to disambiguation methods.
P10
Review of Named Entity Disambiguation Tasks and Data Sets
NEL system = Extractor + Searcher + Disambiguator
Extractor - alias source
Searcher - effect of coreference, acronym handling and query length
Disambiguator - cosine similarity outperformed scalar product and SVM ranking
RECAP : Entity Linking and KB Enhancement
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
INPUT TEXT
Disambiguation
Literature Survey : Knowledgebase Creation
P12. Zesch, T., Gurevych, I.: Analysis of the Wikipedia category graph for NLP applications. In TextGraphs-2 Workshop at NAACL-HLT (2007)
P13. Suchanek, F.M., Kasneci, G.,Weikum, G.: Yago: A core of semantic knowledge. In WWW (2007)
RECAP : Semantic Relatedness
Semantic relatedness: how closely two words or texts are related in meaning.
word 1 /text 1 ←→ word 2 /text 2
cricket ←→ sport
Domesticated Animals ↔ Pet Mammals
Producers ↔ Actors ↔ Directors
RECAP : P11 : WikiRelate! Computing semantic relatedness using Wikipedia
Author : Strube, M., Ponzetto, S.P.
Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmark datasets.
Approach : Apply well-established semantic relatedness measures originally developed for WordNet to the open-domain encyclopaedia Wikipedia. Wikipedia also outperforms WordNet when applied to the largest available dataset designed for that purpose; the best results are obtained by integrating Google-, WordNet- and Wikipedia-based measures.
Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.
Contribution : Computing SR requires a semantic resource. WordNet was the de-facto resource for calculating semantic relatedness until this 2006 work of Strube and Ponzetto introduced the idea of using Wikipedia as a semantic source.
P12 : Analysis of the Wikipedia category graph for NLP applications
Author : Zesch, T., Gurevych, I.
Aim : Analyse the Wikipedia category graph and adapt WordNet-based semantic relatedness measures to it.
Approach : Compare the two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. Using graph-theoretic analysis of the category graph, the authors show that the Wikipedia category graph is a scale-free, small-world graph like other well-known lexical semantic networks.
Results : WordNet-based SR measures are adapted to the Wikipedia category graph. German WordNet (a.k.a. GermaNet) gives the best correlation with human judgement on SR datasets.
Contribution : First published non-English (German) SR dataset.
P12
WordNet measures include Path Length, Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lin (1998), and IIC (Intrinsic Information Content)
P13 : YAGO: A core of semantic knowledge
Author : Suchanek, F.M., Kasneci, G., Weikum, G.
Aim : YAGO is a light-weight and extensible ontology with high coverage and quality. YAGO contains more than 1 million entities and 5 million facts, including the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE).
Approach : The facts were automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods.
Results : Empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS.
Contribution : First ontology using Wikipedia + WordNet
Literature Survey : Knowledgebase Enhancement
Entity Attribute Extraction
P14. Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute extraction. SIGKDD 2006
Structured Information Extraction
Wu, F., Weld, D.S.: Autonomously semantifying Wikipedia. In CIKM (2007)
Autonomously semantifying Wikipedia
Author : Wu, F., Weld, D.S.
Aim : Automatically enhance structure in Wikipedia, such as link structure, taxonomic data and infoboxes (the KYLIN system).
Approach : A self-supervised machine-learning system. KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page, creating new infoboxes and completing existing ones.
KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier.
Results : Experiments show that the performance of KYLIN is roughly comparable to manual labelling in terms of precision and recall.
Limitation : The system and API are not publicly available.
KYLIN
KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page, creating new infoboxes and completing existing ones.
KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier.
Experiments show that the performance of KYLIN is roughly comparable to manual labelling in terms of precision and recall. On one domain, it does even better.
P14 : Text mining for product attribute extraction
Author : Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.
Aim : Extract attribute-value pairs from textual product descriptions, augmenting product databases by representing each product as a set of attribute-value pairs.
Approach : The problem is formulated as a classification task and solved using semi-supervised learning algorithms.
Results : Evaluated on apparel and sporting-goods product descriptions.
Contribution : Representing a product as attribute-value pairs is more useful than treating it as an atomic entity for applications such as demand forecasting, assortment optimization, product recommendation, and assortment comparison across retailers and manufacturers.
P14
The first system extracts implicit (semantic) attributes that are only implied by the descriptions.
Semantic Attribute Extraction:
1. Dataset : crawled from apparel retail websites
2. Define a set of semantic attributes that would be useful to extract for each product
3. A small subset (600 products) was given to a group of fashion-aware people to label
4. Create one text classifier per semantic attribute (Naïve Bayes)
5. Use the Expectation-Maximization algorithm to combine labeled and unlabeled data
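Steps 4-5 (a Naïve Bayes classifier combined with EM over labeled and unlabeled data) can be sketched as below; hard pseudo-labels stand in for the soft expectations of full EM, and the toy tokens are illustrative:

```python
import math
from collections import Counter

def train_nb(labeled):
    """labeled: list of (tokens, label) with label in {0, 1}.
    Laplace-smoothed Naive Bayes; returns a function giving P(label=1 | tokens)."""
    counts = {0: Counter(), 1: Counter()}
    prior = Counter()
    for toks, y in labeled:
        prior[y] += 1
        counts[y].update(toks)
    vocab = set(counts[0]) | set(counts[1])
    total = {y: sum(counts[y].values()) for y in (0, 1)}

    def predict_proba(toks):
        logp = {}
        for y in (0, 1):
            lp = math.log((prior[y] + 1) / (sum(prior.values()) + 2))
            for t in toks:
                lp += math.log((counts[y][t] + 1) / (total[y] + len(vocab) + 1))
            logp[y] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return math.exp(logp[1] - m) / z

    return predict_proba

def em_nb(labeled, unlabeled, iters=3):
    """EM-style semi-supervision: train on labeled data, label the unlabeled
    pool (hard labels here for brevity), retrain, and repeat."""
    model = train_nb(labeled)
    for _ in range(iters):
        pseudo = [(toks, 1 if model(toks) > 0.5 else 0) for toks in unlabeled]
        model = train_nb(labeled + pseudo)
    return model
```

The unlabeled pool sharpens the word statistics learned from the small labeled seed, which is the point of step 5.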
P14
The second system extracts explicit attributes from product descriptions: physical attributes such as size and colour that are explicitly mentioned in the text. Together, the two systems populate a knowledge base with products and attributes.
Explicit Attribute Extraction:
1. Data Collection from an internal database or from the web using web crawlers and wrappers, as done in the previous section.
2. Seed Generation either by generating them automatically or by obtaining human-labeled training data.
3. Attribute-Value Entity Extraction using a semi-supervised co-EM algorithm, because it can exploit the vast amounts of unlabelled data that can be collected cheaply.
4. Attribute-Value Pair Relationship Extraction by associating extracted attributes with corresponding extracted values. They use a dependency parser to establish links between attributes and values as well as correlation scores between words.
5. User Interaction to correct the results and to provide training data for the system to learn from, using active learning techniques.
RECAP : Literature Survey
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Applications
1. Information Extraction
2. Information Retrieval
3. Content Analysis
4. Question Answering
5. Knowledge Base Population
Our Attempts and Future Directions
1. Entity Linking
•Documents – TAC KBP Task 2014
•Tweets – NEEL Challenge @ WWW’14
•Search queries – ERD Challenge @ SIGIR’14
Our Attempts and Future Directions
2. Semantic Relatedness – Using Wikipedia Category Network.
"Extracting Semantic Knowledge from Wikipedia Category Names " in Proceedings of the 3rd Workshop on Knowledge Extraction ( AKBC 2013) at CIKM 2013
Our Attempts and Future Directions
3. Entity Attribute Extraction - from product titles
"Modeling Evolution of Product Entities" in Proceedings of the ACM SIGIR 2014 Conference
SUMMARY
India successfully sends 'MOM' to Mars
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
Disambiguation
INPUT TEXT
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Q & A
THANK YOU