Comprehensive Evaluation
PRIYA RADHAKRISHNAN, 201050035
INTRODUCTION : Motivation
Entity Linking &
Knowledgebase Enhancement
India successfully sends 'MOM' to Mars
COMPREHENSIVE EVALUATION : ENTITY LINKING & KNOWLEDGEBASE ENHANCEMENT
INTRODUCTION : Problem Statement
India successfully sends 'MOM' to Mars
Mention Detection, Disambiguation and Linking
Entity Categorisation and KB Enhancement
Outline
Definition and Background Information
Literature Survey
Some Applications of EL
Future Work
Q & A
Entity Linking and KB Enhancement
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
INPUT TEXT
Disambiguation
Literature Survey
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Literature Survey : Entity Linking
P0. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In EACL (2006)
P1. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge. In CIKM (2007)
P0 : Using encyclopedic knowledge for named entity disambiguation
Author : Bunescu, R., Pasca, M.
Aim : Detect named entities in text and disambiguate them to their denotations in Wikipedia.
Approach : Disambiguation uses an SVM kernel trained on Wikipedia context and category features.
Results : Using a dataset of 1,783K ambiguous NEs in Wikipedia, they report accuracy of 68% (on 21K queries) and 84% (on 31K queries).
Contribution : This work of Bunescu and Pasca is widely accepted as the first work in this area.
P0
1. Mention Detection - detects whether a proper name refers to a named entity included in the dictionary.
2. Disambiguation = context-article similarity + word-category similarity
score(q, e_n) = cos(q.T, e_n.T) = (q.T · e_n.T) / (||q.T|| · ||e_n.T||)
f(q, e_n) ∈ {0, 1}
Disambiguation: ê = arg max_n score(q, e_n)
3. Detect unlinkable or out-of-Wikipedia entities. Disambiguation produces a confidence score; linking succeeds if the score is above a threshold. When it is below the threshold, the entity is declared out of Wikipedia.
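The score-then-threshold scheme above can be sketched as follows. The sparse-vector representation, the candidate names and the threshold value are illustrative assumptions, not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(query_ctx, candidates, threshold=0.1):
    """Pick the arg-max candidate by cosine score; return None
    (out-of-Wikipedia) when the best score falls below the threshold."""
    best, best_score = None, -1.0
    for name, text_vec in candidates.items():
        s = cosine(query_ctx, text_vec)
        if s > best_score:
            best, best_score = name, s
    return best if best_score >= threshold else None
```

For example, a query context mentioning "mars" and "orbiter" would resolve to a Mars Orbiter Mission candidate rather than a sitcom, while a context sharing no terms with any candidate returns None.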
P1 : Wikify!: Linking documents to encyclopedic knowledge
Author : Mihalcea, R., Csomai, A.
Aim : Use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation.
Approach : Assess the keyphraseness of each candidate mention, then disambiguate using context.
Results : Wikipedia can be used to achieve state-of-the-art results on both tasks. The two methods combine into a system that automatically enriches a text with links to Wikipedia: given an input document, it identifies the important concepts in the text and links them to the corresponding Wikipedia pages.
Contribution : Link any entity mention appearing in text, not just named entities, to its denotation in Wikipedia.
In 2008, Medelyan et al. used Wikipedia as a hyperlinked encyclopaedia. They defined commonness as the ratio of the number of links with a specific target and anchor text to the total number of links with that anchor text.
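The commonness statistic can be sketched as below, assuming link statistics are available as a dictionary of (anchor, target) → link count (a hypothetical representation of the Wikipedia link table):

```python
def commonness(anchor, target, link_counts):
    """Commonness = links with this anchor text AND this target page,
    divided by all links with this anchor text.
    link_counts maps (anchor_text, target_page) -> number of links."""
    total = sum(n for (a, _), n in link_counts.items() if a == anchor)
    return link_counts.get((anchor, target), 0) / total if total else 0.0
```

With 80 of 100 "mom" anchors pointing at Mother, commonness("mom", "Mother", ...) would be 0.8.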
P1
1. Keyword extraction (mention detection). They defined a measure called keyphraseness:
Keyphraseness = (number of documents where the term was already selected as a keyword) / (total number of documents where the term appeared)
2. Link generation (link ambiguity resolved using context).
Two approaches to disambiguation were tried. The first was a knowledge-based approach using the overlap of contexts between q and e.
The second was a data-driven algorithm: a naïve Bayes classifier trained on local features (three words to the left and right, POS tags of neighbours) and global features (five keyphrases occurring at least three times in the contexts defining the word sense) to predict the correct sense.
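The keyphraseness estimate from step 1 can be sketched as below; representing each document as a set of terms plus the subset chosen as keywords is an assumed simplification:

```python
def keyphraseness(term, docs):
    """docs: list of (text, keywords) pairs, where text is the set of terms
    in a document and keywords is the subset selected as Wikipedia links.
    Keyphraseness estimates P(term is a keyword | term appears)."""
    appeared = sum(1 for text, _ in docs if term in text)
    linked = sum(1 for text, keys in docs if term in text and term in keys)
    return linked / appeared if appeared else 0.0
```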
Literature Survey : Measuring Semantic Relatedness
P11. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI (2006)
P2. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI (2007)
P3. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In AAAI (2008)
Semantic Relatedness
Semantic relatedness: how closely two words or texts are related in meaning.
word 1 /text 1 ←→ word 2 /text 2
cricket ←→ sport
Domesticated Animals ↔ Pet Mammals
Producers ↔ Actors ↔ Directors
P11 : WikiRelate! Computing semantic relatedness using Wikipedia
Author : Strube, M., Ponzetto, S.P.
Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmark datasets.
Approach : Apply well-established semantic relatedness measures originally developed for WordNet to the open-domain encyclopaedia Wikipedia.
WordNet measures include Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lesk, and Banerjee & Pedersen (2003)
Google-based SR measure = Hits(i AND j) / (Hits(i) + Hits(j) − Hits(i AND j))
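This hit-count measure is a Jaccard coefficient over page counts and is straightforward to sketch (the hit counts themselves would come from a search engine; here they are plain arguments):

```python
def google_sr(hits_i, hits_j, hits_ij):
    """Jaccard-style relatedness from page-hit counts:
    hits(i AND j) / (hits(i) + hits(j) - hits(i AND j))."""
    denom = hits_i + hits_j - hits_ij
    return hits_ij / denom if denom else 0.0
```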
P11
Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts, and Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose. The best results on this dataset are obtained by integrating Google-, WordNet- and Wikipedia-based measures.
Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.
Contribution : Computing SR requires a semantic resource. WordNet was the de-facto resource for calculating semantic relatedness until this 2006 work of Strube and Ponzetto introduced the idea of using Wikipedia as a semantic source.
P2 : Computing semantic relatedness using Wikipedia-based explicit semantic analysis
Author : Gabrilovich, E., Markovitch, S.
Aim : Create a semantic interpretation of words occurring as Wikipedia titles.
Approach : ESA represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia (i.e. Wikipedia titles). Using machine-learning techniques, the meaning of any text is explicitly represented as a weighted vector of Wikipedia-based concepts.
Relatedness of texts in this space is assessed by comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, ESA substantially improves the correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts.
P2
They build an inverted index, which maps each word to the list of concepts in which it appears. Given a text fragment, the semantic interpreter ranks all Wikipedia concepts by their relevance to the fragment, yielding a concept vector.
Semantic Relatedness (SR) of a pair of text fragments is the cosine metric between their vectors.
Results : The proposed SR measure, ESA, achieved the highest correlation with human judgments (0.75).
Contribution : The highest correlation with human judgments (0.75) thus far; however, the method requires processing the whole Wikipedia text.
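A toy sketch of the ESA pipeline described above, using raw term frequencies in place of the paper's TF-IDF weights (an assumed simplification), with the concept texts supplied as an in-memory dictionary:

```python
import math
from collections import defaultdict

def build_interpreter(concept_texts):
    """Invert concept -> words into word -> {concept: weight}.
    Weights here are raw term counts; the paper uses TF-IDF over articles."""
    index = defaultdict(dict)
    for concept, words in concept_texts.items():
        for w in words:
            index[w][concept] = index[w].get(concept, 0) + 1
    return index

def interpret(text_words, index):
    """Represent a text as a weighted vector of Wikipedia concepts."""
    vec = defaultdict(float)
    for w in text_words:
        for concept, weight in index.get(w, {}).items():
            vec[concept] += weight
    return vec

def esa_relatedness(a, b, index):
    """Cosine between the concept vectors of two texts."""
    u, v = interpret(a, index), interpret(b, index)
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With toy concepts "Cricket" and "Finance", words from the cricket domain score as related to each other and unrelated to money terms.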
P3 : An effective, low-cost measure of semantic relatedness obtained from Wikipedia links
Author : Milne, D., Witten, I.H.
Aim : Proposed the Wikipedia Link based Measure (WLM) for computing Semantic Relatedness
Approach : Uses only the hyperlink structure of Wikipedia
Results : WLM achieved a correlation of 0.68 with human judgments.
Contribution : The approach uses the hyperlink structure of Wikipedia rather than its category hierarchy (as in P11) or textual content (as in P2). Evaluation against manually defined measures of semantic relatedness shows the approach to be an effective compromise between the ease of computation of the former (P11) and the accuracy of the latter (P2).
In subsequent work this was extended to measure relatedness between entities and used in entity linking.
P3
WLM is cheaper and effective: cheaper, because Wikipedia's extensive textual content can largely be ignored, and effective, because it is more closely tied to the manually defined semantics of the resource.
Candidate articles for a term are identified using anchors - the terms or phrases in Wikipedia article texts to which links are attached.
Two SR measures:
w(a → b) = log(|W| / |T|), where W is the set of all Wikipedia articles and T the set of articles linking to b
sr(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|))), where A and B are the sets of articles that link to a and b
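The in-link measure can be sketched directly. It behaves as a distance (0 when the in-link sets are identical, larger when the articles share fewer in-links); the in-link sets and the article count are supplied by the caller:

```python
import math

def wlm_distance(A, B, W):
    """Milne-Witten link-based measure.
    A, B: sets of articles linking to a and b; W: total number of articles.
    Lower values mean the two articles are more related."""
    inter = len(A & B)
    if inter == 0:
        return float("inf")   # no shared in-links: maximally unrelated
    num = math.log(max(len(A), len(B))) - math.log(inter)
    den = math.log(W) - math.log(min(len(A), len(B)))
    return num / den if den else 0.0
```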
Literature Survey : Entity Linking in documents
P4. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In CIKM (2008)
P5. Cucerzan, S.: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In EMNLP-CoNLL(2007)
P6. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In SIGKDD (2009)
P4 : Learning to link with Wikipedia
Author : Milne, D., Witten, I.H.
Aim : Automatically cross-reference documents with Wikipedia
Approach : Learn a disambiguator using WLM and commonness as features; the disambiguator is then used to detect mentions and link them.
Results : Disambiguator (F = 97.1), link detection (F = 75), accuracy of the detected links = 76.4%.
Contribution : The key difference in approach is that disambiguation informs detection, whereas the conventional approach was detection first, then disambiguation.
P4
Disambiguation followed by mention detection!
Mention Detection : document → n-grams → remove infrequent terms and stopwords
Disambiguation : Disambiguate the cleaned n-grams using two features: relatedness (WLM) and link probability (P1). A classifier (C4.5 algorithm) learns a combination that weighs WLM more when good context is available and link probability (i.e. the most common sense) when little context is available.
Linking : The features WLM, link probability, disambiguation score, generality (minimum depth at which the topic is located in the Wikipedia category tree), and the location and spread of topics (in the Wikipedia page) are used to train a Naïve Bayes classifier to decide whether to link.
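The paper learns this combination with a classifier; the hand-weighted linear blend below is only a sketch of the underlying idea that relatedness should dominate when context is good and commonness when it is not (the weights and candidate scores are illustrative assumptions):

```python
def disambiguate(candidates, context_quality):
    """candidates: {sense: (commonness, relatedness)} for one mention.
    context_quality in [0, 1]: weight relatedness more when context is good,
    otherwise fall back towards the most common sense."""
    def score(cr):
        commonness, relatedness = cr
        return (1 - context_quality) * commonness + context_quality * relatedness
    return max(candidates, key=lambda s: score(candidates[s]))
```

With good context the rarer but highly related sense wins; with poor context the common sense wins.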
P5 : Large-Scale Named Entity Disambiguation Based on Wikipedia Data
Author : Cucerzan, S.
Aim : A large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from Wikipedia data and Web search results.
Approach : Maximize the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities.
Disambiguation score = arg max [ Σ_{n=1..N} ⟨b_n, C(d)⟩ + Σ_{n=1..N} Σ_{m=1..N} ⟨b_n.T, b_m.T⟩ ]
(b_n : extended vector of the candidate entity assigned to mention n; C(d) : context vector of document d; T : category tags)
Result: The implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.
Contribution : The first wikification system to map all named entities in a text simultaneously, exploiting the coherence among entities when disambiguating detected mentions. Evidence from the context of a mention is combined with evidence from its category tags.
P5
Contextual information extracted from Wikipedia includes
1. the known entities (most articles in Wikipedia are associated to an entity/concept),
2. their entity class when available (Person, Location, Organization, and Miscellaneous),
3. their known surface forms (terms that are used to mention the entities in text),
4. contextual evidence (words or other entities that describe or co-occur with an entity), and
5. category tags (which describe the topics to which an entity belongs).
Mention Detection – Sentence and entity boundary identification and entity type (PER, ORG, LOC, OTH)
Disambiguation - a vectorial representation of the processed document is compared with the vectorial representations of the Wikipedia entities.
P5
In mention detection, NEs are identified and the system retrieves all possible entity disambiguations for each NE.
Wikipedia contexts that occur in the document and their category tags are aggregated into a document vector, which is subsequently compared with the Wikipedia entity vector (of categories and contexts) of each possible entity disambiguation.
Choose the assignment of entities to surface forms that maximizes the similarity between the document vector and the entity vectors:
Disambiguation score = arg max [ Σ_{n=1..N} ⟨b_n, C(d)⟩ + Σ_{n=1..N} Σ_{m=1..N} ⟨b_n.T, b_m.T⟩ ]
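A greedy per-mention sketch of this agreement maximization (the paper optimizes all mentions jointly; a plain dot product on sparse vectors stands in for its similarity, and all vectors here are illustrative):

```python
def dot(u, v):
    """Dot product of two sparse vectors given as {term: weight} dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def best_assignment(doc_vec, candidates_per_mention, similarity):
    """For each mention, pick the candidate entity whose (context + category)
    vector agrees most with the aggregated document vector.
    Greedy per-mention choice is a simplification of the joint objective."""
    return {m: max(cands, key=lambda e: similarity(doc_vec, cands[e]))
            for m, cands in candidates_per_mention.items()}
```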
P6 : Collective annotation of Wikipedia entities in web text
Author : Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.
Aim : Link entity mentions on Web pages to entities in Wikipedia.
Approach : This paper proposes a general collective disambiguation approach. On the premise that coherent documents refer to entities from one or a few related topics or domains, the authors propose formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. The proposed solution is based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.
Result: In experiments involving over a hundred manually-annotated Web pages and tens of thousands of entity mentions, the approach significantly outperforms other existing algorithms.
Contribution : They built a manually curated dataset for evaluating EL and achieved F1 = 69.69. Both P4 and P5 avoid direct joint optimization of all spot labels, which is done here. The system achieved higher disambiguation accuracy, though at higher computational cost.
P6
Wikipedia is preprocessed so that each page corresponding to an entity γ is represented by four fields.
• Text from the first descriptive paragraph of γ.
• Text from the whole page for γ.
• Anchor text within Wikipedia for γ.
• Anchor text and five tokens around it.
Each field is turned into a bag (multiset) of words. Three text match scores are computed between a field of γ and s:
• Dot-product between word count vectors.
• Cosine similarity in TFIDF vector space.
• Jaccard similarity between word sets. So in all, we get 4 × 3 = 12 features.
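The 4 × 3 = 12 feature construction can be sketched as follows; the raw-count cosine stands in for the paper's TF-IDF cosine (an assumed simplification):

```python
import math

def text_scores(field_words, spot_words):
    """Three match scores between one entity field and the spot context,
    each given as a list of tokens (a bag of words)."""
    fa, fb = {}, {}
    for w in field_words:
        fa[w] = fa.get(w, 0) + 1
    for w in spot_words:
        fb[w] = fb.get(w, 0) + 1
    dotp = sum(c * fb.get(w, 0) for w, c in fa.items())
    # cosine of raw counts stands in for the paper's TF-IDF cosine
    na = math.sqrt(sum(c * c for c in fa.values()))
    nb = math.sqrt(sum(c * c for c in fb.values()))
    cos = dotp / (na * nb) if na and nb else 0.0
    sa, sb = set(field_words), set(spot_words)
    jac = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return dotp, cos, jac

def feature_vector(fields, spot_words):
    """4 fields x 3 scores = 12 features for one (entity, spot) pair."""
    feats = []
    for f in fields:
        feats.extend(text_scores(f, spot_words))
    return feats
```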
P6
Coherence Score = (1 / |S|C2) · Σ_{s ≠ s′ ∈ S} r(y_s, y_s′) + (1 / |S|) · Σ_{s ∈ S} w^T f_s(y_s)
This optimization is first converted into a 0/1 integer linear program (by also modelling the NA, i.e. no-attachment, label), then relaxed into an LP and solved using rounding.
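The local hill-climbing variant mentioned in the approach can be sketched as below. The objective mirrors the coherence-plus-local-compatibility trade-off; the relatedness and local-score functions are supplied by the caller and are illustrative, not the paper's:

```python
from itertools import combinations

def objective(assign, relatedness, local_score):
    """Average pairwise coherence plus average local spot-entity score."""
    spots = list(assign)
    pairs = [relatedness(assign[s], assign[t]) for s, t in combinations(spots, 2)]
    coh = sum(pairs) / len(pairs) if pairs else 0.0
    loc = sum(local_score(s, assign[s]) for s in spots) / len(spots)
    return coh + loc

def hill_climb(candidates, relatedness, local_score):
    """Start from each spot's best local candidate, then repeatedly apply the
    single label change that improves the joint objective (local optimum only)."""
    assign = {s: max(c, key=lambda e: local_score(s, e)) for s, c in candidates.items()}
    improved = True
    while improved:
        improved = False
        for s, cands in candidates.items():
            for e in cands:
                trial = dict(assign)
                trial[s] = e
                if objective(trial, relatedness, local_score) > \
                        objective(assign, relatedness, local_score) + 1e-12:
                    assign, improved = trial, True
    return assign
```

Starting from the locally best label, a coherent pair of entities can displace it once the pairwise term is taken into account.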
Literature Survey : Entity Linking in short texts
P7. Ferragina, P., Scaiella, U.: TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). In CIKM (2010)
P8. Meij, E., Weerkamp, W., de Rijke, M.: Adding Semantics to Microblog Posts. WSDM (2012)
P7 : TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)
Author : Ferragina, P., Scaiella, U.
Aim : Use Wikipedia's anchor-to-page mapping to cross-reference text fragments with Wikipedia pages, resolving synonymy and polysemy accurately and efficiently.
Approach : Uses Keyphraseness (P1) for mention detection and WLM (P4) for disambiguation.
Result : Good results on both long documents (F = 78.2) and short text fragments (F = 77.9), i.e. web snippets and micro-blog posts (tweets).
Contribution : Handles short texts, where context is sparse; counted as state-of-the-art among wikification systems.
P7
1. Relatedness between pages
2. Disambiguation of a mention a from candidate sense P_a via rel(P_a)
3. Linking
P8 : Adding Semantics to Microblog Posts
Author : Meij, E., Weerkamp, W., de Rijke, M.
Aim : Determine the concepts of a microblog post (tweet) through semantic linking.
Approach : Combine a concept-ranking method (high recall), which generates a ranked list of candidate concepts, with a supervised machine-learning method (high precision) to predict the concepts of a tweet.
Result : Achieved MRR = 0.708 on the published dataset.
Contribution : A reusable dataset.
P8
Approach :
1. Mention detection and link generation : obtain a ranked list of candidate concepts for each n-gram in a tweet.
2. Disambiguation : Determine which of the candidate concepts to keep, comparing methods for the initial concept-ranking step (lexical matching, language modelling) and their effectiveness; a supervised learner makes the final decision.
Features :
N-gram features : IDF(q), WIG(q), SNIL(q), SNCL, Link Probability, Keyphraseness
Concept features : Inlinks, Outlinks, Redirect, WikiCat
Tweet features : TWCT, TWCQ, URL, TAGDEF
Dataset : A manually annotated tweet-to-concept dataset was created.
Literature Survey : Entity Linking Evaluation
P9. Cornolti, M., Ferragina, P., Ciaramita, M.: A framework for benchmarking entity-annotation systems. In WWW (2013)
P10. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating entity linking with Wikipedia. Artif. Intell. (2013)
P9 : A framework for benchmarking entity-annotation systems
Author : Cornolti, M., Ferragina, P., Ciaramita, M.
Aim : Presents a benchmarking framework for fair and exhaustive comparison of entity-annotation systems.
Approach : Definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available data-sets, containing texts of various types such as news, tweets and Web pages. Problems fall into D2W, A2W, Sa2W, C2W, Sc2W or Rc2W
Result : Comparison of publicly available entity-annotation systems, namely AIDA, Illinois Wikifier, TAGME, Wikipedia Miner and DBpedia Spotlight.
Contribution : Classification of entity linking systems into D2W, A2W and the evaluation measures defined here became well accepted as standards.
P10 : Evaluating entity linking with Wikipedia
Author : Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.
Aim : Re-implement three seminal Named Entity Linking (NEL) systems and present a detailed evaluation of mention detection strategies, compared systematically on standard datasets. The results establish that co-reference and acronym handling lead to substantial improvement, and that mention detection strategies account for much of the variation between systems.
Approach : Compare the Bunescu & Pasca (P0), Cucerzan (P5) and Varma et al. (IIIT Hyderabad at TAC 2009) systems
Result: First direct comparison of three systems.
Contribution : Mention detection strategies account for much of the variation between systems compared to disambiguation methods.
P10
Review of Named Entity Disambiguation Tasks and Data Sets
NEL system = Extractor + Searcher + Disambiguator
Extractor - alias source
Searcher - effect of coreference, acronym handling and query length
Disambiguator - cosine similarity outperformed scalar product and SVM ranking
RECAP : Entity Linking and KB Enhancement
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
INPUT TEXT
Disambiguation
Literature Survey : Knowledgebase Creation
P12. Zesch, T., Gurevych, I.: Analysis of the Wikipedia category graph for NLP applications. In TextGraphs-2 Workshop at NAACL-HLT (2007)
P13. Suchanek, F.M., Kasneci, G.,Weikum, G.: Yago: A core of semantic knowledge. In WWW (2007)
RECAP : Semantic Relatedness
Semantic relatedness: how closely two words or texts are related in meaning.
word 1 /text 1 ←→ word 2 /text 2
cricket ←→ sport
Domesticated Animals ↔ Pet Mammals
Producers ↔ Actors ↔ Directors
RECAP : P11 : WikiRelate! Computing semantic relatedness using Wikipedia
Author : Strube, M., Ponzetto, S.P.
Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmark datasets.
Approach : Apply well-established semantic relatedness measures originally developed for WordNet to the open-domain encyclopaedia Wikipedia. Wikipedia also outperforms WordNet when applied to the largest available dataset designed for that purpose; the best results are obtained by integrating Google-, WordNet- and Wikipedia-based measures.
Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.
Contribution : Computing SR requires a semantic resource. WordNet was the de-facto resource for calculating semantic relatedness until this 2006 work of Strube and Ponzetto introduced the idea of using Wikipedia as a semantic source.
P12 : Analysis of the Wikipedia category graph for NLP applications
Author : Zesch, T., Gurevych, I.
Aim : Analyse the Wikipedia category graph and adapt WordNet-based semantic relatedness measures to it.
Approach : Compare the two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. Using graph-theoretic analysis of the category graph, the authors show that the Wikipedia category graph is a scale-free, small-world graph like other well-known lexical semantic networks.
Results : WordNet-based SR measures are adapted to the Wikipedia category graph. German WordNet (a.k.a. GermaNet) gives the best correlation with human judgement on SR datasets.
Contribution : First published non-English (German) SR dataset.
P12
WordNet measures include Path Length, Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lin (1998), and IIC (Intrinsic Information Content)
P13 : YAGO: A core of semantic knowledge
Author : Suchanek, F.M., Kasneci, G., Weikum, G.
Aim : YAGO is a light-weight and extensible ontology with high coverage and quality. YAGO contains more than 1 million entities and 5 million facts, including the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE).
Approach : The facts were automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods.
Results : Empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS.
Contribution : First ontology using Wikipedia + WordNet
Literature Survey : Knowledgebase Enhancement
Entity Attribute Extraction
P14. Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute extraction. SIGKDD 2006
Structured Information Extraction
Wu, F., Weld, D.S.: Autonomously semantifying Wikipedia. In CIKM (2007)
Autonomously semantifying Wikipedia
Author : Wu, F., Weld, D.S.
Aim : Automatically enhance structure in Wikipedia, such as link structure, taxonomic data and infoboxes (the KYLIN system).
Approach : A self-supervised machine-learning system. KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page, creating new infoboxes and completing existing ones.
KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier.
Results : Experiments show that the performance of KYLIN is roughly comparable to manual labelling in terms of precision and recall.
Limitation : The system and API are not publicly available.
KYLIN
KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page, creating new infoboxes and completing existing ones.
KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier.
Experiments show that the performance of KYLIN is roughly comparable to manual labelling in terms of precision and recall. On one domain, it does even better.
P14 : Text mining for product attribute extraction
Author : Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.
Aim : Extract attribute-value pairs from textual product descriptions, augmenting product databases by representing each product as a set of attribute-value pairs.
Approach : The problem is formulated as a classification task and solved using semi-supervised learning algorithms.
Results : Evaluated on apparel and sporting-goods product descriptions.
Contribution : Representing a product as attribute-value pairs is more useful than treating it as an atomic entity for applications such as demand forecasting, assortment optimization, product recommendation, and assortment comparison across retailers and manufacturers.
P14
The first system extracts implicit (semantic) attributes that are only implied by the descriptions.
Semantic Attribute Extraction:
1. Dataset : crawled from apparel retail websites
2. Define a set of semantic attributes that would be useful to extract for each product
3. A small subset (600 products) was given to a group of fashion-aware people to label
4. Create one text classifier per semantic attribute (Naïve Bayes)
5. Use the Expectation-Maximization algorithm to combine labeled and unlabeled data
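Steps 4-5 (a Naïve Bayes classifier combined with EM over labeled and unlabeled data) can be sketched as below; hard pseudo-labels stand in for the soft expectations of full EM, and the toy tokens are illustrative:

```python
import math
from collections import Counter

def train_nb(labeled):
    """labeled: list of (tokens, label) with label in {0, 1}.
    Laplace-smoothed Naive Bayes; returns a function giving P(label=1 | tokens)."""
    counts = {0: Counter(), 1: Counter()}
    prior = Counter()
    for toks, y in labeled:
        prior[y] += 1
        counts[y].update(toks)
    vocab = set(counts[0]) | set(counts[1])
    total = {y: sum(counts[y].values()) for y in (0, 1)}

    def predict_proba(toks):
        logp = {}
        for y in (0, 1):
            lp = math.log((prior[y] + 1) / (sum(prior.values()) + 2))
            for t in toks:
                lp += math.log((counts[y][t] + 1) / (total[y] + len(vocab) + 1))
            logp[y] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return math.exp(logp[1] - m) / z

    return predict_proba

def em_nb(labeled, unlabeled, iters=3):
    """EM-style semi-supervision: train on labeled data, label the unlabeled
    pool (hard labels here for brevity), retrain, and repeat."""
    model = train_nb(labeled)
    for _ in range(iters):
        pseudo = [(toks, 1 if model(toks) > 0.5 else 0) for toks in unlabeled]
        model = train_nb(labeled + pseudo)
    return model
```

The unlabeled pool sharpens the word statistics learned from the small labeled seed, which is the point of step 5.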
P14
The second system extracts explicit attributes from product descriptions: physical attributes such as size and colour that are explicitly mentioned in the text. Together, the two systems populate a knowledge base with products and attributes.
Explicit Attribute Extraction:
1. Data Collection from an internal database or from the web using web crawlers and wrappers, as done in the previous section.
2. Seed Generation either by generating them automatically or by obtaining human-labeled training data.
3. Attribute-Value Entity Extraction using a semi-supervised co-EM algorithm, because it can exploit the vast amounts of unlabelled data that can be collected cheaply.
4. Attribute-Value Pair Relationship Extraction by associating extracted attributes with corresponding extracted values. They use a dependency parser to establish links between attributes and values as well as correlation scores between words.
5. User Interaction to correct the results and to provide training data for the system to learn from, using active learning techniques.
RECAP : Literature Survey
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Applications
1. Information Extraction
2. Information Retrieval
3. Content Analysis
4. Question Answering
5. Knowledge Base Population
Our Attempts and Future Directions
1. Entity Linking
•Documents – TAC KBP Task 2014
•Tweets – NEEL Challenge @ WWW’14
•Search queries – ERD Challenge @ SIGIR’14
Our Attempts and Future Directions
2. Semantic Relatedness – Using Wikipedia Category Network.
"Extracting Semantic Knowledge from Wikipedia Category Names " in Proceedings of the 3rd Workshop on Knowledge Extraction ( AKBC 2013) at CIKM 2013
Our Attempts and Future Directions
3. Entity Attribute Extraction - from product titles
"Modeling Evolution of Product Entities" in Proceedings of the ACM SIGIR 2014 Conference
SUMMARY
India successfully sends 'MOM' to Mars
Mention Detection
Knowledgebase Construction
Entity Categorization
Entity Linking
HYPERLINKED TEXT
Disambiguation
INPUT TEXT
1. Entity Linking
2. Measuring Semantic Relatedness
3. Entity Linking in documents
4. Entity Linking in short texts
5. Entity Linking Evaluation
6. Knowledgebase Creation
7. Knowledgebase Enhancement
Q & A
THANK YOU