How to Make Your Content Smarter

Enabling Networked Knowledge

Digital Enterprise Research Institute

Entity Detection and Consolidation: How to Make Your Content Smarter?

Bianca Pereira, Paul Buitelaar

Unit for Natural Language Processing

Digital Enterprise Research Institute, National University of Ireland, Galway

Acknowledgements: This work has been funded by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

Motivation:

Information available online can be consumed both by human readers and by computer programs. Despite this, most data on the Web is published in a form that supports only one of these two kinds of reading.

Research Questions:

•  How can entity mentions be identified in text?

Entity Detection

•  How can we determine which real-world entity a mention in the text refers to, and find the same entity across different texts?

Entity Consolidation

Research Contribution:

•  Quality assessment of several Linked Data datasets currently available on the Web.

•  Identification of common classes and properties used for Named Entities (entities identified by Proper Names) in Linked Data datasets.

•  Development of a framework adaptable to different Linked Data datasets.

Aim:

Link human-readable and machine-processable content in order to enable machine understanding of a given text and to let humans track entities across texts.

Proposed Solution:

•  Identification of different mentions of real-world entities in natural language text and their unified, unambiguous linking to an external database.

•  Use of the available, and growing, Linked Data cloud as the background database.

• Development of AELA, a framework for entity detection and consolidation.
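The two steps named above, detection and consolidation, can be illustrated with a minimal sketch. This is not AELA's method, which the poster does not describe; it assumes a simple label lookup for detection and a context-word overlap for choosing among candidates. The `LABEL_INDEX` and `CONTEXT` tables, the function names, and the `example.org` URIs are all hypothetical stand-ins for a real Linked Data dataset.

```python
import re

BASE = "http://example.org/entity/"  # hypothetical namespace, for illustration only

# Hypothetical background database: lower-cased labels mapped to candidate URIs.
LABEL_INDEX = {
    "paris": [BASE + "Paris_City", BASE + "Paris_Film"],
    "the matrix": [BASE + "The_Matrix_Film"],
}

# Hypothetical context words describing each entity.
CONTEXT = {
    BASE + "Paris_City": {"france", "city", "capital"},
    BASE + "Paris_Film": {"film", "movie", "director"},
    BASE + "The_Matrix_Film": {"film", "movie", "neo"},
}


def detect_mentions(text):
    """Entity detection: find every known label in the text.

    Returns sorted (start, end, surface) tuples. Offsets are computed on the
    lower-cased text, which is safe for ASCII input.
    """
    lowered = text.lower()
    mentions = []
    for label in LABEL_INDEX:
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, lowered):
            surface = text[match.start():match.end()]
            mentions.append((match.start(), match.end(), surface))
    return sorted(mentions)


def consolidate(text, mentions):
    """Entity consolidation: link each mention to a single URI.

    Picks the candidate whose context words overlap most with the words of
    the text; ties go to the first candidate listed.
    """
    words = set(re.findall(r"\w+", text.lower()))
    links = []
    for _start, _end, surface in mentions:
        candidates = LABEL_INDEX[surface.lower()]
        best = max(candidates, key=lambda uri: len(CONTEXT[uri] & words))
        links.append((surface, best))
    return links


if __name__ == "__main__":
    sentence = "Paris is the capital of France."
    print(consolidate(sentence, detect_mentions(sentence)))
```

Because every mention of the same entity resolves to the same URI, the URI can serve as the key for tracking that entity across different texts.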

Future Research:

•  Detection of entities mentioned by generalized names (genes, diseases or words such as ambulance, coffee machine, airplane, etc.).

•  Application of AELA to texts from different domains.

•  Evaluation of other current methods when applied to AELA.

AELA:

AELA Framework

•  Experiments on the film and music domains.

•  Adaptive to the semantic structure of the Linked Data (LD) dataset.

Preliminary Results:

•  Music Domain (Jamendo dataset)

F-Score: 0.54

•  Films Domain (Linked Movie Database dataset)

F-Score: 0.87
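For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed from a set of predicted entity links and a set of gold-standard links. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the poster does not state which variant or matching criterion was used. The function name and the toy data are hypothetical and unrelated to the reported results.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of links.

    Each link is any hashable item, for example a (mention, URI) pair.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data: three of the four predicted links match the gold standard.
    gold = {("m1", "A"), ("m2", "B"), ("m3", "C"), ("m4", "D")}
    predicted = {("m1", "A"), ("m2", "B"), ("m3", "C"), ("m4", "X")}
    print(precision_recall_f1(predicted, gold))
```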
