Post on 13-Jan-2016
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell
Introduction
► No explicit semantic information about data and objects is presented in most Web pages.
► The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
    Ex: The entity “UGA” pointing to http://www.uga.edu
    Using entity disambiguation
Introduction
► We use background knowledge in the form of an ontology
► Our contributions are two-fold:
    A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
    An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers
Background
► Sesame Repository
    Open source RDF repository
    We chose Sesame, as opposed to Jena and BRAHMS, because of its ability to store large amounts of information by not being dependent on memory storage alone
    We chose to use Sesame’s native mode because our dataset is typically too large to fit into memory and using the database option is too slow in update operations
Dataset 1: DBLP Ontology
► DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
    3,079,414 entities (447,121 are authors)
    We used a SAX parser to parse the DBLP XML file that is available online
    Created relationships such as “co-author”
    Added information regarding affiliations
    Added information regarding areas of interest
    Added alternate spellings for international characters
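The parsing step could look roughly like the sketch below, which uses Python's standard `xml.sax` module to read DBLP-style publication records and derive "co-author" pairs. The slides do not show the original implementation, so the handler class, the sample records, and the restriction to two publication tags are all assumptions made for illustration.

```python
import itertools
import xml.sax

# Hypothetical sample in the style of the DBLP XML file.
SAMPLE = b"""<dblp>
  <article key="journals/x/1">
    <author>Alice Smith</author>
    <author>Bob Jones</author>
    <title>Paper One</title>
  </article>
  <inproceedings key="conf/y/2">
    <author>Alice Smith</author>
    <title>Paper Two</title>
  </inproceedings>
</dblp>"""

PUBLICATION_TAGS = {"article", "inproceedings"}  # subset, for illustration


class CoAuthorHandler(xml.sax.ContentHandler):
    """Collects co-author pairs from DBLP-style publication records."""

    def __init__(self):
        super().__init__()
        self.coauthors = set()  # unordered pairs stored as sorted tuples
        self._authors = []      # authors of the record being read
        self._buffer = None     # text chunks of the current <author>

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self._authors = []
        elif name == "author":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._buffer).strip())
            self._buffer = None
        elif name in PUBLICATION_TAGS:
            unique = sorted(set(self._authors))
            for a, b in itertools.combinations(unique, 2):
                self.coauthors.add((a, b))


handler = CoAuthorHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.coauthors)
```

A streaming parser suits this job because the full DBLP file is large; the handler keeps only the authors of the record currently being read.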
Dataset 2: DBWorld Posts
► DBWorld
    Mailing list of information for upcoming conferences related to the databases field
    Created an HTML scraper that downloads everything with “Call for Papers”, “Call for Participation” or “CFP” in its subject
    Unstructured text
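The subject filter used by the scraper can be sketched as a single regular expression. Only the three phrases come from the slide; case-insensitive matching and the function name are assumptions.

```python
import re

# Phrases named on the slide; ignoring case is an assumption.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_cfp_subject(subject: str) -> bool:
    """Return True if a post subject looks like a call for papers."""
    return CFP_PATTERN.search(subject) is not None


subjects = [
    "CFP: Workshop on Semantic Data",      # hypothetical subjects
    "Second Call for Papers - Some Conf",
    "Job opening: postdoc position",
]
selected = [s for s in subjects if is_cfp_subject(s)]
print(selected)
```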
Overview of System Architecture
Approach
► Entity Names
    Entity attribute that represents the name of the entity
    Can contain more than one name
Approach
► Text-proximity Relationships
    Relationships that can be expected to be in text-proximity of the entity
    Nearness measured in character spaces
Approach
► Text Co-occurrence Relationships
    Similar to text-proximity relationships except proximity is not relevant
Approach
► Popular Entities
    The intuition behind this is to specify relationships that will bias the right entity to be the most popular entity
    This should be used with care, depending on the domain
    DBLP ex: the number of papers the entity has authored
Approach
► Semantic Relationships
    Entities can be related to one another through their collaboration network
    DBLP ex: Entities are related to one another through co-author relationships
Algorithm
► The idea is to spot entity names in text and assign each potential match a confidence score
► This confidence score will be adjusted as the algorithm progresses and represents the certainty that this spotted entity represents a particular object in the ontology
Algorithm – Flow Chart
[Flow chart: Start → spot entity names → if found, initiate a confidence score and store in Candidate Entities (repeat while more entities remain) → spot text-proximity relationships → if found, adjust confidence score (repeat while more candidate entities remain)]
Algorithm – Flow Chart
[Flow chart, continued: spot text co-occurrence relationships → if found, adjust confidence score (repeat while more candidate entities remain) → adjust confidence score based on number of popular entity relationships → search for semantic relationships → if found, adjust confidence score (repeat while more candidate entities remain) → if a candidate entity rises above the threshold, repeat the semantic step (iterative step); otherwise End]
Algorithm
► Spotting Entity Names
    Search document for entity names within the ontology
    Each of the entities in the ontology that matches a name found in the document becomes a candidate entity
    Assign initial confidence scores for candidate entities based on these formulas:
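The formulas themselves are not reproduced in this text, so the sketch below illustrates only the spotting step and uses a fixed placeholder as the initial score. The mini-ontology, the `Candidate` class, and `INITIAL_CONFIDENCE` are hypothetical and are not the author's definitions.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str
    name: str
    offset: int        # character offset of the match in the document
    confidence: float


# Hypothetical mini-ontology: entity URI -> names (an entity may have several).
ONTOLOGY_NAMES = {
    "ex:author/1": ["John Smith", "J. Smith"],
    "ex:author/2": ["John Smith"],
    "ex:author/3": ["Mary Jones"],
}

INITIAL_CONFIDENCE = 0.1  # placeholder only; NOT the formula from the slides


def spot_entity_names(text, ontology_names):
    """Return one candidate per (entity, name occurrence) found in the text."""
    candidates = []
    for uri, names in ontology_names.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                candidates.append(
                    Candidate(uri, name, match.start(), INITIAL_CONFIDENCE)
                )
    return candidates


text = "Program chairs: John Smith and Mary Jones."
candidates = spot_entity_names(text, ONTOLOGY_NAMES)
for c in candidates:
    print(c)
```

Two ontology entities share the name "John Smith", so the same text span yields two candidates. The later steps exist to resolve exactly this ambiguity.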
Algorithm
► Spotting Literal Values of Text-proximity Relationships
    Only consider relationships from candidate entities
    Substantially increase confidence score if within proximity
    Ex: Entity affiliation found next to entity name
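A minimal sketch of the proximity check, with nearness measured in characters around the spotted name. The window size and the size of the boost are assumptions; the slides say only that the increase is substantial.

```python
PROXIMITY_WINDOW = 50   # characters either side of the name; assumed value
PROXIMITY_BOOST = 0.5   # "substantial" increase; assumed value


def apply_text_proximity(text, name, literal, confidence,
                         window=PROXIMITY_WINDOW):
    """Boost confidence if `literal` lies entirely within `window`
    characters of the first occurrence of `name`."""
    offset = text.find(name)
    if offset == -1:
        return confidence
    start = max(0, offset - window)
    end = offset + len(name) + window
    if literal in text[start:end]:
        return confidence + PROXIMITY_BOOST
    return confidence


# Hypothetical document: one affiliation is next to the name, one is far away.
text = ("Chair: John Smith (University of Georgia). "
        + "x" * 200 + " Stanford University")
near_score = apply_text_proximity(text, "John Smith", "University of Georgia", 0.1)
far_score = apply_text_proximity(text, "John Smith", "Stanford University", 0.1)
print(near_score, far_score)
```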
Algorithm
► Spotting Literal Values of Text Co-occurrence Relationships
    Only consider relationships from candidate entities
    Increase confidence score if found within the document (location does not matter)
    Ex: Entity’s areas of interest found in the document
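Co-occurrence drops the distance test, so a sketch reduces to a membership check over the whole document. Applying one boost when any literal is found, the boost value, and ignoring case are all assumptions.

```python
COOCCURRENCE_BOOST = 0.2  # assumed value; smaller than a proximity boost


def apply_co_occurrence(text, literals, confidence):
    """Boost confidence once if any literal occurs anywhere in the text."""
    lowered = text.lower()
    if any(literal.lower() in lowered for literal in literals):
        return confidence + COOCCURRENCE_BOOST
    return confidence


# Hypothetical call-for-papers text and areas of interest.
document = "Topics include query optimization and Semantic Web applications."
score_hit = apply_co_occurrence(document, ["semantic web", "robotics"], 0.1)
score_miss = apply_co_occurrence(document, ["robotics"], 0.1)
print(score_hit, score_miss)
```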
Algorithm
► Using Popular Entities
    Slightly increase the confidence score of candidate entities based on the number of popular entity relationships
    Valuable when used as a tie-breaker
    Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
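The tie-breaking effect can be illustrated with two otherwise equal candidates. The threshold of 15 publications is the slide's example; the boost value and the candidate data are assumptions.

```python
PUBLICATION_THRESHOLD = 15  # from the slide's DBLP example
POPULARITY_BOOST = 0.05     # "slight" increase; assumed value


def apply_popularity(confidence, publication_count):
    """Slightly boost candidates with more than the threshold of publications."""
    if publication_count > PUBLICATION_THRESHOLD:
        return confidence + POPULARITY_BOOST
    return confidence


# Two hypothetical candidates tied on every other clue.
tied = {"ex:author/1": (0.4, 40), "ex:author/2": (0.4, 3)}
scores = {uri: apply_popularity(conf, pubs) for uri, (conf, pubs) in tied.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Keeping the boost small means popularity breaks ties without overriding stronger evidence such as a nearby affiliation.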
Algorithm
► Using Semantic Relationships
    Use relationships among entities to boost confidence scores of candidate entities
    Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If another candidate entity is found and is below the threshold, that entity’s confidence score is increased
Algorithm
► If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes
► This is an iterative step and always converges
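One way to sketch the iterative step is shown below. Candidates above the threshold boost related candidates that are still below it, and any candidate that newly crosses the threshold is analyzed in the next pass. The threshold, the boost, and the rule that each candidate is analyzed at most once are assumptions of this sketch. That last rule is what guarantees termination here, and it is not necessarily the argument behind the slide's convergence claim.

```python
def propagate(confidence, related, threshold=0.5, boost=0.3):
    """Iteratively boost below-threshold candidates related to
    above-threshold ones, until no new candidate crosses the threshold.

    confidence: dict mapping entity -> score
    related:    dict mapping entity -> set of related entities
    """
    scores = dict(confidence)
    analyzed = set()
    while True:
        frontier = [e for e, s in scores.items()
                    if s > threshold and e not in analyzed]
        if not frontier:
            return scores  # stable: nothing new rose above the threshold
        for entity in frontier:
            analyzed.add(entity)
            for other in related.get(entity, ()):
                if other in scores and scores[other] <= threshold:
                    scores[other] += boost


# Hypothetical co-author chain A - B - C; D is unrelated.
initial = {"A": 0.8, "B": 0.3, "C": 0.3, "D": 0.3}
coauthors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
final = propagate(initial, coauthors)
print(final)
```

In this example A lifts B above the threshold in the first pass, B then lifts C in the second, and D is never touched because it has no relationship to any confident candidate.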
Output
► XML format
    URI – the DBLP URL of the entity
    Entity name
    Confidence score
    Character offset – the location of the entity in the document
► This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
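A record with these four fields could be serialized as in the sketch below, using Python's standard `xml.etree.ElementTree`. The slides list the fields but not the schema, so the element names and the example URI are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xml(entities):
    """Serialize disambiguated entities to a simple XML string.
    Element names are illustrative, not the original schema."""
    root = ET.Element("entities")
    for e in entities:
        node = ET.SubElement(root, "entity")
        ET.SubElement(node, "uri").text = e["uri"]
        ET.SubElement(node, "name").text = e["name"]
        ET.SubElement(node, "confidence").text = f'{e["confidence"]:.2f}'
        ET.SubElement(node, "offset").text = str(e["offset"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical result for one spotted entity.
xml_text = to_xml([
    {"uri": "http://example.org/author/1", "name": "John Smith",
     "confidence": 0.9, "offset": 16},
])
print(xml_text)
```

Because each record carries the character offset, a converter to Microformats or RDFa can wrap the original text span without searching for the name again.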
Output - Microformat
Evaluation: Gold Standard Set
► We evaluate our system using a gold standard set of documents
    20 manually disambiguated documents
    Randomly chose 20 consecutive posts from DBWorld
    We use precision and recall as the measurement of evaluation for our system
Evaluation: Precision & Recall
► We define set A as the set of unique names identified using the disambiguated dataset
► We define set B as the set of entities found by our method
► The intersection of these sets represents the set of entities correctly identified by our method
Evaluation: Precision & Recall
► Precision is the proportion of correctly disambiguated entities with regard to B
► Recall is the proportion of correctly disambiguated entities with regard to A
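With A and B as sets, these definitions amount to |A ∩ B| / |B| for precision and |A ∩ B| / |A| for recall. The sketch below computes both on toy sets with hypothetical identifiers, then repeats the arithmetic with the counts reported in the results for the full gold standard set (602 correct, 620 found, 758 total).

```python
def precision_recall(gold, found):
    """Return (precision, recall) for two sets of entity identifiers.
    gold  = set A (manually disambiguated), found = set B (system output)."""
    correct = gold & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy sets with hypothetical identifiers.
A = {"e1", "e2", "e3", "e4"}
B = {"e1", "e2", "e5"}
p, r = precision_recall(A, B)  # 2 of 3 found are correct; 2 of 4 gold recovered

# Counts reported for the entire gold standard set.
reported_precision = 602 / 620
reported_recall = 602 / 758
print(f"{reported_precision:.1%} {reported_recall:.1%}")  # 97.1% 79.4%
```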
Evaluation: Results
► Precision and recall when compared to the entire gold standard set:

    Correct Disambiguation   Found Entities   Total Entities   Precision   Recall
    602                      620              758              97.1%       79.4%

► Precision and recall on a per-document basis:
    [Chart: “Precision and Recall”; x-axis: Documents (1 to 20); y-axis: Percentage (0 to 100); series: Recall, Precision]
Related Work
► Semex:
    Personal information management system that works with a user’s desktop
    Takes advantage of a predictable structure
    The results of disambiguated entities are propagated to other ambiguous entities, which could then be reconciled based on recently reconciled entities, much like our work does
Related Work
► KIM:
    An application that aims at automatic ontology population
    Contains an entity recognition portion that uses natural language processors
    Evaluations performed on human-annotated corpora
    Missed a lot of entities and results had many false positives
Conclusion
► Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
► This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text
Thank you!