Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.

Introduction

►Most Web pages present no explicit semantic information about their data and objects.

►The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
  - Ex: The entity “UGA” pointing to http://www.uga.edu
  - Using entity disambiguation

Introduction

►We use background knowledge in the form of an ontology

►Our contributions are two-fold:
  - A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
  - An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers

Background

►Sesame Repository
  - Open source RDF repository
  - We chose Sesame, as opposed to Jena and BRAHMS, because it can store large amounts of information without depending on memory storage alone
  - We chose Sesame’s native mode because our dataset is typically too large to fit into memory and the database option is too slow for update operations

Dataset 1: DBLP Ontology

►DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
  - 3,079,414 entities (447,121 are authors)
  - We used a SAX parser to parse the DBLP XML file that is available online
  - Created relationships such as “co-author”
  - Added information regarding affiliations
  - Added information regarding areas of interest
  - Added alternate spellings for international characters
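
The parsing step can be sketched with Python's built-in xml.sax module. This is a minimal illustration, not the authors' parser: the sample XML is a hand-made fragment in the style of the DBLP file, and CoAuthorHandler and parse_coauthors are hypothetical names. It streams publication records and derives “co-author” pairs from the authors listed on each one.

```python
import itertools
import xml.sax

# Hand-made sample in the style of the DBLP XML file (an assumption, not real data).
SAMPLE = b"""<dblp>
  <article key="a1"><author>Alice A.</author><author>Bob B.</author><title>T1</title></article>
  <inproceedings key="p1"><author>Bob B.</author><author>Carol C.</author><author>Alice A.</author><title>T2</title></inproceedings>
</dblp>"""

PUBLICATION_TAGS = {"article", "inproceedings"}  # subset chosen for illustration


class CoAuthorHandler(xml.sax.ContentHandler):
    """Collects the authors of each publication and derives co-author pairs."""

    def __init__(self):
        super().__init__()
        self.coauthors = set()   # unordered pairs stored as sorted tuples
        self._authors = []       # authors of the publication being read
        self._buffer = None      # text chunks of the current <author>, or None

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self._authors = []
        elif name == "author":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver element text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._buffer).strip())
            self._buffer = None
        elif name in PUBLICATION_TAGS:
            for a, b in itertools.combinations(sorted(set(self._authors)), 2):
                self.coauthors.add((a, b))


def parse_coauthors(xml_bytes):
    handler = CoAuthorHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.coauthors


pairs = parse_coauthors(SAMPLE)
```

A streaming parser fits here because the handler keeps only one publication's authors in memory at a time. Loading the resulting relationships into an RDF store is outside the scope of this sketch.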

Dataset 2: DBWorld Posts

►DBWorld
  - Mailing list of information for upcoming conferences related to the databases field
  - Created an HTML scraper that downloads everything with “Call for Papers”, “Call for Participation” or “CFP” in its subject
  - Unstructured text
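
The subject filter used by the scraper might look like the following sketch. Only the three phrases come from the slide; case-insensitive matching, the word boundary around “CFP”, and the sample subject lines are assumptions, and the download step itself is omitted.

```python
import re

# Subject phrases named on the slide; case-insensitive matching is an assumption.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_cfp_subject(subject):
    """Return True if a post's subject line marks it as a call for papers."""
    return CFP_PATTERN.search(subject) is not None


# Hypothetical subject lines for illustration.
subjects = [
    "Call for Papers: Workshop on Semantic Data",
    "Job announcement: postdoc position",
    "2nd CFP - Conference on Databases",
    "Call for Participation: Summer School",
]
selected = [s for s in subjects if is_cfp_subject(s)]
```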

Overview of System Architecture

Approach

►Entity Names
  - Entity attribute that represents the name of the entity
  - Can contain more than one name

Approach

►Text-proximity Relationships
  - Relationships that can be expected to be in text-proximity of the entity
  - Nearness measured in character spaces

Approach

►Text Co-occurrence Relationships
  - Similar to text-proximity relationships except proximity is not relevant

Approach

►Popular Entities
  - The intuition is to specify relationships that bias disambiguation toward the most popular entity
  - This should be used with care, depending on the domain
  - DBLP ex: the number of papers the entity has authored

Approach

►Semantic Relationships
  - Entities can be related to one another through their collaboration network
  - DBLP ex: Entities are related to one another through co-author relationships
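
The five kinds of ontology information described in the Approach slides could be declared in a small configuration object, sketched below. The class name, the property labels, and the proximity window value are all hypothetical; only the roles and the DBLP examples (affiliation, areas of interest, papers authored, co-author) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DisambiguationConfig:
    """Which ontology properties play each role described in the Approach slides."""
    name_properties: list = field(default_factory=list)
    proximity_properties: list = field(default_factory=list)
    proximity_window: int = 100  # nearness in characters; the value is an assumption
    cooccurrence_properties: list = field(default_factory=list)
    popularity_properties: list = field(default_factory=list)
    semantic_properties: list = field(default_factory=list)


# Property names below are hypothetical labels for the DBLP examples on the slides.
dblp_config = DisambiguationConfig(
    name_properties=["name", "alternate_spelling"],
    proximity_properties=["affiliation"],
    cooccurrence_properties=["area_of_interest"],
    popularity_properties=["authored_paper"],
    semantic_properties=["co_author"],
)
```

Keeping these choices in data rather than in code is one way to reuse the same method with a different ontology.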

Algorithm

►The idea is to spot entity names in text and assign each potential match a confidence score

►This confidence score is adjusted as the algorithm progresses and represents the certainty that the spotted entity represents a particular object in the ontology

Algorithm – Flow Chart

[Flow chart; the steps below are reconstructed from its labels]

►Start: spot entity names. If found, initiate a confidence score and store it in Candidate Entities; otherwise do nothing. Repeat while there are more entities.

►Spot text-proximity relationships. If found, adjust the confidence score; otherwise do nothing. Repeat while there are more candidate entities.

►Spot text co-occurrence relationships. If found, adjust confidence; otherwise do nothing. Repeat while there are more candidate entities.

►Adjust the confidence score based on the number of popular entity relationships.

►Search for semantic relationships. If found, adjust confidence; otherwise no change. Repeat while there are more candidate entities.

►Did a candidate entity rise above the threshold? Yes: repeat (iterative step). No: end.

Algorithm

►Spotting Entity Names
  - Search the document for entity names within the ontology
  - Each entity in the ontology that matches a name found in the document becomes a candidate entity
  - Assign initial confidence scores for candidate entities based on these formulas:
    [Formulas: not preserved in this transcript]
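
A minimal sketch of the spotting step follows. The slide's scoring formulas are not available here, so the initial score is a placeholder assumption (confidence split evenly among the entities that share a name) and should not be read as the authors' formula. The toy ontology, the Candidate class, and spot_entity_names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str
    name: str
    offset: int        # character offset of the match in the document
    confidence: float


# Toy ontology: URI -> list of names (hypothetical entries).
ONTOLOGY_NAMES = {
    "ex:person/1": ["John Smith", "J. Smith"],
    "ex:person/2": ["John Smith"],
    "ex:person/3": ["Mary Jones"],
}


def spot_entity_names(text, ontology_names):
    """Return one candidate per (entity, match) pair found in the text."""
    # Count how many entities share each name, to seed the placeholder score.
    sharers = {}
    for names in ontology_names.values():
        for name in names:
            sharers[name] = sharers.get(name, 0) + 1

    candidates = []
    for uri, names in ontology_names.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                # Placeholder initial score (NOT the slide's formula): split
                # the confidence evenly among entities sharing this name.
                score = 1.0 / sharers[name]
                candidates.append(Candidate(uri, name, match.start(), score))
    return candidates


text = "The keynote will be given by John Smith (University of Georgia)."
candidates = spot_entity_names(text, ONTOLOGY_NAMES)
```

Here two ontology entities share the spotted name, so both become candidates with equal starting confidence, and the later steps have to separate them.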

Algorithm

►Spotting Literal Values of Text-proximity Relationships
  - Only consider relationships from candidate entities
  - Substantially increase confidence score if within proximity
  - Ex: Entity affiliation found next to entity name
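
The proximity check can be sketched as below. The window size and the boost amount are illustrative values, not taken from the slides, and distance is measured between the start offsets of the two strings as a simplification. The function name and sample text are hypothetical.

```python
def boost_for_proximity(text, name_offset, literals, confidence,
                        window=50, boost=0.4):
    """Raise confidence if any literal occurs within `window` characters of the name.

    `window` and `boost` are illustrative values, not taken from the slides.
    """
    for literal in literals:
        start = text.find(literal)
        while start != -1:
            if abs(start - name_offset) <= window:
                return min(1.0, confidence + boost)
            start = text.find(literal, start + 1)
    return confidence


text = ("Invited speaker: John Smith, University of Georgia. "
        "Submissions are due in May.")
offset = text.index("John Smith")
near = boost_for_proximity(text, offset, ["University of Georgia"], 0.5)
far = boost_for_proximity(text, offset, ["Stanford University"], 0.5)
```

The candidate whose affiliation appears next to the name gets the large boost, while a candidate with a different affiliation keeps its score.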

Algorithm

►Spotting Literal Values of Text Co-occurrence Relationships
  - Only consider relationships from candidate entities
  - Increase confidence score if found within the document (location does not matter)
  - Ex: Entity’s areas of interest found in the document
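
A co-occurrence check differs from the proximity check only in ignoring position. In this sketch the boost per hit and the case-insensitive comparison are assumptions, and the function name and sample text are hypothetical.

```python
def boost_for_cooccurrence(text, literals, confidence, boost=0.1):
    """Raise confidence once for each literal found anywhere in the document.

    The `boost` value and the case-insensitive match are assumptions.
    """
    lowered = text.lower()
    hits = sum(1 for literal in literals if literal.lower() in lowered)
    return min(1.0, confidence + boost * hits)


text = ("Topics of interest include the Semantic Web, data integration, "
        "and query processing. Program chair: John Smith.")
areas = ["Semantic Web", "Bioinformatics", "Data Integration"]
score = boost_for_cooccurrence(text, areas, 0.5)
```

Two of the three areas of interest occur in the text, so the score rises by two small increments.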

Algorithm

►Using Popular Entities
  - Slightly increase the confidence score of candidate entities based on the number of popular entity relationships
  - Valuable when used as a tie-breaker
  - Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
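
The tie-breaker can be sketched in a few lines. The threshold of 15 publications comes from the slide's example; the size of the increase is an assumption, and the function name is hypothetical.

```python
def boost_for_popularity(publication_count, confidence,
                         min_publications=15, boost=0.05):
    """Slightly raise confidence for entities with more than `min_publications` papers.

    The threshold of 15 comes from the slide's example; `boost` is an assumption.
    """
    if publication_count > min_publications:
        return min(1.0, confidence + boost)
    return confidence


# Two candidates tied at 0.5: the more prolific author edges ahead.
prolific = boost_for_popularity(120, 0.5)
newcomer = boost_for_popularity(3, 0.5)
```

Keeping the increase small means it can separate otherwise tied candidates without overriding evidence found in the text.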

Algorithm

►Using Semantic Relationships
  - Use relationships among entities to boost confidence scores of candidate entities
  - Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If another candidate entity is found and is below the threshold, that entity’s confidence score is increased

Algorithm

►If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes

►This is an iterative step and always converges
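
The semantic step and its iteration can be sketched as follows. The threshold and boost values, the function name, and the toy data are assumptions. In this sketch each candidate's relationships are used at most once after it rises above the threshold, which is one simple way to guarantee that the loop terminates; the slides do not say how the original implementation ensures convergence.

```python
def propagate_semantic(confidence, related, threshold=0.8, boost=0.2):
    """Repeat the semantic step until no candidate newly rises above the threshold.

    confidence: dict mapping candidate URI -> score.
    related: dict mapping candidate URI -> set of related candidate URIs
             (e.g. co-authors). `threshold` and `boost` are assumed values.
    """
    scores = dict(confidence)
    used = set()   # candidates above the threshold whose relations were already applied
    changed = True
    while changed:
        changed = False
        above = [u for u, s in scores.items() if s > threshold and u not in used]
        for uri in above:
            used.add(uri)
            for other in related.get(uri, ()):
                # Only candidates that are not yet above the threshold are boosted.
                if other in scores and scores[other] <= threshold:
                    scores[other] = min(1.0, scores[other] + boost)
                    if scores[other] > threshold:
                        changed = True  # a new candidate crossed; iterate again
    return scores


confidence = {"ex:a": 0.9, "ex:b": 0.65, "ex:c": 0.7, "ex:d": 0.3}
related = {"ex:a": {"ex:b"}, "ex:b": {"ex:c"}, "ex:c": {"ex:d"}}
final = propagate_semantic(confidence, related)
```

In the toy data, ex:a lifts ex:b above the threshold, ex:b then lifts ex:c, and ex:d is boosted but stays below the threshold, so the process stabilizes after three passes.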

Output

►XML format
  - URI – the DBLP URL of the entity
  - Entity name
  - Confidence score
  - Character offset – the location of the entity in the document

►This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
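
An output record with the four listed fields could be produced as in this sketch, which uses Python's xml.etree.ElementTree. The element names and the sample values are hypothetical; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET


def to_xml(entities):
    """Serialize spotted entities; element names here are hypothetical."""
    root = ET.Element("entities")
    for entity in entities:
        node = ET.SubElement(root, "entity")
        ET.SubElement(node, "uri").text = entity["uri"]
        ET.SubElement(node, "name").text = entity["name"]
        ET.SubElement(node, "confidence").text = f"{entity['confidence']:.2f}"
        ET.SubElement(node, "offset").text = str(entity["offset"])
    return ET.tostring(root, encoding="unicode")


# Placeholder URI and values for illustration only.
xml_text = to_xml([
    {"uri": "http://example.org/author/1", "name": "John Smith",
     "confidence": 0.9, "offset": 29},
])
```

Because each record carries the character offset, a downstream converter can wrap the matching span of the original text in whatever markup it targets.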

Output

Output - Microformat

Evaluation: Gold Standard Set

►We evaluate our system using a gold standard set of documents
  - 20 manually disambiguated documents
  - Randomly chose 20 consecutive posts from DBWorld
  - We use precision and recall as the evaluation measures for our system

Evaluation: Gold Standard Set

Evaluation: Precision & Recall

►We define set A as the set of unique names identified using the disambiguated dataset

►We define set B as the set of entities found by our method

►The intersection of these sets represents the set of entities correctly identified by our method

Evaluation: Precision & Recall

►Precision is the proportion of correctly disambiguated entities with regard to B

►Recall is the proportion of correctly disambiguated entities with regard to A

Evaluation: Results

►Precision and recall when compared to entire gold standard set:

Correct Disambiguation   Found Entities   Total Entities   Precision   Recall
602                      620              758              97.1%       79.4%

►Precision and recall on a per document basis:

[Figure: Precision and Recall per document; x-axis: Documents 1 to 20; series: Precision, Recall]
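
The two measures follow directly from the set definitions given earlier, as this short sketch shows. The toy sets use hypothetical identifiers; the final two lines apply the same ratios to the counts reported in the results table.

```python
def precision_recall(gold, found):
    """gold is set A (manually disambiguated); found is set B (system output)."""
    correct = gold & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy sets with hypothetical identifiers.
A = {"e1", "e2", "e3", "e4"}
B = {"e1", "e2", "e5"}
p, r = precision_recall(A, B)  # 2 of 3 found are correct; 2 of 4 gold are found

# The same ratios applied to the counts reported in the results table.
overall_precision = 602 / 620
overall_recall = 602 / 758
```

Rounded to one decimal place, the two overall ratios agree with the 97.1% precision and 79.4% recall in the table.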

Related Work

►Semex:
  - Personal information management system that works with a user’s desktop
  - Takes advantage of a predictable structure
  - The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities, much like our work does

Related Work

►Kim:
  - An application that aims at automatic ontology population
  - Contains an entity recognition portion that uses natural language processors
  - Evaluations performed on human-annotated corpora
  - Missed many entities, and the results had many false positives

Conclusion

►Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques

►This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text

Thank you!


Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar. 5th International Semantic Web Conference, Athens, GA, Nov. 5 – 9,
