
The Semantic Web: there and back again

Tim Finin, University of Maryland, Baltimore County

Joint work with Lushan Han, Varish Mulwad, Anupam Joshi

http://ebiq.org/r/353

LOD 123: Making the semantic web easier to use

Tim Finin, University of Maryland, Baltimore County

Joint work with Lushan Han, Varish Mulwad, Anupam Joshi

Semantic Web: then and now

• Ten years ago we developed complex ontologies used to encode and reason over small datasets of 1000s of facts

• Recently the focus has shifted to using simple ontologies and minimal reasoning over very large datasets of 100s of millions of facts

• Major companies are moving: Google Knowledge Graph, Facebook Open Graph, Microsoft Satori, Apple Siri KB, IBM Watson KB. Linked open data, or “Things, not strings”

Linked Open Data (LOD)

• Linked data is just RDF data, lots of it, with a small schema

• RDF data is a graph of triples (subject, predicate, object)
– URI URI String: dbr:Barack_Obama dbo:spouse “Michelle Obama”
– URI URI URI: dbr:Barack_Obama dbo:spouse dbpedia:Michelle_Obama

• Best linked data practice prefers the 2nd pattern, using nodes rather than strings for “entities” – Things, not strings!

• Linked open data is just linked data freely accessible on the Web along with its ontologies
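The difference between the two patterns is easy to see in code. Below is a minimal sketch, assuming the Python rdflib package, that builds both triples: the literal version ends the graph at a string, while the URI version gives us a node we can keep linking from.

    from rdflib import Graph, Literal, Namespace

    dbr = Namespace("http://dbpedia.org/resource/")
    dbo = Namespace("http://dbpedia.org/ontology/")

    g = Graph()
    g.bind("dbr", dbr)
    g.bind("dbo", dbo)

    # Pattern 1 (URI URI String): the object is only a string
    g.add((dbr.Barack_Obama, dbo.spouse, Literal("Michelle Obama")))

    # Pattern 2 (URI URI URI): the object is a node we can say more about
    g.add((dbr.Barack_Obama, dbo.spouse, dbr.Michelle_Obama))

    print(g.serialize(format="turtle"))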

Semantic web technologies allow machines to share data and knowledge using common Web languages and protocols.

~1997: Semantic Web beginning
Use Semantic Web technology to publish shared data & knowledge

2007: Semantic Web => Linked Open Data: LOD beginning
Data is inter-linked to support integration and fusion of knowledge

2008: LOD growing

2009: … and growing

2010: …growing faster
LOD is the new Cyc: a common source of background knowledge

2011: 31B facts in 295 datasets interlinked by 504M assertions on ckan.net

Exploiting LOD not (yet) Easy

• Publishing or using LOD data has inherent difficulties for the potential user

– It’s difficult to explore LOD data and to query it for answers

– It’s challenging to publish data using appropriate LOD vocabularies & link it to existing data

• Problem: O(10^4) schema terms, O(10^11) instances

• I’ll describe two ongoing research projects that are addressing these problems

GoRelations: An Intuitive Query System for Linked Data

Research with Lushan Han

http://ebiq.org/j/93

DBpedia is the Stereotypical LOD

• DBpedia is an important example of Linked Open Data
– Extracts structured data from infoboxes in Wikipedia
– Stores it in RDF using custom ontologies and Yago terms

• The major integration point for the entire LOD cloud

• Explorable as HTML, but harder to query in SPARQL

Browsing DBpedia's page for Mark Twain

Why it’s hard to query LOD

• Querying DBpedia requires a lot of a user (see the sketch below)
– Understand the RDF model
– Master SPARQL, a formal query language
– Understand ontology terms: 320 classes & 1600 properties!
– Know instance URIs (>2M entities!)
– Term heterogeneity (Place vs. PopulatedPlace)

• Querying large LOD datasets is overwhelming

• Natural language query systems are still a research goal
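To make the burden concrete, here is a sketch of a direct DBpedia query using the Python SPARQLWrapper package. The endpoint, dbo:birthPlace, and dbr:Mark_Twain are real DBpedia terms, but the point is what the user must already know to write it: SPARQL syntax, the right property name, and the exact instance URI.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?place WHERE { dbr:Mark_Twain dbo:birthPlace ?place }
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["place"]["value"])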

Goal

• Let users with a basic understanding of RDF query DBpedia and other LOD collections
– Explore what data is in the system
– Get answers to questions
– Create SPARQL queries for reuse or adaptation

• Desiderata
– Easy to learn and to use
– Good accuracy (e.g., precision and recall)
– Fast

Key Idea

Structured keyword queries reduce problem complexity:

– User enters a simple graph, and
– Annotates the nodes and arcs with words and phrases

Structured Keyword Queries

• Nodes are entities; links are binary relations

• Entities are described by two unrestricted terms: a name or value, and a type or concept

• Outputs are marked with ?

• A compromise between a natural language Q&A system and a formal query (illustrated below)
– Users provide the compositional structure of the question
– Users are free to use their own terms to annotate the structure
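As a concrete illustration (the representation below is hypothetical, not the system's actual data structure), a structured keyword query is just a small annotated graph. This one asks for the places where authors of books were born:

    # Nodes carry the user's own type/concept words; '?' outputs are
    # flagged with "output"; arcs are labeled with free-text relations.
    query = {
        "nodes": {
            "n1": {"type": "Place", "output": True},
            "n2": {"type": "Author", "output": False},
            "n3": {"type": "Book", "output": False},
        },
        "edges": [
            ("n2", "born in", "n1"),
            ("n2", "wrote", "n3"),
        ],
    }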

Translation – Step One: finding semantically similar ontology terms

For each graph concept/relation, generate k most semantically similar ontology classes/properties

Lexical similarity metric based on distributional similarity, LSA, and WordNet

Semantic similarity: http://bit.ly/SEMSIM
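As a stand-in for that metric, the sketch below ranks candidate ontology terms for a user term by cosine similarity over word vectors. The three-dimensional vectors are toy values invented for illustration; the real metric combines distributional similarity, LSA, and WordNet.

    import math

    vectors = {                      # hypothetical toy embeddings
        "author": [0.9, 0.2, 0.1],
        "Writer": [0.8, 0.3, 0.1],
        "Person": [0.5, 0.5, 0.3],
        "Book":   [0.1, 0.9, 0.2],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    def top_k(user_term, candidates, k=2):
        # score every candidate ontology term against the user's word
        scored = sorted(((cosine(vectors[user_term], vectors[c]), c)
                         for c in candidates), reverse=True)
        return scored[:k]

    print(top_k("author", ["Writer", "Person", "Book"]))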

Semantic Textual Similarity task

• 2013 Joint Conference on Lexical and Computational Semantics (*SEM)
• Do two sentences have the same meaning (scored 0…5)?

1: “The woman is playing the violin” vs. “The young lady enjoys listening to the guitar”
4: “In May 2010, the troops attempted to invade Kabul” vs. “The US army invaded Kabul on May 7th last year, 2010”

• 2012: 35 teams, 88 runs; 2013: 36 teams, 89 runs
• 2,250 sentence pairs from four domains
• Our three runs placed #1, #2 and #4

Dataset                 PairingWords   Galactus      Saiyan
Headlines (750 pairs)   0.7642 (3)     0.7428 (7)    0.7838 (1)
OnWN (561 pairs)        0.7529 (5)     0.7053 (12)   0.5593 (36)
FNWN (189 pairs)        0.5818 (1)     0.5444 (3)    0.5815 (2)
SMT (750 pairs)         0.3804 (8)     0.3705 (11)   0.3563 (16)
Weighted mean           0.6181 (1)     0.5927 (2)    0.5683 (4)

(rank among all runs in parentheses)

• Assemble best interpretation using statistics of the data

• Use pointwise mutual information (PMI) between RDF terms in the LOD collection

Measures degree to which two RDF terms co-occur in knowledge base

• In a good interpretation, ontology terms associate like their corresponding user terms connect in the query
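The standard PMI definition, with the probabilities estimated from co-occurrence counts of RDF terms over triples in the LOD collection:

    \mathrm{PMI}(x, y) \;=\; \log \frac{p(x, y)}{p(x)\, p(y)}

PMI is positive when x and y co-occur more often than independence would predict, which is why strongly associated ontology terms signal a good interpretation.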

Translation – Step Two: disambiguation algorithm

Three aspects are combined to derive an overall goodness measure for each candidate interpretation

Translation – Step Two: disambiguation algorithm

– Joint disambiguation
– Resolving direction
– Link reasonableness

Translation result
Concepts: Place => Place, Author => Writer, Book => Book
Properties: born in => birthPlace, wrote => author (inverse direction)

The translation of a semantic graph query to SPARQL is straightforward given the mappings

SPARQL Generation

Concepts
• Place => Place
• Author => Writer
• Book => Book

Relations
• born in => birthPlace
• wrote => author
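A sketch of that mechanical rewrite in Python. The node names and the helper function are hypothetical, and a real generator would also emit PREFIX declarations; the point is that once the mappings are fixed, SPARQL falls out directly.

    concepts = {"place": "dbo:Place", "author": "dbo:Writer",
                "book": "dbo:Book"}
    relations = [("author", "dbo:birthPlace", "place"),  # born in => birthPlace
                 ("book", "dbo:author", "author")]       # wrote => author (inverted)
    outputs = ["place"]

    def to_sparql(concepts, relations, outputs):
        # one type triple per concept node, one triple per mapped relation
        lines = ["?%s a %s ." % (n, cls) for n, cls in concepts.items()]
        lines += ["?%s %s ?%s ." % (s, p, o) for s, p, o in relations]
        head = " ".join("?%s" % n for n in outputs)
        return "SELECT %s WHERE {\n  %s\n}" % (head, "\n  ".join(lines))

    print(to_sparql(concepts, relations, outputs))

Note how the inverted “wrote” mapping simply swaps subject and object in the generated triple pattern.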

Evaluation

• 33 test questions from the 2011 Workshop on Question Answering over Linked Data, answerable using DBpedia

• Three human subjects unfamiliar with DBpedia translated the test questions into semantic graph queries

• Compared with two top natural language QA systems: PowerAqua and True Knowledge

Current work

• The baseline system works well for DBpedia; we're testing a second use case now

• Current work
– Better entity matching
– Relaxing the need for type information
– A better Web interface with user feedback & advice

• See http://ebiq.org/93 for more information & try our alpha version at http://ebiq.org/GOR

Generating Linked Data by Inferring the Semantics of Tables

Research with Varish Mulwad

http://ebiq.org/j/96

Goal: Table => LOD*

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Example annotations: Team column => http://dbpedia.org/class/yago/NationalBasketballAssociationTeams (cells linked via dbprop:team); Allen Iverson => http://dbpedia.org/resource/Allen_Iverson; Height = player height in meters

* DBpedia

Goal: Table => LOD*

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbo:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbo:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .

RDF Linked Data

All this in a completely automated way

* DBpedia
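A sketch of the emission step, assuming rdflib; the two annotation dictionaries are hypothetical stand-ins for the mappings the system infers, and a real emitter would also replace cell strings with canonical labels.

    from rdflib import RDF, RDFS, Graph, Literal, Namespace

    dbpedia = Namespace("http://dbpedia.org/resource/")
    dbo = Namespace("http://dbpedia.org/ontology/")

    # hypothetical stand-ins for the system's inferred annotations
    cell_entity = {"Michael Jordan": dbpedia.Michael_Jordan,
                   "Chicago": dbpedia.Chicago_Bulls}
    column_class = {"Name": dbo.BasketballPlayer}

    g = Graph()
    row = {"Name": "Michael Jordan", "Team": "Chicago"}
    for header, value in row.items():
        entity = cell_entity[value]               # cell => linked entity
        g.add((entity, RDFS.label, Literal(value, lang="en")))
        if header in column_class:                # column => class assertion
            g.add((entity, RDF.type, column_class[header]))

    print(g.serialize(format="turtle"))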

Tables are everywhere !! … yet …

The web: 154 million high-quality relational tables

Fewer than 1% of the 400K tables at data.gov have rich semantic schemas

A Domain Independent Framework

1. Pre-processing modules: sampling, acronym detection
2. Query and generate initial mappings
3. Joint inference/assignment
4. Generate linked RDF; verify (optional); store in a knowledge base & publish as LOD

Query and Rank

Cell strings (e.g., Chicago, Boston, Allen Iverson) are queried against the knowledge base, and candidates are scored by Rank(String Similarity, Popularity).

Possible entities for Chicago:
1. Chicago_Bulls
2. Chicago
3. Judy_Chicago

The knowledge base can be replaced by domain-specific or other LOD knowledge bases.
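A sketch of that ranking step, with Python's difflib standing in for the real string-similarity feature; the candidate list, popularity scores, and the 0.7 weighting are all illustrative.

    from difflib import SequenceMatcher

    def rank(cell, candidates, popularity, w=0.7):
        # score = weighted mix of string similarity and popularity
        def score(ent):
            sim = SequenceMatcher(None, cell.lower(),
                                  ent.replace("_", " ").lower()).ratio()
            return w * sim + (1 - w) * popularity.get(ent, 0.0)
        return sorted(candidates, key=score, reverse=True)

    candidates = ["Chicago_Bulls", "Chicago", "Judy_Chicago"]
    popularity = {"Chicago_Bulls": 0.6, "Chicago": 0.9, "Judy_Chicago": 0.1}
    print(rank("Chicago", candidates, popularity))

On string evidence alone, "Chicago" the city outranks the Bulls; it is the later joint inference over the whole column that corrects this.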

Generating candidate ‘types’ for Columns

Team column instances: Chicago, Philadelphia, Houston, San Antonio. Each cell's candidate entities (e.g., 1. Chicago Bulls, 2. Chicago, 3. Judy Chicago) contribute their classes:

{dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
{…}

The column's candidate types are the union: dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …
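In code, candidate-type generation is just pooling classes across candidates; the attribution of classes to particular entities below is an illustrative guess at the slide's sets, not data from the system.

    candidate_classes = {
        "Chicago_Bulls": {"dbpedia-owl:Place", "dbpedia-owl:City",
                          "yago:NationalBasketballAssociationTeams"},
        "Chicago": {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace",
                    "dbpedia-owl:Film"},
        "Judy_Chicago": {"yago:WomenArtist", "yago:LivingPeople"},
    }

    # the column's candidate types: union over every candidate's classes
    column_candidates = set().union(*candidate_classes.values())
    print(sorted(column_candidates))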

Joint Inference / Assignment

A graphical model for tables: joint inference over the evidence in a table. The variables are the column headers (C1, C2, C3) and the cell values (R11 … R33), e.g. the Team column with Chicago, Philadelphia, Houston, and San Antonio, each needing both a class and an instance assignment.

Parameterized Graphical Model

Variable nodes: the column headers (C1, C2, C3) and the row values (R11 … R33).

Factor nodes (ψ): functions capturing
– the affinity between a column header and its row values
– the interaction between column headers
– the interaction between row values

Standard message passing computes the joint assignment P(C1, C2, C3, R11, R12, R13, R21, R22, R23, R31, R32, R33), which factors into smaller pieces such as P(C1, R11, R12, R13), P(C2, R21, R22, R23), P(C3, R31, R32, R33), and P(R31, R32, R33).

Graphical Models: Exploit Conditional Independences
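One plausible reading of that factor graph, assuming R_ij is the cell in column i, row j, with a factor tying each column header to its cells, one over the headers, and one per row:

    P(C_1,\dots,C_n, R_{11},\dots,R_{nm}) \;\propto\;
      \prod_{i=1}^{n} \psi^{\mathrm{col}}_i(C_i, R_{i1},\dots,R_{im})
      \;\cdot\; \psi^{\mathrm{hdr}}(C_1,\dots,C_n)
      \;\cdot\; \prod_{j=1}^{m} \psi^{\mathrm{row}}_j(R_{1j},\dots,R_{nj})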

Still… a single factor over C1, R11, R12, and R13 (4 variables, each with 25 candidate values) has 25^4 = 390,625 entries!

Semantic message passing

Current assignments: R11:[Michael_I_Jordan], R12:[Yao_Ming], R13:[Allen_Iverson], R21:[Chicago_Bulls], R31:[Shooting_Guard], …; C1:[BasketballPlayer], C2:[NBATeam], C3:[BasketBallPositions]

Instead of numeric messages, factors send semantic ones: “No Change” to assignments the evidence supports (Yao_Ming, Allen_Iverson, Chicago_Bulls, Shooting_Guard, NBATeam, BasketBallPositions) and “Change” to those it does not, e.g. Michael_I_Jordan should change to something related to Chicago_Bulls, to a BasketballPlayer.

Inference – Example

Initial assignments: C1:[Name]; R11:[Michael_I_Jordan], R12:[Allen_Iverson], R13:[Yao_Ming]

The column factor checks column header – row value agreement over (Michael_I_Jordan, Allen_Iverson, Yao_Ming), infers the class “BasketBallPlayer”, and sends “Change” to R11 and “No Change” to R12 and R13.

Candidate entities for R11 “Michael Jordan”: 1. Michael_I_Jordan (Professor) 2. … 3. Michael_Jordan (BasketballPlayer) …

Column header – row value agreement

Each row value's candidate classes get a +1 vote: Michael_I_Jordan votes for LivingPeople and AI_Researchers; Allen_Iverson and Yao_Ming vote for BasketballPlayer, Athlete, and LivingPeople; other candidates contribute classes such as WomenArtist, City, PopulatedPlace, GeoPopulatedPlace, Film, and ArtWork.

Ranked DBpedia classes: 1. Athlete 2. City 3. …
Ranked Yago classes: 1. LivingPeople 2. BasketBallPlayer 3. GeoPopulatedPlace …

Yago tie-breaker/re-order: choose the more ‘descriptive’ class, e.g. BasketBallPlayer is better than LivingPeople.

ClassGranularityScore = 1 − […]

Column header – row value agreement:

1: Majority voting: topClassScore = numberOfVotes / numberOfRows; compute topScoreYago & topScoreDBpedia

2: Choose the top class: Top Yago: BasketBallPlayer, Top DBpedia: Athlete

If (topScoreYago || topScoreDBpedia) >= Threshold, check for alignment: is Athlete a sub/superclass of BasketBallPlayer? If so, Column Header Annotation = BasketBallPlayer, Athlete. If both scores < Threshold, the column gets no annotation (next slide). A sketch of the vote follows below.
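A sketch of the vote-and-threshold logic in Python; the threshold and the per-row class sets are illustrative, and the Yago granularity tie-breaker and alignment check are left out.

    from collections import Counter

    def annotate_column(row_classes, threshold=0.5):
        # every row's candidate classes each get a +1 vote
        votes = Counter(c for classes in row_classes for c in classes)
        top_class, top_votes = votes.most_common(1)[0]
        top_score = top_votes / len(row_classes)  # numberOfVotes / numberOfRows
        return top_class if top_score >= threshold else "No-Annotation"

    rows = [
        {"BasketballPlayer", "Athlete", "LivingPeople"},  # Allen_Iverson
        {"BasketballPlayer", "Athlete", "LivingPeople"},  # Yao_Ming
        {"LivingPeople", "AI_Researchers"},               # Michael_I_Jordan
    ]
    print(annotate_column(rows))  # LivingPeople: 3 of 3 rows vote for it

On these rows LivingPeople wins on raw votes, which is exactly why the granularity tie-breaker prefers the more descriptive BasketBallPlayer.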

If both scores fall below the threshold, update Column Header Annotation = “No-Annotation”. Otherwise, send “Change” to row values whose classes disagree with the column annotation (e.g., Michael_I_Jordan, with LivingPeople and AI_Researchers) and “No-Change” to those that agree (Allen_Iverson and Yao_Ming, with BasketballPlayer, Athlete, LivingPeople), then update the row value entity annotations.

On receiving “CHANGE” from factor ψ3, R11 re-ranks its candidate entities (1. Michael_I_Jordan, classes LivingPeople and AI_Researchers; 2. …; 3. Michael_Jordan, classes BasketBallPlayer and Athlete) against the required entity class, BasketBallPlayer or Athlete. R11 “Michael Jordan” is re-assigned to Michael_Jordan.

Evaluation

• Dataset of 80 tables (Wikipedia tables; part of a larger dataset released by IIT-Bombay)

• Evaluated column header annotation accuracy
– How good was the mapping from Team to NationalBasketballAssociationTeams?

• Evaluated entity linking accuracy
– Mapping Michael Jordan to Michael_Jordan

Column Header Annotation Accuracy

• The system produced a ranked list of Yago & DBpedia classes

• Human judges evaluated each class

• For precision, judges scored each class:
• 1 if the class was accurate
• 0.5 if the class was OK, but not the best (e.g., Place vs. City)
• 0 if it was incorrect

• For recall, they scored 1 if accurate/correct, 0 if incorrect

Judgments over all proposed classes: Accurate 522, Okay 259, Incorrect 422.

Column header annotations, F-measure for the top-k classes:

Yago rank 1: 0.688    DBpedia rank 1: 0.677
Yago rank 2: 0.488    DBpedia rank 2: 0.627
Yago rank 3: 0.438    DBpedia rank 3: 0.613

Baselines: GOOG 0.67, IIT-B 0.56

Entity linking accuracy

http://dbpedia.org/resource/Allen_IversonAllen Iverson

Correctly Linked Entities

3022

Incorrectly Linked Entities

959

Total Entities 3981

Accuracy : 75.91 %

Other Challenges

• Using table captions and other text in associated documents to provide context

• The size of some data.gov tables (> 400K rows!) makes using the full graphical model impractical
– Sample the table and run the model on the subset

• Achieving acceptable accuracy may require human input
– 100% accuracy is unattainable automatically
– How best to let humans offer advice and/or correct interpretations?

Final Conclusions

• Linked data is great for sharing structured and semi-structured data
– Backed by machine-understandable semantics
– Uses successful Web languages and protocols

• Generating and exploring linked data resources is challenging
– Schemas are too large; too many URIs

• New tools for mapping tables to linked data and for translating structured natural-language queries reduce the barriers

http://ebiq.org/