Christian Bizer: Fusing the Web of Data (12/08/2008)
3rd Asian Semantic Web Conference (ASWC 2008), DIST Workshop, Bangkok, Thailand
8 December 2008
Fusing the Web of Data
Christian Bizer, Freie Universität Berlin
Overview
1. The Web of Data
   - Linked Data Principles
   - Linked Data Deployment
   - Applications that consume Linked Data
2. Linked Data Fusion
   1. The Linking Process
   2. Inconsistency Resolution
   3. Provenance Tracking and Explanations
The Classic Web
[Figure: HTML documents A, B and C connected by hyper-links and accessed by Web browsers and search engines]
Single global information space
1. URLs as globally unique IDs and retrieval mechanism
2. HTML as shared content format
3. Hyperlinks
Shortcomings
Content is not well structured
You cannot ask expressive queries
You cannot process content within applications
Linked Data
[Figure: data sources A to E, each containing things, connected by typed links]
Use Semantic Web technologies to
1. publish structured data on the Web,
2. set links between data from one data source to data within other data sources.
Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful RDF information.
4. Include RDF statements that link to other URIs so that they can discover related things.
Tim Berners-Lee 2007
http://www.w3.org/DesignIssues/LinkedData.html
The RDF Data Model
[Figure: RDF graph with the following statements]
pd:cygri rdf:type foaf:Person .
pd:cygri foaf:name "Richard Cyganiak" .
pd:cygri foaf:based_near dbpedia:Berlin .
Data objects are identified with HTTP URIs
[Figure: the RDF graph from "The RDF Data Model", whose nodes pd:cygri and dbpedia:Berlin are HTTP URIs]
pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin
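The prefix notation on this slide can be expanded mechanically. The sketch below is only an illustration: the pd and dbpedia namespaces come from the slide, the foaf and rdf namespaces are the standard ones, and the function name expand is made up for this example.

```python
# Prefix table: pd and dbpedia are taken from the slide,
# foaf and rdf are the standard namespace URIs.
PREFIXES = {
    "pd": "http://richard.cyganiak.de/foaf.rdf#",
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(name: str) -> str:
    """Expand a prefixed name such as 'pd:cygri' into a full HTTP URI."""
    prefix, _, local = name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


print(expand("pd:cygri"))
print(expand("dbpedia:Berlin"))
```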
Dereferencing URIs over the Web
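[Figure: RDF graph after dereferencing dbpedia:Berlin]
pd:cygri rdf:type foaf:Person .
pd:cygri foaf:name "Richard Cyganiak" .
pd:cygri foaf:based_near dbpedia:Berlin .
dbpedia:Berlin dp:population 3,405,259 .
dbpedia:Berlin skos:subject dp:Cities_in_Germany .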
Dereferencing URIs over the Web
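[Figure: the graph extended with two more cities]
pd:cygri rdf:type foaf:Person .
pd:cygri foaf:name "Richard Cyganiak" .
pd:cygri foaf:based_near dbpedia:Berlin .
dbpedia:Berlin dp:population 3,405,259 .
dbpedia:Berlin skos:subject dp:Cities_in_Germany .
dbpedia:Hamburg skos:subject dp:Cities_in_Germany .
dbpedia:Muenchen skos:subject dp:Cities_in_Germany .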
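Dereferencing a URI amounts to an HTTP GET that asks for RDF instead of HTML. The sketch below only builds such a request with Python's standard library and does not send it. The function name build_rdf_request is made up, and it is an assumption that the server answers the Accept header with RDF/XML.

```python
from urllib.request import Request


def build_rdf_request(uri: str) -> Request:
    """Build (but do not send) an HTTP GET request that asks for RDF/XML."""
    # The fragment (#cygri) is never sent to the server, so strip it:
    # the client fetches the document and then looks up the thing inside it.
    document_uri = uri.split("#", 1)[0]
    # Assumption: the server supports content negotiation for RDF/XML.
    return Request(document_uri, headers={"Accept": "application/rdf+xml"})


req = build_rdf_request("http://richard.cyganiak.de/foaf.rdf#cygri")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup; parsing the returned RDF is left to an RDF library.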
2. Linked Data Deployment on the Web
[Figure: data sources A to E, each containing things, connected by typed links]
Is this real?
W3C Linking Open Data Project
Community effort to
- publish existing open license datasets as Linked Data on the Web
- interlink things between different data sources
LOD Datasets on the Web: May 2007
Over 500 million RDF triples
Around 120,000 RDF links between data sources
Example RDF Links
RDF links from DBpedia to other data sources
<http://dbpedia.org/resource/Berlin> owl:sameAs
    <http://sws.geonames.org/2950159> .
<http://dbpedia.org/resource/Tim_Berners-Lee> owl:sameAs
    <http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007> .
RDF link from a FOAF profile to DBpedia
<http://richard.cyganiak.de/foaf.rdf#cygri> foaf:topic_interest
    <http://dbpedia.org/resource/Semantic_Web> .
LOD Datasets on the Web: September 2008
> 2 billion RDF triples
> 6 million RDF links
The Bio2RDF Project
Goals
1. Make bioinformatics data available in RDF format on the Web.
2. Promote the linked data vision within the bioinformatics community.
3. Answer questions which were not possible or practical to ask before.
Participants
- Université Laval, Canada
- Queensland University of Technology, Australia
The Bio2RDF Cloud
27 data sources
260 million records
2.7 billion RDF triples
3. Applications
[Figure: the Web of Data (data sources A to E, things connected by typed links), consumed by Linked Data browsers, Linked Data mashups and search engines]
What can I do with this?
Linked Data Browsers
Tabulator Browser (MIT, USA)
Disco Hyperdata Browser (FU Berlin, DE)
OpenLink RDF Browser (OpenLink, UK)
Zitgist RDF Browser (Zitgist, USA)
Humboldt (HP Labs, UK)
Fenfire (DERI, Ireland)
Marbles (FU Berlin, DE)
Linked Data Mashups
Domain-specific applications using Linked Data from the Web
DBtune Slashfacet
Visualizes music-related Linked Data
Uses LastFM, MySpace, and BBC data
DBpedia Mobile
Geospatial entry point into the Web of Data
Starts with DBpedia, Revyu and Flickr data
Web of Data Search Engines
Falcons (IWS, China)
Sindice (DERI, Ireland)
MicroSearch (Yahoo, Spain)
Watson (Open University, UK)
SWSE (DERI, Ireland)
Swoogle (UMBC, USA)
2. Linked Data Fusion
[Figure: an application builds an integrated view from data objects 1 to 6, which are spread over data sources A, B and C and connected by owl:sameAs links]
Users want an integrated view of all data that is available about a real-world entity!
Linked Data Fusion - Requirements
1. Map data into a single schema so that data can be rendered and queried properly.
2. Smush data from all sources about a single real-world entity while keeping track of information provenance.
3. Resolve inconsistencies in the data by applying different data fusion heuristics.
4. Be able to explain the fusion process (Tim Berners-Lee's "Oh, yeah?" button).
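Requirement 2 can be sketched with a union-find structure over owl:sameAs links. This is a minimal illustration, not the method of any particular tool. The function name smush, the tuple layout, and the second population value are made up for the example, and picking the lexicographically smaller URI as canonical is an arbitrary choice.

```python
from collections import defaultdict


def smush(statements, same_as):
    """Group statements about owl:sameAs-linked URIs under one canonical URI.

    statements: iterable of (subject, predicate, value, source)
    same_as:    iterable of (uri_a, uri_b) pairs
    Returns {canonical_uri: {predicate: [(value, source), ...]}}, so the
    source of every value stays visible after merging.
    """
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary choice: the lexicographically smaller URI is canonical.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    merged = defaultdict(lambda: defaultdict(list))
    for subject, predicate, value, source in statements:
        merged[find(subject)][predicate].append((value, source))
    return {uri: dict(props) for uri, props in merged.items()}


links = [("http://dbpedia.org/resource/Berlin", "http://sws.geonames.org/2950159")]
data = [
    ("http://dbpedia.org/resource/Berlin", "population", 3405259, "dbpedia"),
    # Illustrative value only, not taken from Geonames.
    ("http://sws.geonames.org/2950159", "population", 3400000, "geonames"),
]
print(smush(data, links))
```

Keeping (value, source) pairs instead of bare values is what makes the later inconsistency resolution and explanation steps possible.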
Roles in the Linked Data Scenario
Data Publisher
1. Publish the data itself.
2. Set RDF links to other data items describing the same real-world entity.
3. Reuse terms from existing vocabularies or set links to related schemata.
4. Publish metadata about
- provenance
- timeliness
- data license
Client Application
1. Map data into a single schema.
2. Smush data from different sources about a real-world entity.
3. Resolve inconsistencies in the data.
4. Keep track of information provenance and lineage.
5. Explain the fusion process.
2.1 Setting RDF Links
Today: simple pattern-based and graph-matching techniques are used to generate links.
- Usually proprietary code.
There is a lot of existing work on identity resolution in the database and knowledge representation communities that can be used.
- Rule-based approaches
- Distance-based techniques
- Probabilistic matching
- Supervised and unsupervised learning
- Using a wide range of distance metrics
See: Elmagarmid et al.: Duplicate Record Detection: A Survey. IEEE TKDE, 2007.
Linking Frameworks
Goal: (Semi-)automatically generate RDF Links based on declarative rules.
Ongoing work
- Oktie Hassanzadeh (University of Toronto): ODDLinker
- Andriy Nikolov et al. (Open University): KnoFuss
- Julius Volz (Freie Universität Berlin): XXXX
seeAlso: http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
CREATE LINKS owl:sameAs
BETWEEN a FROM dbpedia AND b FROM factbook
RESTRICT a TO { ?a rdf:type dbpedia-owl:Country }
METRIC {
  STRING_SIMILARITY(a/rdfs:label, b/rdfs:label),
  NUM_SIMILARITY(a/p:populationEstimate, b/factbook:population_total),
  NUM_SIMILARITY(a/p:areaKm, b/factbook:area_total)
}
THRESHOLDS MATCH 0.9 VERIFY 0.7;
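What such a declarative rule could compute is sketched below in Python. The slide does not say how the three metric values are combined or how NUM_SIMILARITY is defined, so the unweighted mean and the relative-difference formula are assumptions. The record fields and the sample countries with their numbers are made up for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def num_similarity(a: float, b: float) -> float:
    """1.0 for equal numbers, falling towards 0.0 as the relative gap grows."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def classify(a: dict, b: dict, match: float = 0.9, verify: float = 0.7) -> str:
    """Return 'match', 'verify' or 'no link' for a candidate pair."""
    scores = [
        string_similarity(a["label"], b["label"]),
        num_similarity(a["population"], b["population"]),
        num_similarity(a["area_km"], b["area_km"]),
    ]
    score = sum(scores) / len(scores)  # assumption: unweighted mean
    if score >= match:
        return "match"    # generate the owl:sameAs link
    if score >= verify:
        return "verify"   # hand over to a human for checking
    return "no link"


# Made-up records for illustration only.
dbpedia_de = {"label": "Germany", "population": 82000000, "area_km": 357000}
factbook_de = {"label": "Germany", "population": 82400000, "area_km": 357021}
factbook_fj = {"label": "Fiji", "population": 900000, "area_km": 18000}

print(classify(dbpedia_de, factbook_de))
print(classify(dbpedia_de, factbook_fj))
```

The two thresholds split candidate pairs into links that are accepted automatically, links that need manual verification, and pairs that are dropped.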
Schema Level RDF Links
Today: simple mappings
- owl:equivalentClass
- owl:equivalentProperty
- rdfs:subClassOf
- rdfs:subPropertyOf
UMBEL effort:
Lots of existing work on schema/ontology matching to build on.
Missing: an agreed-upon way to publish more expressive mapping rules on the Web.
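How a client might apply the simple mappings is sketched below. The mapping table is hypothetical: the two factbook entries are invented for illustration from property names that appear in the linking rule, and only foaf:name as a sub-property of rdfs:label reflects a published vocabulary statement. The function name to_target_schema is also made up.

```python
# Hypothetical mapping table: source property -> (relation, target property).
MAPPINGS = {
    "factbook:population_total": ("owl:equivalentProperty", "p:populationEstimate"),
    "factbook:area_total": ("owl:equivalentProperty", "p:areaKm"),
    "foaf:name": ("rdfs:subPropertyOf", "rdfs:label"),
}


def to_target_schema(triples):
    """Rewrite predicates to the target vocabulary where a mapping exists.

    Both owl:equivalentProperty and rdfs:subPropertyOf allow restating a
    triple with the target property; unmapped predicates pass through.
    """
    result = []
    for subject, predicate, obj in triples:
        _relation, target = MAPPINGS.get(predicate, (None, predicate))
        result.append((subject, target, obj))
    return result


source = [
    ("ex:de", "factbook:population_total", 82000000),  # made-up value
    ("ex:de", "ex:other", "x"),
]
print(to_target_schema(source))
```

Anything beyond a one-to-one property rename, such as unit conversion or splitting a value, is exactly the kind of more expressive mapping rule the slide says is missing.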
2.2 Publish Metadata
Document Metadata
- Dublin Core, Semantic Web Publishing Vocabulary
Licensing Metadata
- Creative Commons Licensing Framework
- Open Data Commons Public Domain Dedication & Licence (PDDL)
# Metadata and Licensing Information
<http://dbpedia.org/data/Alec_Empire>
    rdf:type foaf:Document ;
    dc:publisher <http://dbpedia.org/resource/DBpedia> ;
    dc:date "2007-07-13"^^xsd:date ;
    dc:rights <http://en.wikipedia.org/wiki/WP:GFDL> .
# The Document Content
<http://dbpedia.org/resource/Alec_Empire>
    rdf:type foaf:Person ;
    foaf:name "Empire, Alec" ;
    dbpedia-owl:associatedBand dbpedia:Atari_Teenage_Riot .
2.3. Provenance and Lineage Tracking
Named Graphs data model
- part of the W3C SPARQL Recommendation
- implemented by an increasing number of RDF stores
# TriG Representation of three Named Graphs
:G1 {
    :Monica ex:name "Monica Murphy" .
    :Monica ex:homepage <http://www.monicamurphy.org> .
    :Monica ex:email <mailto:monica@monicamurphy.org> .
}
:G2 {
    :Monica rdf:type ex:Person .
    :Monica ex:hasSkill ex:Programming
}
:G3 {
    :G1 swp:assertedBy _:w1 .
    _:w1 swp:authority :Chris .
    _:w1 dc:date "2003-10-02"^^xsd:date .
    :G2 swp:quotedBy _:w2 .
    _:w2 swp:authority :Chris .
    _:w2 dc:date "2003-09-03"^^xsd:date .
}
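Named graphs turn triples into quads, which is enough to answer "who said this, and when?". The sketch below mirrors part of the TriG example with plain Python tuples instead of an RDF store. The helper names objects and provenance are made up for this illustration.

```python
# Quads (subject, predicate, object, graph) mirroring the TriG example.
QUADS = [
    (":Monica", "ex:name", "Monica Murphy", ":G1"),
    (":Monica", "rdf:type", "ex:Person", ":G2"),
    (":G1", "swp:assertedBy", "_:w1", ":G3"),
    ("_:w1", "swp:authority", ":Chris", ":G3"),
    ("_:w1", "dc:date", "2003-10-02", ":G3"),
    (":G2", "swp:quotedBy", "_:w2", ":G3"),
    ("_:w2", "swp:authority", ":Chris", ":G3"),
    ("_:w2", "dc:date", "2003-09-03", ":G3"),
]


def objects(subject, predicate):
    """All object values for a subject/predicate pair, in any graph."""
    return [o for s, p, o, _ in QUADS if s == subject and p == predicate]


def provenance(subject, predicate, value):
    """Return (graph, stance, authority, date) for each graph holding the triple."""
    found = []
    for s, p, o, graph in QUADS:
        if (s, p, o) != (subject, predicate, value):
            continue
        for stance in ("swp:assertedBy", "swp:quotedBy"):
            for warrant in objects(graph, stance):
                authority = objects(warrant, "swp:authority")
                date = objects(warrant, "dc:date")
                found.append((
                    graph,
                    stance,
                    authority[0] if authority else None,
                    date[0] if date else None,
                ))
    return found


print(provenance(":Monica", "ex:name", "Monica Murphy"))
```

The stance distinguishes statements that an authority asserts from statements it merely quotes, which a client can use when deciding what to trust.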
2.4. Inconsistency Resolution
There is a lot of overlap between LOD datasets
- Places: DBpedia, GeoNames, Riese, …
- People: Freebase, LinkedMDB, DBLP, …
- Music: DBpedia, MusicBrainz, Jamendo, …
There are naturally lots of inconsistencies
- DBpedia: Person born at date X.
- Freebase: Person born at date Y.
- DBpedia: Band album X.
- MusicBrainz: Band album Y.
- GeoNames: City has geo-coordinates
- Freebase: City has geo-coordinates
Inconsistency Resolution Strategies
Pass it on: pass conflicting values to the user and let them decide.
Take the information: if a value is missing in dataset 1, use the value from dataset 2.
Trust your friends: prefer information from certain sources.
Cry with the wolves: choose the most common value.
Meet in the middle: take the average of all values.
Keep up to date: use the newest value.
SeeAlso: Bleiholder and Naumann: Conflict Handling Strategies in an Integrated Information System. WWW2006.
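The six strategies can each be written as a small function over a list of conflicting claims. This is a minimal sketch under assumptions: a claim is modelled as a (value, source, date) tuple with ISO dates, None marks a missing value, and the sample data is made up.

```python
from collections import Counter
from statistics import mean

# A claim is (value, source, date); ISO dates compare correctly as strings.


def pass_it_on(claims):
    """Hand every value to the user unchanged."""
    return [value for value, _, _ in claims]


def take_the_information(claims):
    """Use the first value that is not missing."""
    return next((v for v, _, _ in claims if v is not None), None)


def trust_your_friends(claims, preferred):
    """Use the value from the most preferred source that has one."""
    for source in preferred:
        for value, claim_source, _ in claims:
            if claim_source == source and value is not None:
                return value
    return None


def cry_with_the_wolves(claims):
    """Use the most common value."""
    values = [v for v, _, _ in claims if v is not None]
    return Counter(values).most_common(1)[0][0] if values else None


def meet_in_the_middle(claims):
    """Use the average of all numeric values."""
    values = [v for v, _, _ in claims if v is not None]
    return mean(values) if values else None


def keep_up_to_date(claims):
    """Use the value with the newest date."""
    dated = [c for c in claims if c[0] is not None]
    return max(dated, key=lambda c: c[2])[0] if dated else None


# Made-up claims for illustration only.
claims = [
    (100, "a", "2008-01-01"),
    (100, "b", "2007-05-01"),
    (130, "c", "2008-06-01"),
    (None, "d", "2008-09-01"),
]
print(cry_with_the_wolves(claims), meet_in_the_middle(claims), keep_up_to_date(claims))
```

Which strategy fits depends on the property: averaging suits numeric measurements, while it makes no sense for names or dates of birth.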
2.5. Explain Data Provenance and Fusion Steps
Tim Berners-Lee's "Oh, yeah?" button.
Existing Work
- Deborah McGuinness et al.: Inference Web: Portable Explanations for the Web.
- Chris Bizer: Web Information Quality Assessment Framework (WIQA)
Outlook
Lots of exciting open issues to solve!
- DIST-related technologies will be one of the hot topics of the next years (see for instance WWW2009)
Important for LOD
- Progress with publishing schema mappings on the Web
- Progress with data fusion
- Linked Data client applications that address all issues mentioned
Please submit such solutions and client applications to
- the Semantic Web Challenge 2009
- the Linked Data on the Web (LDOW2009) workshop at WWW2009
- the IJSWIS Special Issue on Linked Data