The state of the art in Linked Data
Advanced Semantic Web, Spring 2009
Joshua Shinavier
Literature Survey
• Linked Data
• Linking Open Data
• describing linked datasets
• growing the data web
• keeping Linked Data connected
• indexing and searching
• applications
• navigation
• state of the data web
Outline
2
• resource -- an item of interest
• URI -- global identifier for a resource
• representation -- data corresponding to the state of a resource
• information resource -- a “document” containing information
• non-information resource -- anything else
• associated description -- representation describing a Semantic Web resource
Linked Data overview
3
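To make the terms above concrete, here is a minimal Python sketch that builds (but does not send) an HTTP request for a machine-readable representation of a resource. The example URI, the function names, and the choice of `application/rdf+xml` as the requested media type are assumptions made for illustration; none of them come from the survey.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def document_uri(resource_uri: str) -> str:
    """Return the URI that is actually requested over HTTP.

    A fragment identifier (the part after '#') is not sent to the server,
    so a hash URI is retrieved via the document that contains it.
    """
    return urldefrag(resource_uri).url


def build_description_request(resource_uri: str,
                              media_type: str = "application/rdf+xml") -> Request:
    """Build (without sending) a GET request asking for an RDF representation."""
    return Request(document_uri(resource_uri), headers={"Accept": media_type})


# Hypothetical URI naming a person, i.e. a non-information resource.
alice = "http://example.org/people#alice"
request = build_description_request(alice)
print(request.full_url)              # the document to fetch
print(request.get_header("Accept"))  # the representation format asked for
```

The person is identified by the full URI, while the request goes to the document that may hold its associated description.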
• “bootstrap” the data web with large, interconnected data sets to reach a critical mass of semantics
• strict adherence to W3C standards
• identification and transportation (URI, HTTP) of resource descriptions
• interpretation (RDF, RDFS, OWL) of resource descriptions
• LOD grows as data providers:
• publish structured data on the Web
• set RDF links between entities in different data sources
• transition of the web from a distributed document repository into a universal, ubiquitous database [Erling 09]
The Linking Open Data initiative
4
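An RDF link between entities in different data sources can be sketched as a single triple whose subject and object live on different hosts. The Python below is only an illustration: the helper names are hypothetical, the GeoNames identifier is a placeholder rather than a looked-up value, and comparing host names is an assumed, simplistic way of telling data sources apart.

```python
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple of URIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def is_cross_source_link(subject: str, obj: str) -> bool:
    """Treat a link as crossing data sources when the two host names differ."""
    return urlparse(subject).netloc != urlparse(obj).netloc


# Illustrative URIs; the GeoNames identifier is a placeholder.
berlin_dbpedia = "http://dbpedia.org/resource/Berlin"
berlin_geonames = "http://sws.geonames.org/0000000/"

print(ntriple(berlin_dbpedia, OWL_SAME_AS, berlin_geonames))
print(is_cross_source_link(berlin_dbpedia, berlin_geonames))
```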
The LOD cloud
5
LOD data sets
6
Link sets in LOD
7
• voiD (Vocabulary of Interlinked Datasets) [Alexander, Cyganiak, Hausenblas, Zhao 09]
• describes data sets and the link sets between them
• DING (Dataset RankING) [Toupikov, Umbrich, Delbru, Hausenblas, Tummarello 09]
• ranking of linked datasets using formal descriptions
• modeling of the Linked Data domain [Halpin, Presutti 09]
Describing linked datasets
8
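A voiD description of two data sets and the link set between them might look roughly like the Turtle generated by the sketch below. The `void:Dataset`, `void:Linkset`, `void:target` and `void:linkPredicate` terms are taken from the voiD vocabulary as I understand it; the `http://example.org/void#` namespace, the resource names, and the Python helper are assumptions for illustration only.

```python
VOID_PREFIXES = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix : <http://example.org/void#> .\n"  # hypothetical namespace
)


def describe_linkset(name: str, source: str, target: str, predicate: str) -> str:
    """Return a Turtle snippet declaring two data sets and a link set between them."""
    return (
        f":{source} a void:Dataset .\n"
        f":{target} a void:Dataset .\n"
        f":{name} a void:Linkset ;\n"
        f"    void:target :{source}, :{target} ;\n"
        f"    void:linkPredicate {predicate} .\n"
    )


turtle = VOID_PREFIXES + "\n" + describe_linkset(
    "DBpedia_GeoNames", "DBpedia", "GeoNames", "owl:sameAs"
)
print(turtle)
```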
• network-shaped Entity Name System to enable systematic reuse of URIs [Bouquet, Stoermer, Cordioli, Tummarello 08]
• analogous to the role DNS plays for interlinking hypertext
• n2Mate framework [Peterson, Cregan, Atkinson, Brisbin 08]
• use social networking principles to facilitate vocabulary and instance reuse
• graph-based disambiguation of Semantic Web entities with idMesh [Cudré-Mauroux, Haghani, Jost, Aberer, de Meer 09]
Keeping Linked Data connected
9
• many conflated resources in DBpedia [Jaffri, Glaser, Millard 08]
• representative of LOD as a whole
• Co-Reference Resolution Service [Glaser, Jaffri, Millard 09]
• when co-reference is context-specific, owl:sameAs is inappropriate
• stores co-reference information as a first-class entity
• ontology-level alignment should precede data-level alignment [Nikolov, Uren, Motta 09]
Managing co-reference
10
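The idea of storing co-reference as a first-class, context-specific entity (rather than asserting a global owl:sameAs) can be sketched as follows. This is a hypothetical in-memory model written for illustration: the class names, the context labels, and the example URIs are assumptions, and it does not reproduce the interface of the Co-Reference Resolution Service.

```python
from dataclasses import dataclass, field


@dataclass
class CoreferenceBundle:
    """A set of URIs judged to denote the same thing within one context."""
    context: str
    uris: set = field(default_factory=set)


class CoreferenceStore:
    """Hypothetical in-memory store; not the API of any existing service."""

    def __init__(self):
        self.bundles = []

    def assert_same(self, context, *uris):
        """Record that the given URIs co-refer in the given context."""
        self.bundles.append(CoreferenceBundle(context, set(uris)))

    def equivalents(self, uri, context):
        """Return the URIs that co-refer with `uri` in `context`, excluding `uri`."""
        found = set()
        for bundle in self.bundles:
            if bundle.context == context and uri in bundle.uris:
                found |= bundle.uris
        return found - {uri}


store = CoreferenceStore()
store.assert_same("bibliography",
                  "http://example.org/a/smith", "http://example.org/b/j-smith")
store.assert_same("affiliation",
                  "http://example.org/a/smith", "http://example.org/c/member-7")
print(store.equivalents("http://example.org/a/smith", "bibliography"))
```

Because each bundle carries its context, an equivalence asserted for one purpose does not leak into queries made for another.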
• how to get data out there?
• challenges of the read-write Semantic Web
• user awareness of social context of data (e.g. licensing, privacy)
• view update problem
• is the wiki model applicable?
• incentives for posting data on the SW
• validating existing Linked Data with Vapour [Berrueta, Fernandez, Frade 08]
Growing the data web
11
• DBpedia [Auer, Bizer, Kobilarov, Lehmann, Cyganiak, Ives 07]
• extracts structured information from Wikipedia
• linking hub for the LOD cloud
• RDF Book Mashup [Bizer, Cyganiak, Gauss 07]
• product metadata from Amazon.com
Examples of LOD data sets
12
• Linked Movie Database [Hassanzadeh, Consens 09]
• combines data from IMDb, Freebase, OMDB, DBpedia, RottenTomatoes.com, Stanford Movie Database
• interlinked music datasets [Raimond, Sutton, Sandler 08]
• combines data from Jamendo on DBTune, BBC John Peel sessions, SBSimilarity, MusicBrainz, DBpedia, GeoNames
• links artists, albums, tracks, personal music collections
• links are generated based on similarity of resources and similarity of their neighbors
Music and movies as Linked Data
13
• the hypertext Web itself [Li, Zhao 08]
• extraction of semantic links from hypertext links and hierarchical relationships among Web documents
• RDF representation of the HTML DOM using SparqPlug [Coetzee, Heath, Motta 08]
• multimedia metadata
• interlinking multimedia fragments [Hausenblas, Troncy, Bürger, Raimond 09]
Other sources of data
14
• eXtensible Business Reporting Language (XBRL) [Garcia, Gil 09]
• mapping data to RDF and schemas to OWL facilitates interoperability
• large thesauri [Neubert 09]
• as interlinking hubs for professional communities
• enterprise data, e.g. technical documentation [Servant 08]
• MARC21 bibliographic records [Styles, Ayers, Shabir 08]
Other sources of data (cont.)
15
• D2R Server for customizable mappings from relational databases to ontologies [Bizer, Cyganiak 06]
• browser-based tools for defining RDB-to-RDF mappings [Zhou, Xu, Chen, Idehen 08]
• Triplify [Auer, Dietzold, Lehmann, Hellmann, Aumueller 09]
• from generic data silos to Linked Data using OpenLink Data Spaces [Idehen, Erling 08]
Mapping tools
16
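The general idea behind relational-to-RDF mapping tools can be sketched in a few lines of Python: each row becomes a resource with a URI, and each column value becomes a triple. The table layout, the base URI, and the use of `foaf:name` are assumptions chosen for illustration, and the sketch does not reflect the mapping language of D2R Server, Triplify, or any other tool listed above. Literal escaping is deliberately minimal (backslashes and double quotes only).

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def rows_to_ntriples(connection):
    """Map each row of the `person` table to one N-Triples line."""
    lines = []
    query = "SELECT id, name FROM person ORDER BY id"
    for person_id, name in connection.execute(query):
        subject = f"{BASE}person/{person_id}"
        # Escape backslashes first, then double quotes.
        literal = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{FOAF_NAME}> "{literal}" .')
    return lines


# A throwaway in-memory database standing in for an existing data silo.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany("INSERT INTO person (id, name) VALUES (?, ?)",
                       [(1, "Alice"), (2, "Bob")])

triples = rows_to_ntriples(connection)
for line in triples:
    print(line)
```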
• Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
• can be made Web-accessible with OAI2LOD Server [Haslhofer, Schandl 08]
• Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) [Van de Sompel, Lagoze, Nelson, Warner, Sanderson, Johnston 09]
• adheres to Web principles
Aggregated resources
17
• existing Linked Data datasets are more appropriate for machine than human consumption
• template-generated interlinks are of limited quality
• data from existing silos quickly becomes out of date
• need human involvement to grow the data web organically
User-driven Linked Data
18
User-driven Linked Data (cont.)
19
• direct modification using SPARQL/Update
• e.g. in Tabulator [Berners-Lee, Hollenbach, Lu, Presbrey, Prud’hommeaux, Schraefel 08]
• User Contributed Interlinking [Halb, Raimond, Hausenblas]
• semantic wikis
• Loomp [Roesch, Heese 09]
• semantic annotation of content using a text editor interface
• public data from existing social networks
• wrappers for Web 2.0 services [Passant 08]
• unifying personal identity across various networks [Rowe 09]
• Semantically Interlinked Online Communities (SIOC)
• integrating social media sites (forums, blogs, wikis, etc.) with the data web [Bojars, Passant, Cyganiak, Breslin 08]
• Meaning of a Tag (MOAT) ontology gives meaning to tags on Web 2.0 [Passant, Laublet 08]
User-driven Linked Data (cont.)
20
• usability (for humans) of Linked Data [Halb, Raimond, Hausenblas 08]
• current LOD datasets are primarily for machine consumption
• low semantic strength of current LOD link sets
• provenance information for Linked Data [Hartig 09]
• Open Data Commons license [Miller, Styles, Heath 08]
Usability and licensing
21
• W3C’s TAP semantic search [Guha, McCool 01]
• Swoogle [Ding, Finin, Joshi, Pan, Cost, Peng, Reddivari, Doshi, Sachs 04]
• adapts PageRank concept to ontologies
• SWSE [Hogan, Harth, Umbrich, Decker 07]
• MultiCrawler [Harth, Umbrich, Decker 06]
• RDF Gateway search
• Watson document-based search
• Falcons [Cheng, Ge, Wu, Qu 08]
• textual search using class hierarchies for query restriction
• Sindice Semantic Web index [Tummarello, Delbru, Oren 07]
Indexing and searching
22
• Silk link discovery framework [Volz, Bizer, Gaedke, Kobilarov 09]
• find relationships between entities within different data sources
• generation of owl:sameAs links
• value of Web of Data depends on the amount and quality of links between data sources
Link discovery
23
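Link discovery of this kind can be illustrated with a toy matcher that compares labels across two data sources and emits an owl:sameAs link when they are similar enough. This is a sketch under stated assumptions: the string-similarity metric, the 0.9 threshold, the function names, and the example URIs are all hypothetical, and the code does not use Silk's own link specification language.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source: dict, target: dict, threshold: float = 0.9) -> list:
    """Compare every source label with every target label.

    `source` and `target` map entity URIs to labels. A candidate pair
    becomes an owl:sameAs triple when its similarity reaches the threshold.
    """
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if label_similarity(s_label, t_label) >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri))
    return links


# Hypothetical entities from two data sources.
source = {"http://a.example/Berlin": "Berlin"}
target = {"http://b.example/berlin": "berlin",
          "http://b.example/paris": "Paris"}

links = discover_links(source, target)
print(links)
```

Comparing all pairs is quadratic in the number of entities, so a realistic system would need some form of blocking or indexing; the threshold directly trades link quantity against link quality.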
Navigation
24
• as on the early Web, it is easy to get “Lost in Hyperspace”
• Tabulator generic Linked Data browser [Berners-Lee, Chen, Chilton, Connolly, Dhanaraj, Hollenbach, Lerer, Sheets 06]
• encourage deployment of Linked Data
• test, refine and promote Linked Data standards
• faceted views over large-scale linked data with Virtuoso Cluster Edition [Erling 09]
• Explorator RDF browser [Araujo, Schwabe 09]
• exploratory search using direct manipulation
• DBpedia Mobile map view and faceted Linked Data browser [Becker, Bizer 08]
• explore the geospatial Semantic Web
• uses current GPS position as a starting point
• potential for Linked Data publishing
Navigation (cont.)
25
• Fenfire generic Linked Data browser [Hastrup, Cyganiak, Bojars 08]
• uses graph views rather than tables or outlines
• shows graph data as directly as possible
• related to Fentwine [Fallenstein, Lukka 04]
Navigation (cont.)
26
• Humboldt [Kobilarov, Dickinson 08]
• exploratory browsing
• faceted views
• “resource at a time”
• uses a “pivot” operation to refocus the view
Navigation (cont.)
27
• zLinks plugin [Bergman, Giasson 08]
• WordPress plugin with supporting server
• relates hypertext links with contextually relevant Linked Data
• WOWY (WordNet, OpenCyc, Wikipedia, YAGO)
• distinguish between types of resources
• disambiguate alternate senses
Navigation (cont.)
28
• mapping of Linked Data to a file system model [Schandl 09]
• enables use of this data within desktop applications
Navigation (cont.)
29
• how to use the data that is out there?
• emerging applications which exploit Linked Data [Hausenblas 09]
• integrating data sources related to drug and clinical trials [Jentzsch, Andersson, Hassanzadeh, Stephens, Bizer 09]
• mashups
• MashQL [Jarrar, Dikaiakos 09]
• the Internet is a database; a mashup is a query over that database
• benefit of specialized, independent Linked Data services acting together [Bojars, Passant, Giasson, Breslin 07]
Other applications
30
The gray area
31
• U-P2P framework for peer-to-peer linked data [Davoust, Esfandiari 09]
• data replication provides a measure of popularity
• Linked Data with Named Graphs
• e.g. interlinks with embedded provenance information [Zhao, Klyne, Shotton 08]
• Ripple scripting language [Shinavier 07]
• embeds Turing-complete programs in the Web of Data
• where are we with the Linked Data graph?
• size
• number and type of links
• usefulness to end users
• network characteristics
• single-point-of-access (e.g. DBpedia, GeoNames) vs. distributed datasets (e.g. FOAF-o-sphere, SIOC-land)
• syntactic and semantic analysis of the LOD dataset [Hausenblas, Halb, Raimond, Heath 08]
State of the data web
32
• today’s Linked Data is very different from the first-generation data web [Halpin 09]
• LOD data accounts for the vast majority of data
• power-law distributions are emerging
• data web is not growing organically
• Web standards are generally adhered to
• is Linked Data useful to ordinary users?
• sampling of Linked Data using Live.com query logs and FALCON-S semantic search engine
Statistics of the data web
33
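One rough way to check whether a frequency distribution looks like a power law is to fit a straight line to log(frequency) against log(rank). The Python below is a minimal sketch of that diagnostic on synthetic data; the function name and the data are assumptions, a log-log fit is not a rigorous statistical test, and nothing here reproduces the analysis in [Halpin 09].

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies must be positive; they are ranked from largest to smallest.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic frequencies proportional to 1 / rank, i.e. an exact power law.
synthetic = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic)
print(round(slope, 6))
```

A slope that stays roughly constant across the whole rank range is consistent with a power law; a curve that bends away from the line suggests a different distribution.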
• ...
Query popularity follows a power law
34
• ...
URI frequency... not so much
35
• ...
Data publishing lacks a “long tail”
36
A few dominant ontologies are emerging
37
# of URIs by vocabulary
(DBpedia bias)
38
# of URIs by domain name
• common network analysis techniques can be used to investigate interoperability and structural patterns of the LOD cloud [Rodriguez 09]
• results based on March 2009 statistics of the LOD data set graph:
• LOD graph is not strongly connected
• a diameter of 8 is large given the relatively small size of the cloud
• data sets have nearly identical incoming and outgoing link patterns (⇒ most links are reciprocal owl:sameAs links)
Graph analysis for the data web
39
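The two graph properties mentioned above, strong connectivity and diameter, can be computed on a small directed graph with breadth-first search. The sketch below uses a made-up four-node graph, not the LOD cloud, and it assumes the diameter is the longest finite shortest path between any ordered pair of nodes, which may differ from the definition used in [Rodriguez 09].

```python
from collections import deque


def shortest_path_lengths(graph, start):
    """Breadth-first search distances from `start` along directed edges."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def nodes_of(graph):
    """All nodes that appear as a link source or a link target."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    return nodes


def is_strongly_connected(graph):
    """True when every node can reach every other node."""
    nodes = nodes_of(graph)
    return all(len(shortest_path_lengths(graph, n)) == len(nodes) for n in nodes)


def diameter(graph):
    """Longest finite shortest-path length over all ordered node pairs."""
    return max(max(shortest_path_lengths(graph, n).values())
               for n in nodes_of(graph))


# Toy data set graph: an edge means "has links into".
toy = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
print(is_strongly_connected(toy))
print(diameter(toy))
```

In the toy graph, D has no outgoing links, so the graph is not strongly connected even though A and B link to each other reciprocally.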
Ranking and clustering of LOD data sets
40
41
• Original slide show:
• http://tw.rpi.edu/proj/portal.wiki/images/f/f0/LinkedData.pdf
• References:
• http://tw.rpi.edu/proj/portal.wiki/images/e/e0/LinkedDataSurvey.pdf
• BibTeX:
• http://tw.rpi.edu/proj/portal.wiki/images/3/37/LinkedDataSurvey.bbl