From Data to Knowledge - Profiling & Interlinking Web Datasets

24
From data to knowledge profiling and interlinking Web datasets Stefan Dietze L3S Research Center 31/07/14 1 Stefan Dietze

description

Talk at KEYSTONE/ESWC meeting 2014

Transcript of From Data to Knowledge - Profiling & Interlinking Web Datasets

Page 1: From Data to Knowledge - Profiling & Interlinking Web Datasets

From data to knowledge –

profiling and interlinking Web datasets

Stefan Dietze

L3S Research Center

31/07/14 1 Stefan Dietze

Page 2: From Data to Knowledge - Profiling & Interlinking Web Datasets

Recent work on Linked Data exploration/discovery/search

Entity interlinking & dataset interlinking recommendation

Dataset profiling

Data consistency & conflicts

Research areas

Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)

Application domains: education/TEL, Web archiving, …

Some projects

Introduction

http://www.l3s.de/

31/07/14 2

See also: http://purl.org/dietze

Stefan Dietze

Page 3: From Data to Knowledge - Profiling & Interlinking Web Datasets

…why are there so few datasets actually used?

Date reuse and in-links focused on trusted „reference graphs“ such as DBpedia, Freebase etc

Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone 300+ datasets, 50 bn triples)

Explanations?

Linked Data is awesome, but...

31/07/14

„HTTP-accessibility“ (SPARQL, URI-dereferencing)

„Structure“ & „Semantics“ (=> shared/linked vocabularies)

„Interlinked“

„Persistent“

Hm,

really?

Stefan Dietze

Page 4: From Data to Knowledge - Profiling & Interlinking Web Datasets

Linked data is more diverse than we think

SPARQL endpoint availability over time [Buil-Aranda et al 2013]

Accessibility of datasets?

Less than 50% of all SPARQL endpoints actually responsive at given point of time

“THE” SPARQL protocol? No, but many variants & subsets

“Semantics”, links, quality?

…data consistency? [? Yuan2014 ?]

…data accuracy (eg DBpedia)? [Paulheim2013]

…vocabulary reuse? [D’AquinWebSci13]

…schema compliance (RDFS, schemas) [HoganJWS2012]

Stefan Dietze

SPARQL Web-Querying Infrastructure: Ready for Action?,

Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich Pierre-Yves

Vandenbussch, International Semantic Web Conference 2013,

(ISWC2013).

Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A.,

Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.

Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic Web – ISWC

2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525

An empirical survey of Linked Data conformance. Hogan, A., Umbrich, J., Harth,

A., Cyganiak, R., Polleres, A., Decker., S., Journal of Web Semantics 14, 2012

Page 5: From Data to Knowledge - Profiling & Interlinking Web Datasets

Too many/diverse datasets, too little knowledge

Stefan Dietze 31/07/14

? ? ?

? ? ?

Which datasets are useful & trustworthy for case XY (eg „learning about the solar system“) ? Which topics are covered?

Types: which datasets describe statistics, videos, slides, publications etc?

Currentness, dynamics, accessability/reliability, data quantity & quality?

Page 6: From Data to Knowledge - Profiling & Interlinking Web Datasets

db:Astro. Objects

Dataset Metadata

Stefan Dietze 31/07/14

BIBO

AAISO

FOAF

contains

Entity disambiguation & linking [ESWC13]

Topic profile extraction [WWW13, ESCW14]

db:Astronomy

db:Astro. Objects

Dataset

Catalog/Registry

yov:Video

po:Programme

BBC Programme

<po:Programme …>

<po:Series>Wonders of the Solar System</.>

<po:Actor>Brian Cox</…>

</po:Programme…>

<yo:Video …>

<dc:title>Pluto & the

Dwarf Planets</dc:title>

</yo:Video…>

Yovisto Video

bibo:Fil bibo:Fi

bibo:Film

Schema mappings [WebSci13]

Data curation, linking and dataset profiling

Page 7: From Data to Knowledge - Profiling & Interlinking Web Datasets

Schemas/vocabularies on the Web: XKCD 927

Stefan Dietze 31/07/14

https://xkcd.com/927/ schemas & vocabularies

Page 8: From Data to Knowledge - Profiling & Interlinking Web Datasets

typeX typeX

Schema assessment and mapping

Co-occurence of data types (in 146 datasets: 144 Vocabularies, 588 highly overlapping types, 719 Properties)

Co-occurence after mapping into most frequent schemas

(201 frequent types mapped into 79

classes)

Assessing the Educational Linked Data Landscape,

D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science

2013 (WebSci2013), Paris, France, May 2013.

bibo:Film

bibo:Document

po:Programme sioc:Item

31/07/14

foaf:Document

yov:Video

typeX

Page 9: From Data to Knowledge - Profiling & Interlinking Web Datasets

LinkedUp Data Catalog in a nutshell

http://datahub.io/group/linked-education

http://data.linkededucation.org/linkedup/catalog/

RDF (VoID) dataset catalog: browse & query distributed datasets

Live information about endpoint accessibility

Federated queries using type mappings

Stefan Dietze 31/07/14

http://datahub.io/group/linked-education

Page 10: From Data to Knowledge - Profiling & Interlinking Web Datasets

31/07/14

Dataset interlinking recommendation Candidate datasets for interlinking?

13

t

Linkset1

Linkset2

Approach

Given dataset t, ranking datasets from D according to probability score (di, t) to contain linking candidates (entities)

Features:

Vocabulary overlap

Existing links (SNA)

Linking candidates likely if datasets share common (a) schema elements, or (b) links (friend of a friend)

Conclusions

Roughly 60% MAP for both approaches

Future work: quantity of links, extraction of experimental data from datasets…

Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,

Dietze, S., Recommending Tripleset Interlinking through a

Social Network Approach, The 14th International Conference

on Web Information System Engineering (WISE 2013),

Nanjing, China, 2013.

Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,

M.A., Dietze, S., Identifying candidate datasets for data

interlinking, in Proceedings of the 13th International

Conference on Web Engineering, (2013).

Rank

1 DBLP

2 ACM

3 OAI

4 CiteSeer

5 IBM

6 Roma

7 IEEE

8 Ulm

9 Pisa

?

?

Stefan Dietze

Page 11: From Data to Knowledge - Profiling & Interlinking Web Datasets

<yo:Video 8748720>

<dc:title>Pluto & the

Dwarf Planets</dc:title>

</yo:Video 8748720>

Video

Topics/categories addressed? Relatedness of resources/entities? (types, semantics)

<po:Programme519215>

<po:Series>Wonders of the Solar

System</po:Series>

<po:Episode>Emp. of the Sun</po:Episode>

<po:Actor>Brian Cox</po:Actor>

</po:Programme519215 >

Programme

Combining a co-occurrence-based and a semantic measure

for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.

Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended

Semantic Web Conference, (May 2013).

A Scalable Approach for Efficiently Generating

Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,

Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended

Semantic Web Conference (ESWC2014), Crete, Greece, (2014).

Dataset & entity linking: semantics of resources/datasets?

16 Stefan Dietze 31/07/14

<sioc:Item 2139393292>

<title>Planetary motion

& gravity</title>

</sioc:Item 2139393292>

Slideset

Pluto?

Page 12: From Data to Knowledge - Profiling & Interlinking Web Datasets

db:Pluto

(Dwarf

Planet)

db:Astrono-

mical Objects

db:Sun

Disambiguation/linking using background knowledge „Semantic relatetedness“ of resources?

db:Astronomy

17

<po:Programme519215>

<po:Series>Wonders of the Solar

System</po:Series>

<po:Episode>Emp. of the Sun</po:Episode>

<po:Actor>Brian Cox</po:Actor>

</po:Programme519215 >

Programme

<sioc:Item 2139393292>

<title>Planetary motion

& gravity</title>

</sioc:Item 2139393292>

Slideset

Video

Stefan Dietze 31/07/14

<yo:Video 8748720>

<dc:title>Pluto & the

Dwarf Planets</dc:title>

</yo:Video 8748720>

Page 13: From Data to Knowledge - Profiling & Interlinking Web Datasets

db:Pluto

(Dwarf

Planet)

db:Astrono-

mical Objects

db:Astronomy

Computation of connectivity scores between resources/entities

Method: combination of a

(i) semantic (graph-based) connectivity score (SCS) with

(ii) a Web co-occurence-based measure (CBM) (similar to NGD)

For (i): adaptation of Katz-Index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties)

db:Sun

SCS = 0.32

CBM = 0.24

http://purl.org/vol/doc/

http://purl.org/vol/ns/

19/09/2013 18 Stefan Dietze

Combining a co-occurrence-based and a semantic

measure for entity linking, B. P. Nunes, S. Dietze, M.A.

Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013

- 10th Extended Semantic Web Conference, (May 2013).

Entity linking: semantic relatedness

<sioc:Item 2139393292>

<title>Planetary motion

& gravity</title>

</sioc:Item 2139393292>

Slideset

<po:Programme519215>

<po:Series>Wonders of the Solar

System</po:Series>

<po:Episode>Emp. of the Sun</po:Episode>

<po:Actor>Brian Cox</po:Actor>

</po:Programme519215 >

Programme

<yo:Video 8748720>

<dc:title>Pluto & the

Dwarf Planets</dc:title>

</yo:Video 8748720>

Video

Page 14: From Data to Knowledge - Profiling & Interlinking Web Datasets

Entity linking: evaluation

31/07/14 19 Stefan Dietze

Evaluation based on USA Today News items (80.000 entity pairs)

Manually created gold standard (1000 entity pairs)

Baseline: Explicit Semantic Analysis (ESA)

=> CBM/SCS: „relatedness“; ESA: „similarity“

Precision/Recall/F1 for SCS, CBM, ESA.

Combining a co-occurrence-based and a semantic

measure for entity linking, B. P. Nunes, S. Dietze, M.A.

Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013

- 10th Extended Semantic Web Conference, (May 2013).

Page 15: From Data to Knowledge - Profiling & Interlinking Web Datasets

db:Astro. Objects

Dataset Metadata

Stefan Dietze 31/07/14

Entity disambiguation & linking [ESWC13]

Topic profile extraction [WWW13, ESCW14]

db:Astronomy

db:Astro. Objects

Dataset

Catalog/Registry

yov:Video

<yo:Video …>

<dc:title>Pluto & the

Dwarf Planets</dc:title>

</yo:Video…>

Yovisto Video

Extracting representative metadata („topic profile“) for datasets

Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets

Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance

A Scalable Approach for Efficiently Generating

Structured Dataset Topic Profiles, Fetahu, B.,

Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,

11th Extended Semantic Web Conference

(ESWC2014), Crete, Greece, (2014).

Dataset profiling: what‘s the data about?

Page 16: From Data to Knowledge - Profiling & Interlinking Web Datasets

Dataset profiling: approach

Stefan Dietze 31/07/14

1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)

2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)

3. Normalisation and ranking (using graphical-models such as PageRank with Priors, HITS with Priors and K-Step Markov)

=> Result: weighted dataset-topic profile graph

A Scalable Approach for Efficiently Generating

Structured Dataset Topic Profiles, Fetahu, B.,

Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,

11th Extended Semantic Web Conference

(ESWC2014), Crete, Greece, (2014).

Page 17: From Data to Knowledge - Profiling & Interlinking Web Datasets

Dataset profiling: exploring LOD datasets/topics in a nutshell http://data-observatory.org/lod-profiles/

Stefan Dietze 31/07/14

Automatic extraction of dataset “topics” [ESWC2014] => RDF/VoiD dataset profiles

Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)

Includes all (responsive) datasets of LOD Cloud

Page 18: From Data to Knowledge - Profiling & Interlinking Web Datasets

Dataset profiling: results evaluation

Stefan Dietze 31/07/14

NDCG (averaged over all datasets) .

Datasets & Ground Truth

Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood

Crowd-sourced topic indicators from datasets (keywords, tags)

Manual mapping to entities & category extraction (ranking according to frequency)

Baselines

1) LDA, 2) tf/idf (applied to entire datasets)

Topic extraction according to our approach, weighting/ranking based on term weight

Measure

NDCG @ rank l

Performance (time/NDCG) for different sampling strategies/sizes etc

Page 19: From Data to Knowledge - Profiling & Interlinking Web Datasets

31/07/14

dbp:Category:Royal_Medal_winners

dbp:Category:1955_births dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

dbp:Category:World_Wide_Web

What have these categories in common?

Stefan Dietze

Page 20: From Data to Knowledge - Profiling & Interlinking Web Datasets

31/07/14

Diversity of category profile for a single paper

Berners-Lee, Tim; Hendler, James, Ora Lassila (2001). "The Semantic Web". Scientific American Magazine.

person

document

dbp:Tim_Berners-Lee

dbp:Category:1955_births dbp:Category:People_from_London

dbp:Category:Buzzwords

dbp:Semantic_Web

dbp:Category:Semantic_Web

dbp:Category:Web_Services

dbp:Category:HTTP

dbp:Category:Unitarian_Universalists

first-level categories (dcterms:subject)

dbp:Category:World_Wide_Web

dbp:Category:Royal_Medal_winners

Stefan Dietze

Page 21: From Data to Knowledge - Profiling & Interlinking Web Datasets

31/07/14

http://data-observatory.org/led-explorer/

Type specific views on datasets/ categories

“Document” (foaf:document)

“Person “ (foaf:person)

“Course” (aaiso:course)

Currently applied to datasets in LinkedUp Catalog only (as schema mappings already available here)

Type-specific exploration of dataset categories

Stefan Dietze

Page 22: From Data to Knowledge - Profiling & Interlinking Web Datasets

May –September 2013 October 2013 – May 2014 May 2014 – October 2014

Series of Open Data Competitions to promote applications which exploit Linked Open Data

http://www.linkedup-challenge.org/

http://www.linkedup-project.eu/

LinkedUp Challenge: Linking Web Data (for Education)

“Vici” open just now

Final events at ISWC2014

Submission: 5 September

Page 23: From Data to Knowledge - Profiling & Interlinking Web Datasets

Conclusions & future work

Summary

Increasing amounts of data => require knowledge about nature and relationships of datasets

Profiling: scalable methods for extracting dataset metadata

Interlinking: connectivity of entities or datasets

Future work – LD evolution, preservation, consistency

In RDF graphs (eg LOD Cloud), „all“ nodes are connected

LD preservation: which datasets to preserve (entity „neighbourhood“)? => semantic relatedness

Link correctness in evolving LD: investigating impact of changes on link correctness

Application: informed preservation and enrichment strategies

31/07/14 37 Stefan Dietze

Page 24: From Data to Knowledge - Profiling & Interlinking Web Datasets

Thank you!

WWW See also (general)

http://linkedup-project.eu

http://linkededucation.org

http://data.l3s.de

http://purl.org/dietze

See also (data)

http://data.linkededucation.org

http://data.linkededucation.org/linkedup/catalog/

http://lak.linkededucation.org

31/07/14 38 Stefan Dietze

Besnik Fetahu (L3S)

Elena Demidova (L3S)

Bernardo Pereira Nunes (PUC Rio)

Marco Casanova (PUC Rio)

Luiz Andre Paes Leme (PUC Rio)

Giseli Lopes (PUC Rio)

Davide Taibi (CNR, IT)

Mathieu d’Aquin (Open University, UK)

and many more…

Acknowledgements