SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
-
Upload
ansgar-scherp -
Category
Technology
-
view
848 -
download
0
description
Transcript of SchemEX - Creating the Yellow Pages for the Linked Open Data Cloud
SchemEX Creating the Yellow Pages of the LOD Cloud
Mathias Konrath, Thomas Gottron, Ansgar Scherp
2 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Scenario• People who are politicians and actors
• Who else?• Where do they live? • Whom do they know?
3 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Problem
• Execute those queries on the LOD cloud
Relevant sources?
• No single federated query interface provided
“politicians and actors”
4 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Principle Solution
• Suitable index structure for looking up sources
“politicians and actors”
5 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
The Naive Approach
1. Download the entire LOD cloud2. Put it into a (really) large triple store3. Process the data and extract schema4. Provide lookup
- Big machinery- Late in processing the data- High effort to scale with LOD cloudCan we do smarter?
6 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Yes, we can …
7 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
The SchemEX Approach• Stream-based schema extraction• While crawling the data
Nquad-Stream
SchemaSchema-Extractor
Parser
Instance-Cache
LOD-CrawlerRDF-DumpTriple Store
NxParser
RDFRDBMS
FIFO
8 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
• Observe a quadruple stream from LD spider
Efficient Instance Cache
• Ring queue, backed up by a hash map• Organizes triples with same subject URI• Dismiss oldest, when cache full (FIFO)→ Runtime complexity O(1)
9 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
DS1 DS2 DS3 DS4 DS5 DSxData
sources
consistsOf
hasDataSource
Building the Schema and Index
…
EQC1 EQC2 EQCn Equivalenceclasses
hasEQClass p1 p2
…
TC1 TC2 TCm
Type clusters…
c2c1 c3 ckRDF
classes…
10 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set• 11 M triples, TimBL’s FOAF profile• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k• Compared SchemEX with reference schema• Index queries on all Types, TCs, EQCs• Good precision/recall ratio at 50k+
11 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Computing SchemEX: Full BTC 2011 Data
Cache size: 50 k
12 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Conclusions: SchemEX
• Stream-based approach to schema extraction • Scalable to arbitrary amount of Linked Data • Applicable on commodity hardware
(4GB RAM, standard single CPU)
• Lookup-index to find relevant data sources• Support federated queries on the LOD cloud
13 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
BACKUP
14 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
SchemEX Computation: Window Sizes
Crawled TimBL dataset (11M triples)
Runtime increases hardly withgreater window sizes
Memory consumption scaleswith window size
15 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
SchemEX Quality: Precision
16 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
SchemEX Quality: Recall
17 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Example Data Graph
18 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
Output Vocabulary: voiD
19 of 12 SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp
SchemEX Extraction: Progress PlotC
ount
# processed instances# processed instances
Type-clusterEquivalence classes