ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino
University of Milano-Bicocca ([email protected])
Outline
Motivation
• Dataset Understanding
• State of the Art
Summarization Framework
• Abstract Knowledge Patterns (AKPs)
• Pattern Minimalization
• Summary extraction, storage and presentation
Evaluation
• Compactness
• Informativeness
• User Study
Conclusion and Future Work
Introduction
What types of resources are there in a data set? How are they described? What types of resources are linked by a certain property and how frequently?
Motivation
Understanding the content of data sets is challenging
Looking at the ontology is not enough:
• Ontologies may be large and underspecified
  DBpedia 2015-04: 2795 properties, domain not specified for 259 properties, range not specified for 187 properties
• No information about the usage
Explorative queries are too expensive:
• Significant server overload
• High response time/timeout
State of the Art
Relevance Based Summarization (Troullinou et al. 2015; Zhang et al. 2007)
Identify subsets of data sets or ontologies that are considered to be more relevant
Pattern Based Approaches (Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012)
Aim at extracting knowledge patterns for a complete representation of the data set
Schema Induction (Völker and Niepert, 2011)
Induces a schema from the data and aims at extracting stronger assertions
Statistics about the dataset (Konrath et al. 2012; Langegger and W. Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies, http://lov.okfn.org/)
Aim at reporting statistics about the usage of different vocabularies, properties and types in the data
ABSTAT
ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework
A summary provides a complete but compact schema-level representation of a data set, made of:
• A set of Abstract Knowledge Patterns (AKPs)
  For example, an AKP represents the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace
• Statistics
  How many times does this pattern occur in the data set?
  How many times does a certain type occur as a minimal type, and how many times does the property occur in the data set?
Abstract Knowledge Patterns (AKPs)
ABSTAT adopts a minimalization mechanism based on minimal type patterns
Minimalization is based on a subtype graph which represents the data ontology
Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns
An AKP is a triple (C, P, D) such that C and D are types and P is a property
ABSTAT represents only a subset of the AKPs occurring in the data set: those whose types are minimal types
An example of how AKPs are extracted
[Figure: a graph with the types Person, Sportist, FootballPlayer, Lawyer and Artist (connected by subclassOf edges), the instances Jim Brown, Amal Clooney and George Clooney (connected to types by type edges), the literal "1936-02-17" of datatype XMLSchema#Date, and the properties hasWife and birthDate. The slides build the example up in steps.]
The (minimal-type) patterns extracted by ABSTAT are:
<Artist, hasWife, Lawyer>
<FootballPlayer, birthDate, XMLSchema#Date>
<Artist, birthDate, XMLSchema#Date> (shown only in the last step of the example)
Redundant patterns excluded from the summary:
<Person, hasWife, Person>
<Sportist, birthDate, XMLSchema#Date>
<Person, birthDate, XMLSchema#Date>
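The extraction in the example above can be sketched in a few lines of Python. This is only an illustration, not ABSTAT's implementation: the subclassOf edges, the type assertions and the owl:Thing fallback for untyped nodes are assumptions read off the figure labels, and the names SUBCLASS_OF, TYPES, minimal_types and extract_akps are hypothetical. The sketch covers only the first step of the example (one hasWife and one birthDate assertion).

```python
from collections import Counter

# Assumed subtype graph (child -> direct parents), read off the figure labels.
SUBCLASS_OF = {
    "Sportist": {"Person"},
    "FootballPlayer": {"Sportist"},
    "Lawyer": {"Person"},
    "Artist": {"Person"},
}

# Assumed typing assertions, including inferred supertypes.
TYPES = {
    "Jim Brown": {"Person", "Sportist", "FootballPlayer"},
    "George Clooney": {"Person", "Artist"},
    "Amal Clooney": {"Person", "Lawyer"},
}

# Literals are represented directly by their datatype.
DATATYPES = {"XMLSchema#Date"}

# Relational assertions: (subject, property, object or datatype of the literal).
ASSERTIONS = [
    ("George Clooney", "hasWife", "Amal Clooney"),
    ("Jim Brown", "birthDate", "XMLSchema#Date"),
]


def ancestors(t):
    """All supertypes of t reachable through subclassOf edges."""
    seen, stack = set(), list(SUBCLASS_OF.get(t, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen


def minimal_types(types):
    """Keep only the types that are not a supertype of another type in the set."""
    return {t for t in types
            if not any(t in ancestors(other) for other in types if other != t)}


def node_types(node):
    """Minimal types of an instance, or the datatype itself for a literal."""
    if node in DATATYPES:
        return {node}
    # Assumption: untyped instances fall back to owl:Thing.
    return minimal_types(TYPES.get(node, {"owl:Thing"}))


def extract_akps(assertions):
    """Count one (C, P, D) pattern per pair of minimal subject/object types."""
    counts = Counter()
    for s, p, o in assertions:
        for c in node_types(s):
            for d in node_types(o):
                counts[(c, p, d)] += 1
    return counts


akps = extract_akps(ASSERTIONS)
for pattern, n in sorted(akps.items()):
    print(pattern, n)
```

Because only minimal types are used, patterns such as <Person, hasWife, Person> never enter the result; they remain derivable from the subtype graph.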
Summary Extraction Workflow
ABSTAT User Interfaces
ABSTAT homepage (http://abstat.disco.unimib.it)
ABSTATBrowse (http://abstat.disco.unimib.it/browse)
ABSTATSearch (http://abstat.disco.unimib.it/search)
SPARQL Endpoint (http://abstat.disco.unimib.it/sparql)
Experimental Evaluation
Summary compactness
• Number of patterns in the summary vs. number of triples in the data set
• Comparison with a similar approach without minimalization
Summary informativeness
• Insights about the semantics of the properties
• Small-scale user study
Compactness
Data sets and summaries statistics

Dataset              Relational  Typing  Assertions  Types (Ext.)  Properties (Ext.)  Patterns
DBpedia Core 2014    40.5M       29.7M   70.1M       869 (85)      1439 (15)          171340
DBpedia 3.9 Infobox  96.3M       19.7M   116.4M      821 (58)      62572 (14)         732418
Linked Brainz        180.1M      39.6M   221.7M      21 (9)        33 (0)             161

Reduction Rate = (Number of patterns) / (Number of assertions in the data set)

Reduction rate (LOUPE is similar to ABSTAT without minimalization)

Dataset              ABSTAT        LOUPE
DBpedia Core 2014    0.002         0.01
Linked Brainz        6.72 × 10^-7  7.1 × 10^-7

• Minimalization produces more compact summaries
• The advantage of minimalization is more observable for data sets with richer subtype graphs and typing assertions
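As a quick check of the reduction rate defined above (number of patterns divided by number of assertions), here is a minimal Python sketch using the DBpedia Core 2014 figures from the statistics table. The function name reduction_rate is hypothetical, and 70.1M is taken as 70,100,000.

```python
def reduction_rate(num_patterns, num_assertions):
    """Ratio of patterns in the summary to assertions in the data set."""
    if num_assertions <= 0:
        raise ValueError("the data set must contain at least one assertion")
    return num_patterns / num_assertions


# DBpedia Core 2014: 171340 patterns over roughly 70.1M assertions.
rate = reduction_rate(171_340, 70_100_000)
print(f"{rate:.4f}")
```

The result rounds to 0.002, which agrees with the value reported for ABSTAT on DBpedia Core 2014.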
Informativeness
ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set
Dataset              Missing Domain (%)  Missing Range (%)  Missing Domain & Range (%)
DBpedia Core 2014    259 (18%)           187 (13%)          48 (3.3%)
DBpedia 3.9 Infobox  61368 (98%)         61309 (98%)        61161 (97%)
Linked Brainz        13 (39%)            15 (45%)           13 (39%)
Inferred domain and range for DBpedia Core 2014
[Bar chart: Extracted minimal types (domain). Y-axis: number of minimal types (0 to 160). X-axis: dbo:type, dbo:successor, dbo:division, dbo:isPartOf, dbo:series, dbo:gender, dbo:source, dbo:localAutho..., dbo:royalAnthem, dbo:mainInterest, dbo:chairLabel, dbo:format, dbo:manage..., dbo:related, dbo:hasVariant, dbo:variantOf, dbo:namedAfter]
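The chart above counts, per property, the minimal types observed in the subject position. A minimal Python sketch of that grouping follows, assuming a summary is available as a mapping from (C, P, D) patterns to occurrence counts. The patterns are the ones from the extraction example; the counts, the AKPS variable and the observed_domains function are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical summary: (subject type, property, object type) -> occurrences.
AKPS = {
    ("Artist", "hasWife", "Lawyer"): 1,
    ("FootballPlayer", "birthDate", "XMLSchema#Date"): 1,
    ("Artist", "birthDate", "XMLSchema#Date"): 1,
}


def observed_domains(akps):
    """For each property, count how often each minimal type appears as subject."""
    domains = defaultdict(lambda: defaultdict(int))
    for (subj_type, prop, _obj_type), occurrences in akps.items():
        domains[prop][subj_type] += occurrences
    return {prop: dict(types) for prop, types in domains.items()}


domains = observed_domains(AKPS)
for prop, types in sorted(domains.items()):
    print(prop, len(types), types)
```

For a property with no declared domain, the resulting set of subject types gives a usage-based hint about what the domain could be; the same grouping on the object position would do the same for the range.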
User Study: Setup
Can ABSTAT be useful to support query formulation?
• Queries to DBpedia 3.9 Infobox from the Questions and Answering in Linked Open Data benchmark
• 5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
• 20 participants, 2 groups:
  abstat group uses ABSTAT (after 20 min of training)
  control group does not use ABSTAT
• Measures:
  Time needed to formulate the query
  Accuracy of the answer
User Study: Questionnaire
User Study: Results
Group    Avg. Completion Time (s)  Accuracy

Query 1 (length 1): How many employees does Google have?
abstat   358.9                     0.9
control  380.6                     0.8

Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin.
abstat   356.3                     1
control  346.9                     0.8

Query 3 (length 2): Which professional surfers were born in Australia?
abstat   476.6                     0.6
control  234.24                    0.7

Query 4 (length 3): In which films directed by Gary Marshall was Julia Roberts starring?
abstat   333.4                     0.9
control  445.6                     0.9

Query 5 (length 3): Give me all books by William Goldman with more than 300 pages.
abstat   233.4                     1
control  569.8                     0.7

An independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005
User Study: Results Analysis
Users in the abstat group benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
• Accuracy increases with increasing difficulty, and the tasks are performed faster
• The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing
Participants in the control group used two strategies to answer the queries:
• Directly accessing the public web pages describing the DBpedia named individuals mentioned in the query
• Submitting explorative SPARQL queries to the endpoint (very few participants)
Conclusion and Future Work
ABSTAT: ontology-driven summarization with minimalization
• Considerable reduction rate and promising results about the informativeness of the summary
• Currently extending the user study
• Apply relevance-oriented summarization methods based on connectivity analysis
• The ABSTAT summary should consider the inheritance of properties to produce even more compact summaries
• We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)
• APIs available soon
Thank you for your attention!
www.abstat.unimib.it
Feedback is welcome!