Towards a Top-K SPARQL Query Benchmark Generator

Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell’Aglio¹, and Alessandro Bozzon² (¹ Politecnico di Milano, ² TU Delft)


Page 1: Towards a Top-K SPARQL Query Benchmark Generator

Towards a Top-K SPARQL Query Benchmark Generator

Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell’Aglio¹, and Alessandro Bozzon²

¹ Politecnico di Milano, ² TU Delft

Page 2: Towards a Top-K SPARQL Query Benchmark Generator

Agenda

• Rankings, rankings everywhere
• What are top-k SPARQL queries
• Jim Gray's Benchmarking Principles
• The problem
• Some definitions
• Research hypothesis
• Background work: DBpedia SPARQL Benchmark
• Our proposal: Top-k DBPSB
• Preliminary evaluation
• Conclusions

Page 3: Towards a Top-K SPARQL Query Benchmark Generator


Rankings, rankings everywhere

Page 4: Towards a Top-K SPARQL Query Benchmark Generator


Rankings, rankings everywhere

Page 5: Towards a Top-K SPARQL Query Benchmark Generator


Rankings, rankings everywhere

Page 6: Towards a Top-K SPARQL Query Benchmark Generator


A very intuitive and simplified example:

• Top 3 largest countries (by both area and population)

Why do we need to optimize them?

Page 7: Towards a Top-K SPARQL Query Benchmark Generator

The standard way: materialize-then-sort scheme

• Input: all 242 countries
• Compute the scoring function that accounts for area and population
• Sort all the 242 countries
• Fetch the 3 best results
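The materialize-then-sort scheme can be sketched in a few lines of Python. This is a minimal illustration only: the country tuples, the equal weights and the function names are assumptions made for the example, not data or code from the slides.

```python
def normalize(value, lo, hi):
    """Min-max normalization of value into [0, 1]."""
    return (value - lo) / (hi - lo)


def materialize_then_sort(items, k, weights=(0.5, 0.5)):
    """Score every item, sort the full list, then keep the best k.

    items: list of (name, area, population) tuples (hypothetical data).
    """
    areas = [area for _, area, _ in items]
    pops = [pop for _, _, pop in items]
    lo_a, hi_a = min(areas), max(areas)
    lo_p, hi_p = min(pops), max(pops)
    scored = []
    for name, area, pop in items:
        score = (weights[0] * normalize(area, lo_a, hi_a)
                 + weights[1] * normalize(pop, lo_p, hi_p))
        scored.append((score, name))
    # The whole result set is sorted even though only k rows are needed.
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up values, for illustration only.
countries = [("A", 10, 500), ("B", 90, 140), ("C", 50, 400),
             ("D", 100, 450), ("E", 20, 50)]
top3 = materialize_then_sort(countries, k=3)
```

The cost is dominated by scoring and sorting every input row, which is what makes the scheme wasteful when k is much smaller than the input.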

Page 8: Towards a Top-K SPARQL Query Benchmark Generator

Innovative optimization: split-and-interleave scheme

• Sorted access to the countries, ordered by population
• Incrementally order partial results by area
• Fetch the 3 best results

[Figure: operator pipeline annotated with the cardinalities 242 (countries), 9 and 3]

Page 9: Towards a Top-K SPARQL Query Benchmark Generator

State-of-the-art

• Database method
  – Split the evaluation of the scoring function into single criteria
  – Interleave them with other operators
  – Use partial orders to incrementally construct the final order
• Standard assumptions:
  – Monotone scoring function
  – Each criterion is evaluated as a [0,1] number (normalization)
• Optimized for the case of fast sorted access for each criterion
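One well-known algorithm that relies on exactly these assumptions (monotone scoring function, normalized criteria, sorted access per criterion) is Fagin's threshold algorithm. The Python sketch below illustrates that algorithm; it is not necessarily the specific method the slides refer to, and the data, weights and function name are assumptions made for the example.

```python
import heapq


def threshold_top_k(sorted_lists, lookup, k, weights):
    """Threshold-algorithm sketch for a monotone weighted sum.

    sorted_lists: one list per criterion of (item, score) pairs, each
                  sorted by score in descending order (sorted access).
                  Every item is assumed to appear in every list.
    lookup:       dict item -> tuple of all its scores (random access).
    Scores are assumed to be already normalized to [0, 1].
    """
    def total(scores):
        return sum(w * s for w, s in zip(weights, scores))

    best = []       # min-heap of (total score, item), at most k entries
    seen = set()
    accesses = 0    # number of sorted accesses performed
    for row in zip(*sorted_lists):
        for item, _ in row:
            accesses += 1
            if item not in seen:
                seen.add(item)
                heapq.heappush(best, (total(lookup[item]), item))
                if len(best) > k:
                    heapq.heappop(best)
        # By monotonicity, no unseen item can score above this threshold.
        threshold = total(score for _, score in row)
        if len(best) == k and best[0][0] >= threshold:
            break
    ranked = [item for _, item in sorted(best, reverse=True)]
    return ranked, accesses


# Made-up normalized scores for two criteria.
lookup = {"a": (0.9, 0.7), "b": (0.8, 0.9), "c": (0.7, 0.1),
          "d": (0.2, 0.6), "e": (0.1, 0.3)}
by_first = sorted(((i, s[0]) for i, s in lookup.items()),
                  key=lambda p: p[1], reverse=True)
by_second = sorted(((i, s[1]) for i, s in lookup.items()),
                   key=lambda p: p[1], reverse=True)
ranked, accesses = threshold_top_k([by_first, by_second], lookup,
                                   k=2, weights=(0.5, 0.5))
```

On this toy input the loop stops after two rounds of sorted access instead of scanning all five items per list, which is the saving the split-and-interleave idea aims for.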

Page 10: Towards a Top-K SPARQL Query Benchmark Generator

Top-k SPARQL queries

E.g., the 10 most recent books written by the youngest authors

SELECT ?book ?author
       (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)
WHERE {
  ?book dbp:isbn ?v .
  ?book dbp:author ?author .
  ?book dbp:releaseDate ?releaseDate .
  ?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10

• Scoring function as a SELECT expression
• Normalization casts each value into [0, 1]: $\mathrm{norm}(x) = \frac{x - \min_x}{\max_x - \min_x}$
• ORDER BY and LIMIT: order the results and slice the first 10
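The normalization and the scoring function above can be tried out in Python. This is a sketch under assumptions: dates are compared through their day ordinals, a constant criterion is mapped to 0, and the date ranges are invented for the example.

```python
from datetime import date


def norm(value, lo, hi):
    """Min-max normalization: cast value into [0, 1]."""
    if hi == lo:
        return 0.0  # assumption: a constant criterion contributes nothing
    return (value - lo) / (hi - lo)


def score(release, birth, release_range, birth_range):
    """0.5*norm(releaseDate) + 0.5*norm(dateOfBirth), dates as day ordinals."""
    r = norm(release.toordinal(), *(d.toordinal() for d in release_range))
    b = norm(birth.toordinal(), *(d.toordinal() for d in birth_range))
    return 0.5 * r + 0.5 * b


# Hypothetical (min, max) ranges, as step 2 of the method would provide them.
release_range = (date(2000, 1, 1), date(2010, 1, 1))
birth_range = (date(1950, 1, 1), date(1990, 1, 1))

best = score(date(2010, 1, 1), date(1990, 1, 1), release_range, birth_range)
worst = score(date(2000, 1, 1), date(1950, 1, 1), release_range, birth_range)
```

A later release date and a later date of birth both raise the score, so ordering by descending score returns recent books by young authors first.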

Page 11: Towards a Top-K SPARQL Query Benchmark Generator

The Problem

Set up a benchmark for top-k SPARQL queries that
• Resembles reality
• Stresses the features of top-k queries
  – Syntax: SELECT expression + ORDER BY + LIMIT
  – Performance: hit SPARQL engines where it hurts

Page 12: Towards a Top-K SPARQL Query Benchmark Generator

Jim Gray on Benchmarking Principles

• Relevant: Measures performance and price/performance of systems when performing typical operations within the problem domain

• Portable: Easy to implement on many different systems

• Scalable: Applies to small and large computer systems

• Simple: understandable

Results


Page 13: Towards a Top-K SPARQL Query Benchmark Generator

Definitions

E.g., the 10 most recent books written by the youngest authors

[Figure: query graph over ?book, ?author, ?releaseDate and ?birthDate, connected by the properties author, releaseDate and dateOfBirth, and annotated with the labels Rankable Variables, Scoring Variables, Rankable Data Properties, Rankable Triple Patterns and Triple Patterns]

Scoring Function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate)

Page 14: Towards a Top-K SPARQL Query Benchmark Generator

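Research Hypothesis

• H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark
  – H.1: ++ rankable variables, ++ execution time (the more rankable variables, the longer the execution time)
  – H.2: ++ scoring variables, ++ execution time (the more scoring variables, the longer the execution time)
  – H.3: +/- LIMIT, = execution time (changing the LIMIT does not change the execution time)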

Page 15: Towards a Top-K SPARQL Query Benchmark Generator

DBpedia SPARQL Benchmark

• A method to generate a SPARQL benchmark from DBpedia and its query logs
• It can be applied to other datasets and other query logs
• Characteristics
  – Resembles reality
  – Stresses SPARQL features

[Figure: DBPSB workflow with the boxes Query Logs, Query Analysis and Clustering, Dataset Generation, Auxiliary Queries, Query Templates and Query Instances]

Page 16: Towards a Top-K SPARQL Query Benchmark Generator

Proposed Solution: Top-k DBPSB

• An extension of DBPSB
  – Auxiliary queries are extended with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables
• It is also a method
  – Can be applied to other benchmarks obtained using the DBPSB method

[Figure: Top-k DBPSB workflow: Auxiliary Queries → Find Rankable Variables → Compute Max and Min Value → Generate Scoring Function → Generate Top-k Queries → Top-k Queries]

Page 17: Towards a Top-K SPARQL Query Benchmark Generator

A DBPSB Auxiliary Query

SELECT DISTINCT ?v
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
}

Page 18: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 1a

To generate queries with 1 rankable variable

SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
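Step 1a can be automated by deriving the property-discovery query from the triple patterns of an auxiliary query. The Python helper below is a hypothetical sketch (the function name and the string-based templating are assumptions, and PREFIX declarations are omitted); it groups by ?p, which SPARQL 1.1 requires when ?p is selected next to an aggregate.

```python
def rankable_property_query(aux_patterns, subject="?v6"):
    """Build a step-1a style query from the patterns of an auxiliary query.

    aux_patterns: list of triple-pattern strings, each ending with " ."
    subject:      the variable whose numeric/date properties are counted
    """
    body = list(aux_patterns)
    body.append(f"{subject} ?p ?o .")
    body.append("FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)")
    lines = ["SELECT ?p (COUNT(?p) AS ?n)", "WHERE {"]
    lines += ["  " + line for line in body]
    lines += ["}", "GROUP BY ?p", "ORDER BY DESC(?n)"]
    return "\n".join(lines)


aux = [
    "?v6 rdf:type ?v .",
    "?v6 dbp:name ?v0 .",
    "?v6 dbp:pages ?v1 .",
    "?v6 dbp:isbn ?v2 .",
    "?v6 dbp:author ?v3 .",
]
query = rankable_property_query(aux)
```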

Page 19: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 1b

Results: not all sortable properties resemble reality
• Pages
• ISBN
• NumberOfPages
• Year
• Volume
• wikiPageID
• releaseDate
• …

NOTE: it requires manual selection

Page 20: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 1c

To generate queries with 2 rankable variables

SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  ?o ?p1 ?o1 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
}
GROUP BY ?p ?p1
ORDER BY DESC(?n)

NOTE: in practice we loop through all properties of ?v6 whose object is an IRI in decreasing frequency
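One possible reading of the note above is that the generic ?p is replaced, one property at a time, by each IRI-valued property of ?v6, starting with the most frequent. The Python sketch below follows that reading; the function name and the property counts are invented for illustration and PREFIX declarations are omitted.

```python
def two_hop_queries(aux_patterns, property_counts, subject="?v6"):
    """Return (property, query) pairs, most frequent property first.

    property_counts: dict mapping an IRI-valued property of the subject
                     to how often it occurs (hypothetical numbers).
    """
    ordered = sorted(property_counts, key=property_counts.get, reverse=True)
    queries = []
    for prop in ordered:
        body = list(aux_patterns) + [
            f"{subject} {prop} ?o .",
            "?o ?p1 ?o1 .",
            "FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)",
        ]
        query = "\n".join(
            ["SELECT ?p1 (COUNT(?p1) AS ?n)", "WHERE {"]
            + ["  " + line for line in body]
            + ["}", "GROUP BY ?p1", "ORDER BY DESC(?n)"]
        )
        queries.append((prop, query))
    return queries


aux = ["?v6 rdf:type ?v .", "?v6 dbp:isbn ?v2 ."]
counts = {"dbp:publisher": 40, "dbp:author": 120}  # made-up frequencies
queries = two_hop_queries(aux, counts)
```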

Page 21: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 1d

Results
• author, wikiPageID
• author, wikiPageRevisionID
• …
• author, dateOfBirth
• …
• publisher, wikiPageID
• publisher, wikiPageRevisionID
• …
• publisher, founded
• …

NOTE: it requires manual selection

Page 22: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 2

SELECT (max(?o) as ?max) (min(?o) as ?min)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:pages ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}

NOTE: the filter clause should not be necessary, but DBpedia is very dirty …
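The effect of the FILTER on dirty data can be mimicked on the client side. The Python sketch below is only an analogy to the query above, not part of the method: it assumes the values of one property have already been fetched and parsed, and that the surviving values are mutually comparable (all numbers or all datetimes).

```python
from datetime import datetime


def is_rankable(value):
    """Mirror of the FILTER: keep numbers and datetimes, drop the rest."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; exclude it
    return isinstance(value, (int, float, datetime))


def min_max(values):
    """Return (min, max) over the rankable values, or None if none remain."""
    clean = [v for v in values if is_rankable(v)]
    if not clean:
        return None
    return min(clean), max(clean)


# Made-up "pages" values, including the kind of noise the note refers to.
pages = [320, "n/a", 12, "ca. 200", 1024]
bounds = min_max(pages)
```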

Page 23: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 3

• Choose the number of rankable variables
  – Max three
  – E.g., books and authors
• Choose the number of scoring variables per rankable variable
  – Max three
  – E.g., releaseDate for books and dateOfBirth for authors
• Look up the min and the max of each scoring variable to normalise it
• Choose the weights
  – The weights should sum to 1
• Assemble the scoring function
  – E.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
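Assembling the scoring function is mechanical once the scoring variables and weights are chosen. The following Python sketch is a hypothetical helper: it defaults to equal weights, which is an assumption of the example, and rejects weights that do not sum to 1.

```python
def scoring_function(variables, weights=None):
    """Return a weighted sum of norm(...) terms as a SPARQL-like expression.

    variables: scoring variables, e.g. ["?releaseDate", "?dateOfBirth"]
    weights:   one weight per variable; defaults to equal weights
    """
    if weights is None:
        weights = [1 / len(variables)] * len(variables)
    if len(weights) != len(variables):
        raise ValueError("one weight per scoring variable is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("the weights should sum to 1")
    return " + ".join(f"{w:g}*norm({v})" for w, v in zip(weights, variables))


expression = scoring_function(["?releaseDate", "?dateOfBirth"])
```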

Page 24: Towards a Top-K SPARQL Query Benchmark Generator

Top-k DBPSB step 4

SELECT ?v6 ?v3
       (0.5*norm(?o1) + 0.5*norm(?o2) AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2) = xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
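Since norm is not a built-in SPARQL function, a generator has to expand it into plain arithmetic using the min and max found in step 2. The Python sketch below shows one way to do that for numeric scoring variables; the helper names, the example property and the bounds are assumptions, and date-valued variables would need an engine-specific conversion that is not covered here.

```python
def expand_norm(var, lo, hi):
    """Inline norm(var) as (var - min) / (max - min) for numeric values."""
    return f"(({var} - {lo}) / ({hi} - {lo}))"


def top_k_query(select_vars, patterns, criteria, k):
    """Assemble a top-k query string.

    criteria: list of (weight, variable, min, max) tuples, one per
              scoring variable (bounds as computed in step 2).
    """
    score = " + ".join(
        f"{w:g}*{expand_norm(v, lo, hi)}" for w, v, lo, hi in criteria
    )
    lines = [f"SELECT {' '.join(select_vars)} ({score} AS ?s)", "WHERE {"]
    lines += ["  " + p for p in patterns]
    lines += ["}", "ORDER BY DESC(?s)", f"LIMIT {k}"]
    return "\n".join(lines)


patterns = [
    "?v6 dbp:author ?v3 .",
    "?v6 dbp:pages ?o1 .",
    "?v6 dbp:volume ?o2 .",
]
# Made-up bounds for two numeric scoring variables.
criteria = [(0.5, "?o1", 10, 2000), (0.5, "?o2", 1, 40)]
query = top_k_query(["?v6", "?v3"], patterns, criteria, k=10)
```

Varying k and the number of entries in criteria yields the query families needed to probe the LIMIT and scoring-variable hypotheses.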

Page 25: Towards a Top-K SPARQL Query Benchmark Generator

Preliminary Results 1/2

• We tested our hypotheses using
  – Virtuoso Open-Source Edition version 6.1.6
  – Jena-TDB version 2.10.1
  – DBpedia 10%
• In this setting, Top-k DBPSB generates queries
  – adequate to test
    • H.2 ++ Scoring variable ++ execution time
    • H.3 +/- LIMIT = execution time
  – only partially adequate to test
    • H.1 ++ Rankable variable ++ execution time

Page 26: Towards a Top-K SPARQL Query Benchmark Generator

Preliminary Results 2/2

• H.1 ++ Rankable variable ++ execution time
  – confirmed in some cases
  – not confirmed when aggregating by query across engines
  – confirmed when aggregating by engine across queries
• H.2 ++ Scoring variable ++ execution time
  – confirmed for Jena TDB
  – confirmed in most cases for Virtuoso
• H.3 +/- LIMIT = execution time
  – confirmed for Jena TDB
  – confirmed in most cases for Virtuoso

Page 27: Towards a Top-K SPARQL Query Benchmark Generator

Conclusions

• Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that
  – Resemble reality
  – Hit SPARQL engines where it hurts
• More investigation is required
  – Better understand the relationship between the number of rankable variables and the execution time
    • E.g., cardinalities, selectivity and joins
  – Include other known features of top-k queries that impact execution time
    • E.g., correlation of the orders induced on the result set by the different scoring variables in the scoring function
    • E.g., distribution of the values matched by the scoring variables

Page 28: Towards a Top-K SPARQL Query Benchmark Generator

Thank you! Any questions?

Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell’Aglio¹, and Alessandro Bozzon²

¹ Politecnico di Milano, ² TU Delft

Page 29: Towards a Top-K SPARQL Query Benchmark Generator

Preliminary Results - details