Structure Query Processing – Data models – Query models – Approaches – Challenges Keyword...

Page 1: Structure Query Processing – Data models – Query models – Approaches – Challenges Keyword query processing on RDF Structured query processing on RDF Structured.

Structure
• Query Processing
  – Data models
  – Query models
  – Approaches
  – Challenges
• Keyword query processing on RDF
• Structured query processing on RDF
• Structured query processing on the Web
  – Routing needs to linked data sources
  – Linked data query processing

Page 2:

Query Processing

Page 3:

Query Processing

[Figure: Query and Data, connected by Matching]

Page 4:

Data / Data Models
• Textual
  – Bag-of-words
  – Represents documents, text in structured data, …, real-world objects (captured as structured data)
  – Misses “structured information”
    • in text, e.g. linguistic structure, hyperlinks, (positional information)
    • in structured data

Example text: In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.

Bag-of-words, term (statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, ……
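
The bag-of-words representation on this slide can be sketched in a few lines of Python. This is only an illustration: the tokenization rule and the small stop-word list are assumptions, not something the slide specifies. The result keeps terms and their counts and discards word order, which is the positional and structural information the slide says is missed.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"in", "with", "for", "the", "of", "have", "are", "to", "and", "a"}

def bag_of_words(text):
    """Return term -> frequency, ignoring order, position and structure."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag = bag_of_words(
    "In combination with Cloud Computing technologies, promising solutions "
    "for the management of big data have emerged. Existing industry "
    "solutions are able to support complex queries."
)
print(bag.most_common(3))
```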

Page 5:

Data / Data Models
• Textual
• Structured
  – Resource Description Framework (RDF)
  – Represents real-world objects, services, applications, …, documents
  – Resource attribute values and relationships between resources
  – Schema

Page 6:

Data / Data Models
• Textual
• Structured
• Hybrid
  – Textual and structured data

Page 7:

Query / Query Models
• Unstructured
• Fully-structured
• Hybrid: unstructured + structured

Page 8:

Query / Query Models
• Unstructured
  – NL
  – Keywords: book price 30

Page 9:

Query / Query Models
• Unstructured
• Fully-structured
  – SQL: select, from, where
    SELECT title, price
    FROM Books
    WHERE price < 30

Page 10:

Query / Query Models
• Unstructured
• Fully-structured
  – SQL: select, from, where
  – SPARQL: BGP, filter, optional, union, select, construct, ask, describe
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX ns: <http://example.org/ns#>
    SELECT ?title ?price
    WHERE {
      { ?x dc:title ?title .
        OPTIONAL { ?x ns:price ?price . FILTER (?price < 30) } }
      UNION
      { ?book dc:title ?title .
        ?book dc:creator ?author }
    }

Page 11:

Query / Query Models
• Unstructured
• Fully-structured
  – SQL
  – SPARQL
  – Conjunctive queries, e.g., graph patterns (BGP)

Page 12:

Query / Query Models
• Fully-structured
• Unstructured
• Hybrid: content and structure constraints

Page 14:

Query Processing
• Matching queries against data

Page 15:

Approaches – Taxonomy (1)

[Figure: Query and Data, connected by Matching]

• Complete
• Sound

• Approximate
• Not complete
• Not sound

• Ranked
• Best effort
• Top-k

Query processing focuses on efficiency whereas ranking deals with result quality!

Page 16:

Approaches – Taxonomy (2)

Axes: Unstructured Query vs. Structured Query; Textual Data vs. Structured Data
• Keyword query on textual data (standard IR)
• Keyword query on structured data
• Structured query on textual data
• Structured query on structured data (standard DB)
• Hybrid query (XML IR)

Page 17:

Keyword Query / Textual Data
• Retrieve documents
• Inverted list (inverted index)
  keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
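
A minimal sketch of an inverted index with AND-semantics, written in Python. The document collection and the scoring rule (summed term frequency) are assumptions for illustration. The sketch intersects complete posting lists and then ranks, so it is a simplification of the early-terminating top-k join named on the slide.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions]} (the posting list)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def and_query(index, keywords):
    """Return documents containing all keywords, best score first."""
    keywords = [k.lower() for k in keywords]
    postings = [set(index.get(k, {})) for k in keywords]
    if not postings:
        return []
    matches = set.intersection(*postings)  # AND-semantics: join on doc id
    # Placeholder score: total number of keyword occurrences in the document.
    scored = [(sum(len(index[k][d]) for k in keywords), d) for d in matches]
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

# Hypothetical toy collection.
docs = {
    "doc1": "book price 30",
    "doc2": "book review",
    "doc3": "price of a book book",
}
index = build_inverted_index(docs)
result = and_query(index, ["book", "price"])
print(result)
```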

Page 18:

Structured Query / Structured Data
• Retrieve data for triple patterns
  – Index on tables
  – Multiple “redundant” indexes to cover different access patterns
• Join (conjunction of triples)
  – Blocking, e.g. linear merge join (requires sorted input)
  – Non-blocking, e.g. symmetric hash join
  – Materialized join indexes

[Figure: join of results retrieved from an SP-index and a PO-index]
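
The combination of redundant triple indexes and a blocking merge join can be sketched as follows. This is an illustrative Python sketch only: the triples and the two-pattern query are hypothetical, and the indexes are plain dictionaries rather than the on-disk structures a triple store would use.

```python
from collections import defaultdict

def build_indexes(triples):
    """Build two redundant indexes: (s, p) -> [o] and (p, o) -> [s]."""
    sp, po = defaultdict(list), defaultdict(list)
    for s, p, o in set(triples):  # deduplicate so posting lists hold distinct values
        sp[(s, p)].append(o)
        po[(p, o)].append(s)
    for idx in (sp, po):
        for values in idx.values():
            values.sort()  # merge join requires sorted input
    return sp, po

def merge_join(left, right):
    """Linear merge join of two sorted lists of distinct values."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical data; query: ?x worksAt uni1 . ?x prizes prize1
triples = [
    ("per1", "worksAt", "uni1"),
    ("per2", "worksAt", "uni1"),
    ("per2", "prizes", "prize1"),
    ("per3", "prizes", "prize1"),
]
sp, po = build_indexes(triples)
answers = merge_join(po[("worksAt", "uni1")], po[("prizes", "prize1")])
print(answers)
```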

Page 19:

Keyword Query / Structured Data
• Retrieve keyword elements
  – Using inverted index
    keyword → {<el1, score, ...>, <el2, score, ...>, …}
• Exploration / “Join”
  – Data indexes for triple lookup
  – Materialized index (paths up to graphs)
  – Top-k Steiner tree search, top-k subgraph exploration

Page 20:

References

• Günter Ladwig, Thanh Tran: Combining Query Translation with Query Answering for Efficient Keyword Search. ESWC 2010:288-303

• Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. ICDE 2009:405-416

• Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. SIGMOD 2008:903-914

• Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer: Ontology-Based Interpretation of Keywords for Semantic Search. ISWC/ASWC 2007:523-536

• Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD 2007:305-316

• Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar: Bidirectional Expansion For Keyword Search on Graph Databases. VLDB 2005:505-516

Page 21:

Structured Query / Textual Data
• Based on offline IE (for offline IE, see Peter’s slides)
• Based on online IE, i.e., “retrieve” works as follows
  – Derive keywords to retrieve relevant documents
  – On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
  – Retrieve extracted data for structured part
  – Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
• Index
  – Inverted index for document retrieval and pattern matching
  – Join index: inverted index for storing materialized joins between keywords
  – Neighborhood indexes for phrase patterns

Hybrid case

Page 22:

References

• Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni: Structured Querying of Web Text Data: A Technical Challenge. CIDR 2007:225-234

• S. Chakrabarti, K. Puniyani, and S. Das. Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In WWW, pages 717–726, 2006.

• S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti: Scalable ad-hoc entity extraction from text collections. PVLDB, 1(1), 2008.

• M. J. Cafarella. Extracting and querying a comprehensive web database. In CIDR, 2009.

• G. Ramakrishnan, S. Balakrishnan, and S. Joshi. Entity annotation using inverse index operations. In EMNLP, 2006.

• M. Cafarella and O. Etzioni. A search engine for natural language applications. In WWW, 2006.

Page 23:

Query Processing – Main Tasks
• Retrieval
  – Documents, data elements, triples, paths, graphs
  – Inverted index, …, but also other indexes (B+ tree)
  – Index documents, triples, materialized join paths
• Join
  – Different join implementations; efficiency depends on availability of indexes
  – Non-blocking join is good for early result reporting and for the “unpredictable” linked data scenario

[Figure: Query and Data, connected by Matching]

Page 24:

Query Processing – More Tasks
• Disjunction, aggregation, grouping
• Join order optimization
• Approximate
  – Approximate the search space
  – Retrieve only some results
  – Approximate the join
• Parallelization
• Top-k
  – Use only some entries in the input streams to produce k results
• Multiple sources
  – On-the-fly mapping, similarity join
  – Federation, routing
• Hybrid
  – Join text and data

[Figure: Query and Data, connected by Matching]

Page 25:

Query Processing on the Web: Research Challenges and Opportunities

• Large amount of semantic data
• Data inconsistent, redundant, and low quality
• Large amount of data embedded in text
• Large amount of sources
• Large amount of links between sources

• Optimization, parallelization
• Approximation
• Hybrid querying and data management
• Federation, routing
• Online schema mappings
• Similarity join

Page 26:

Approaches

Axes: Unstructured Query vs. Structured Query; Textual Data vs. Structured Data
• Keyword query on textual data (standard IR)
• Keyword query on structured data (IR-DB)
• Structured query on textual data (DB-IR)
• Structured query on structured data (standard DB)

Annotations: Routing, Approximation, Adaptive Optimization; Search Space Approximation

Page 27:

Keyword Query Processing on Graph-Structured RDF Data

Page 28:

Keyword Search in DBs / Keyword Translation (Kacholia et al., VLDB05)

[Figure: user information need „stanford article turing award“, Translation, Specification; formula label: $D, Q, F, R(q_i, d_j)$]

• Keywords might produce a large number of matching elements in the data graphs
• The data graphs might be large in size
• Search complexity increases substantially with the size of the data graphs
• Large number of results

Page 29:

Query Space (Tran et al., ICDE 2009)

[Figure: schema graph derived from data graph; query space = connecting keyword elements with schema elements]

• Main idea
  – Query space: more compact representation of the data graph
• Online construction of query space out of schema graph
  – Match keywords against labels of resources to find keyword elements
  – Connect keyword elements with elements of schema graph to obtain query space
• Online top-k query graph exploration

Exploration on a much reduced summary model called the query space
Substantially decreases complexity
Top-k procedure for graph exploration to compute only the top-k most relevant results

Page 30:

Top-k Query Graph Exploration on Query Space

[Figure: query space, three paths from keyword matching elements, and costs of elements]

• Cost-directed exploration of minimal Steiner graphs
  – Explore all possible distinct paths starting from keyword elements
  – At each exploration step, take the current path with the lowest cost
  – When a connecting element is found, merge paths to obtain a candidate
• Top-k terminates when
  – the highest cost in the candidate list (the cost of the k-ranked query graph) < the lowest possible cost that can be achieved with paths in the queues
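
The exploration steps listed on this slide can be sketched in Python as below. This is a simplified illustration, not the algorithm from the cited paper: it uses a single priority queue, keeps only the cheapest path per keyword at each element, and defines the cost of a candidate as the sum of the costs of its distinct elements. The graph, the element costs and the function name `top_k_exploration` are hypothetical.

```python
import heapq
from itertools import count

def top_k_exploration(graph, cost, keyword_elements, k):
    """graph: node -> neighbours (undirected); cost: node -> positive cost;
    keyword_elements: one collection of start nodes per keyword."""
    tie = count()  # tie-breaker so the heap never has to compare paths
    queue = []     # entries: (path cost, tie, keyword index, path)
    for kw, starts in enumerate(keyword_elements):
        for node in starts:
            heapq.heappush(queue, (cost[node], next(tie), kw, (node,)))
    reached = {}     # node -> {keyword index: cheapest path ending at node}
    candidates = {}  # frozenset of nodes -> cost of that candidate graph

    while queue:
        path_cost, _, kw, path = heapq.heappop(queue)
        ranked = sorted(candidates.values())
        # Stop once the k-th candidate is cheaper than every path still queued.
        if len(ranked) >= k and ranked[k - 1] < path_cost:
            break
        node = path[-1]
        paths_here = reached.setdefault(node, {})
        if kw in paths_here:
            continue  # a cheaper path from this keyword arrived earlier
        paths_here[kw] = path
        if len(paths_here) == len(keyword_elements):
            # Connecting element: merge one path per keyword into a candidate.
            nodes = frozenset(n for p in paths_here.values() for n in p)
            candidates[nodes] = sum(cost[n] for n in nodes)
        for neighbour in graph.get(node, ()):
            if neighbour not in path:  # explore distinct, cycle-free paths
                heapq.heappush(queue, (path_cost + cost[neighbour], next(tie),
                                       kw, path + (neighbour,)))
    return sorted((c, sorted(nodes)) for nodes, c in candidates.items())[:k]

# Hypothetical query space: a and b are keyword elements, x and y connect them.
graph = {"a": ["x", "y"], "b": ["x", "y"], "x": ["a", "b"], "y": ["a", "b"]}
cost = {"a": 1, "b": 1, "x": 1, "y": 5}
top2 = top_k_exploration(graph, cost, [["a"], ["b"]], k=2)
print(top2)
```

Because paths are popped in order of increasing cost and every candidate costs at least as much as any single path it contains, the cost of the path at the head of the queue is a valid lower bound for all candidates not yet found.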

Page 31:

Structured Query Processing on Graph-Structured RDF Data

Page 32:

Query Processing
• Structured query: conjunctive queries
  – Answering conjunctive queries on graph-structured data amounts to the task of graph-pattern matching

A solution for determining matches requires exponential time
Search complexity increases substantially with the size of the graph
The size of the graph is very large on the Web of linked data

Page 33:

Answer Space (Tran et al., SemData@VLDB 2010)

[Figure: an extended example of the data graph; the resulting answer space]

• Construction of answer space is based on bisimulation
• Answer space
  – Comprises classes (extensions) and relations between them
  – Resources in an extension exhibit the same structure, i.e., have the same (incoming and outgoing) paths
  – Is a structural description that is finer-grained than a schema

Summary model for general data graphs
Structure-based data partitioning to store data that share structures
Structure-aware processing to filter candidates and prune queries using a smaller answer space
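
Grouping resources that have the same incoming and outgoing paths can be sketched with iterative partition refinement, a standard way to compute a bisimulation. The Python sketch below is an assumption-laden illustration (it is not taken from the cited paper): it starts with all nodes in one block and splits blocks by the labels and target blocks of their incoming and outgoing edges until nothing changes. The triples are hypothetical.

```python
def bisimulation_partition(triples):
    """Return node -> block id; nodes in a block share incoming/outgoing structure."""
    nodes = sorted({s for s, _, o in triples} | {o for s, _, o in triples})
    block = {n: 0 for n in nodes}  # start with a single block
    while True:
        signature = {n: set() for n in nodes}
        for s, p, o in triples:
            signature[s].add(("out", p, block[o]))
            signature[o].add(("in", p, block[s]))
        ids, new_block = {}, {}
        for n in nodes:
            # Keeping the old block in the key means blocks are only ever split.
            key = (block[n], frozenset(signature[n]))
            new_block[n] = ids.setdefault(key, len(ids))
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block

# Hypothetical data graph.
triples = [
    ("p1", "worksAt", "u1"),
    ("p2", "worksAt", "u1"),
    ("p3", "author", "pub1"),
]
blocks = bisimulation_partition(triples)
print(blocks)
```

Here p1 and p2 end up in the same extension because both have exactly one outgoing worksAt path to the same kind of node, while p3 is separated.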

Page 34:

Structure-aware Matching Using Answer Space

[Figure: the answer space; an example query]

• Match query against answer space
  – Answer space matches contain elements satisfying the query structure
• Focus on answer space matches to compute final answers
  – Prune query parts containing non-distinguished variables only
  – Match remaining query against data graph (i.e., focus on elements in the answer space matches identified and loaded before)
• Advantages: reduction in IO cost and number of unions & joins

Page 35:

Query Processing on the Linked Data Web

Page 36:

Query Processing on the Web
• Routing
  – Find combinations of sources
• Federation
  – Query parts to sources
  – Combining results from different sources
• Online schema mappings
• Similarity join

Page 37:

Linked Data

• 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
• As of 09-2010 + other linked data not covered by LOD cloud

[Figure labels: More Data, More Links]

Page 38:

Challenges

“Articles from awarded researchers at Stanford”

$(z).\ \exists x, y.\ \mathit{prizes}(x, \text{Turing Award}) \wedge \mathit{worksAt}(x, y) \wedge \mathit{name}(y, \text{Stanford}) \wedge \mathit{publication}(x, z)$

USABILITY: Formulating queries is a hard task!
• Which data sources?
• Which schema elements?

SCALABILITY: Processing queries is expensive!
• Process against all data sources?
• Large number of unknown, unprocessed & irrelevant sources!
  – What is in there?
  – What is out there?
  – What is relevant?

Page 39:

Searching Linked Data

• Given the needs (expressed as sets of keywords),
  – are there answers in processed linked data?
  – what combinations of data sources produce them?
  – how to incorporate related unprocessed linked sources?

[Figure: workflow with the steps “Identify valid combination of sources”, “Identify schema elements”, “Let user choose combination of sources” and “Focus on this combination of sources and related linked sources”; the stages are labelled Keyword Query Routing and Linked Data Query Processing]

Page 40:

Keyword Query Routing (Tran et al., ISWC 2010)

Page 41:

Keyword Query Routing
• Linked data (schema and data are linked)
• Routing based on keywords
  – Find combinations of sources

Page 42:

LOD Data Graph

[Figure: example LOD data graph over the sources DBLP, Freebase and DBpedia. Nodes: per1, per2, per3, per4, uni1, prize1, prize2, pub1, pub2, pub3. Literals: Stanford University, John McCarthy (also spelled John Mccarthy), John Smith, Turing Award, Music Award, “…John.”. Edge labels: name, label, author, employ, prizes, title, sameAs]

• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

Page 43:

LOD Schema Graph

[Figure: schema graph over the sources DBLP, Freebase and DBpedia. Classes: Author, University, Person (twice), Prize, Written Work, Article. Edge labels: author, employ, prizes, sameAs]

• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

Page 44:

LOD Source Graph

[Figure: source graph with the sources DBLP, Freebase and DBpedia, linked by sameAs and author edges]

• Web data modeled as a set of interlinked data graphs
• Each data graph represents a source
• Data graph vs. schema graph vs. source graph

Page 45:

Keyword Query Answers

[Figure: the example LOD data graph (sources DBLP, Freebase, DBpedia; nodes per1–per4, uni1, prize1, prize2, pub1–pub3), extended with an Article node and a type edge, shown together with the user information need „stanford article award“; formula label: $D, Q, F, R(q_i, d_j)$]

Page 46:

Problem Definition

• A keyword query result (also called a Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called a keyword element), and these elements are pairwise connected over a path.
• A d-max Steiner graph is a Steiner graph in which the paths between keyword elements have length d-max or less.
• Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources can be combined to produce non-empty keyword query results.

Page 47:

A Valid Keyword Routing Plan

[Figure: the example LOD data graph (sources DBLP, Freebase, DBpedia; nodes per1–per4, uni1, prize1, prize2, pub1–pub3, Article with a type edge), shown together with the user information need „stanford article award“; formula label: $D, Q, F, R(q_i, d_j)$]

Page 48:

The Search Space
• Multi-level inter-relationship graphs capture the entire search space
  – Relationships between elements
  – and between different levels

A solution: apply existing approaches to keyword search for computing Steiner graphs
Steiner graphs might span several linked sources
Search space grows exponentially with the number of sources and their associated links
Search space is too large!

Page 49:

Keyword Sets

[Figure: the example LOD data graph (sources DBLP, Freebase, DBpedia) with the keyword sets derived from it; keywords shown: Stanford, University, John, McCarthy, Turing, Award, Smith, Music]

• One keyword set for every data source
• Elements stand for distinct keywords mentioned in a source

Page 50:

Element-level Keyword-Element Relationship Graph (E-KERG)

[Figure: the example LOD data graph with its keyword sets, and the resulting E-KERG over the data elements uni1, per1, per2, per3, per4, prize1, prize2 and pub4, with keywords such as John and Award]

• A keyword-element captures a keyword k and the data element mentioning k
• A relationship between two keyword-elements exists iff there is a path between their associated data elements
• In a d-max KERG, the paths to be considered have length d-max or less
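
The definition of a d-max E-KERG can be illustrated with a breadth-first search that is cut off at depth d-max. The Python sketch below is only an illustration under stated assumptions: the data graph is treated as undirected, the tiny graph and the keyword mentions are hypothetical, and the names `within_d_max` and `build_e_kerg` are made up for this example.

```python
from collections import deque
from itertools import combinations

def within_d_max(graph, start, d_max):
    """Elements reachable from start over paths of length <= d_max (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return set(dist)

def build_e_kerg(graph, mentions, d_max):
    """mentions: data element -> set of keywords it mentions.
    Returns relationships as frozensets of two (keyword, element) pairs."""
    keyword_elements = [(k, e) for e, ks in mentions.items() for k in sorted(ks)]
    reach = {e: within_d_max(graph, e, d_max) for e in mentions}
    edges = set()
    for a, b in combinations(keyword_elements, 2):
        if b[1] in reach[a[1]]:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical undirected data graph: uni1 - per1 - pub1
graph = {"uni1": ["per1"], "per1": ["uni1", "pub1"], "pub1": ["per1"]}
mentions = {"uni1": {"stanford"}, "pub1": {"article"}}
kerg_1 = build_e_kerg(graph, mentions, d_max=1)
kerg_2 = build_e_kerg(graph, mentions, d_max=2)
print(kerg_1, kerg_2)
```

With d-max = 1 the two keyword-elements are unrelated, because the only path between uni1 and pub1 has length 2; with d-max = 2 the relationship appears.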

Page 51:

Schema-level Keyword-Element Relationship Graph (S-KERG)

[Figure: the example LOD data graph with its keyword sets and E-KERG, and the resulting S-KERG over the classes University, Person, Author, Article, Person and Prize]

• A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
• Groups elements (relationships) when they capture the same keyword (relationships between the same classes)

Page 52:

Data-Source-level Keyword-Element Relationship Graph (D-KERG)

[Figure: the example LOD data graph (sources DBLP, Freebase, DBpedia) with its keyword sets, E-KERG and schema-level classes, from which the D-KERG is derived]

• A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k
• A relationship between two keyword-elements exists if there is a path between some instances of their associated sources
• Groups elements (relationships) when they capture the same keyword (relationships between the same sources)

Page 53:

Routing Plan Computation

• Keyword sets
  – Retrieve elements for every keyword k in K
  – Retrieve associated sources and put them into S_K
  – Compute all |K|-combinations of S_K (KRPs)
• KERG models
  – Compute all 2-combinations of K to get all keyword pairs
  – Retrieve matching KERG relationships for each pair and join them to produce matching subgraphs (KRPs)
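
The KERG-based computation can be sketched in Python as follows, here for a source-level KERG. This is an illustrative sketch rather than the implementation from the cited paper: relationships are given as pairs of (keyword, source) tuples, the source names A to D are hypothetical, and the join is a brute-force product that keeps only combinations in which every keyword is assigned to one source.

```python
from itertools import combinations, product

def routing_plans(keywords, relationships):
    """relationships: iterable of ((kw1, source1), (kw2, source2)) KERG edges.
    Returns the set of keyword routing plans, each a frozenset of sources."""
    by_pair = {}
    for (k1, s1), (k2, s2) in relationships:
        by_pair.setdefault(frozenset((k1, k2)), []).append({k1: s1, k2: s2})
    # All 2-combinations of the keywords, with their matching relationships.
    per_pair = [by_pair.get(frozenset(pair), [])
                for pair in combinations(sorted(keywords), 2)]
    plans = set()
    for choice in product(*per_pair):  # one relationship per keyword pair
        assignment, consistent = {}, True
        for relationship in choice:
            for keyword, source in relationship.items():
                if assignment.setdefault(keyword, source) != source:
                    consistent = False  # keyword bound to two different sources
        if consistent:
            plans.add(frozenset(assignment.values()))
    return plans

# Hypothetical source-level relationships for three keywords.
relationships = [
    (("stanford", "A"), ("award", "B")),
    (("award", "B"), ("article", "C")),
    (("stanford", "A"), ("article", "C")),
    (("stanford", "D"), ("award", "B")),
]
plans = routing_plans({"stanford", "award", "article"}, relationships)
print(plans)
```

Source D is dropped because no relationship connects its "stanford" keyword-element to "article", so the pairwise connectivity required of a keyword query result cannot be satisfied with it.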

Page 54:

Mixed, Corrective and Stream-based Linked Data Query Processing

Page 55:

Structured Query Processing on the Web
• Linked data (schema and data are linked)
• Federation
  – Query parts to sources
  – Combining results from different sources
• Exploration
• Mixed

Page 56:

Top-down Query Evaluation (Harth et al., WWW 2010)

• Local index of sources, assumed to be complete
  – Used for source selection
  – Maps triple and join patterns to source URIs
• Statistics for ranking of sources and query optimization
  – Performed once at compile-time
  – Only a fixed number of top-ranked sources is considered
• No run-time discovery
• Fast, only relevant sources are retrieved
• Not up-to-date
• Index size may become very large
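
Source selection against a local index can be sketched as a lookup followed by a ranking. The Python sketch below uses a plain dictionary as a stand-in for the source index; it is not the data structure of the cited paper. The pattern encoding (None for a variable), the cardinality statistics, the scoring rule and the example.org URIs are all assumptions for illustration.

```python
def select_sources(source_index, triple_patterns, top_n):
    """source_index: triple pattern -> {source URI: estimated cardinality}.
    Returns the top_n sources ranked by summed cardinality."""
    scores = {}
    for pattern in triple_patterns:
        for source, cardinality in source_index.get(pattern, {}).items():
            scores[source] = scores.get(source, 0) + cardinality
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_n]  # only a fixed number of top-ranked sources is kept

# Hypothetical index; None stands for a variable position in a pattern.
source_index = {
    (None, "worksAt", "dbpedia:KIT"): {"http://example.org/a": 10,
                                       "http://example.org/b": 2},
    (None, "knows", None): {"http://example.org/b": 3,
                            "http://example.org/c": 1},
}
patterns = [(None, "worksAt", "dbpedia:KIT"), (None, "knows", None)]
selected = select_sources(source_index, patterns, top_n=2)
print(selected)
```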

Page 57:

Bottom-up Query Evaluation (Hartig et al., ISWC 2009)

• Sources discovered at run-time through links from other, already retrieved sources
• No local index of sources
• Slower, as unnecessary sources are retrieved
• Always up-to-date
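
Run-time discovery through links can be sketched as a breadth-first traversal. In the Python sketch below, an in-memory dictionary stands in for dereferencing URIs over HTTP, and every link is followed regardless of the query; both are simplifying assumptions, and the second one is exactly why unnecessary sources get retrieved. All URIs and names are hypothetical.

```python
from collections import deque

def link_traversal(web, seed_uris, limit=100):
    """web: URI -> list of (s, p, o) triples served at that URI.
    Returns the retrieved sources in order and all collected triples."""
    retrieved, data = [], []
    seen = set(seed_uris)
    queue = deque(seed_uris)
    while queue and len(retrieved) < limit:
        uri = queue.popleft()
        retrieved.append(uri)  # "dereference" the source
        for s, p, o in web.get(uri, []):
            data.append((s, p, o))
            for term in (s, o):
                # A term is treated as a link if it names another source.
                if term in web and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return retrieved, data

# Hypothetical web of four sources; D links to A but nothing links to D.
web = {
    "A": [("A", "knows", "B")],
    "B": [("B", "knows", "C")],
    "C": [],
    "D": [("D", "knows", "A")],
}
retrieved, data = link_traversal(web, ["A"])
print(retrieved, len(data))
```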

Page 58:

Mixed Strategy (Ladwig et al., ISWC 2010)

• Combination of top-down and bottom-up strategies
  – Partial local index of sources, not assumed to be complete
  – New sources are discovered at run-time
• Addresses volume and dynamics of Linked Data
• Corrective Source Ranking
  – Deal with heterogeneous source descriptions
• Stream-based Query Processing
  – Deal with unpredictable nature of Linked Data access

Page 59:

Stream-based Query Processing

• Compile-time
  – Construct query plan
  – Probe local index for sources
• Network latency
  – Do not block!
  – Evaluation driven by incoming data
• Run-time
  – Retrieve sources
  – Push data into query plan
  – Discover new sources
  – Rank sources

[Figure: Query Plan with two Join operators over the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n), producing Results. Source Retrieval components: Source Retriever 1, Source Retriever 2, ..., Source Ranker with ranked sources (Source 1 (score: 1.0), Source 2 (score: 0.7), ...), Local source index, Linked Data. Arrow labels: Push, Retrieve source, Source discovered, Samples]

Page 60:

Push-based Symmetric Hash Join

• Operation
  – Maintains a hash table for each input
  – Tuples are inserted into one hash table and then the other is probed for join combinations
• Results reported as soon as input tuples arrive
• Tuples can arrive on all inputs in any order
• Push-based
  – Tuples are pushed into operators from the leaves to the root of the query plan
  – Execution driven by incoming tuples instead of results

Example: t7(b) is pushed on the left input

Left input (before)    Right input
Key  T                 Key  T
a    t1, t3            b    t4, t5
b    t2                c    t6

Insert t7 into the left table, probe the right table, push output: t7t4, t7t5

Left input (after)
Key  T
a    t1, t3
b    t2, t7
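
The operator on this slide can be sketched in Python as below. It is a minimal illustration, assuming tuples are opaque values pushed together with their join key and that results are handed to a callback named `emit` (a hypothetical name). The usage reproduces the slide's example, in which t7 with key b arrives on the left input.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Push-based, non-blocking join with one hash table per input."""

    def __init__(self, emit):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}
        self.emit = emit  # callback that receives each (left, right) result

    def push(self, side, key, tup):
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(tup)              # insert into own table
        for match in self.tables[other].get(key, []):   # probe the other table
            self.emit((tup, match) if side == "left" else (match, tup))

results = []
join = SymmetricHashJoin(results.append)

# State before the example: left {a: t1, t3; b: t2}, right {b: t4, t5; c: t6}.
for key, tup in [("a", "t1"), ("b", "t2"), ("a", "t3")]:
    join.push("left", key, tup)
for key, tup in [("b", "t4"), ("b", "t5"), ("c", "t6")]:
    join.push("right", key, tup)

results.clear()             # ignore results produced while loading the state
join.push("left", "b", "t7")
print(results)
```

Because each arriving tuple is inserted and probed immediately, results are reported as soon as both matching tuples have arrived, in whatever order the inputs deliver them.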

Page 61:

Corrective Source Ranking
• Prefer more relevant sources
• Relevancy of a source is based on
  – Current query
  – Any available intermediate results
  – Overall optimization goal
• Define a set of source features and derive concrete source metrics
  – Not all metrics are available for all sources (heterogeneity)
• Refine previously computed metrics using newly discovered information

Page 62:

Source Features and Metrics
– Source is more relevant if it contains data that contributes to answers of the query
– Triple Pattern Cardinality
– Join Pattern Cardinality

Page 63:

Metric Correction and Refinement
• During query processing new information becomes available: intermediate join results, links
  – Refine and correct previously computed metrics
  – Important in the case of non-discriminative patterns
• Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates
• Example

[Figure: sample intermediate results in the SHJ operator; perform triple pattern cardinality lookups]

Page 64:

Conclusions
• Query processing: which kinds of data, queries?
  – Focus: textual & structured queries and semantic data
• Web of linked data creates opportunities and challenges
  – Optimization
  – Approximation
  – Routing
  – Top-k… and ranking
• Web is linked data + a large amount of text
  – Hybrid management & integrated search

Page 65:

References

• Thanh Tran, Günter Ladwig: Structure Index for RDF. SemData@VLDB 2010

• Thanh Tran, Lei Zhang, Rudi Studer: Routing Keywords to Linked Data Sources. ISWC 2010

• Günter Ladwig, Thanh Tran: Linked Data Query Processing Strategies. ISWC 2010

• Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, Jürgen Umbrich: Data summaries for on-demand queries over linked data. WWW 2010:411-420

• Olaf Hartig, Christian Bizer, Johann Christoph Freytag: Executing SPARQL Queries over the Web of Linked Data. ISWC 2009:293-309