Searching Linked Data


Page 1: Searching Linked Data

Thanh Tran, AIFB Institute, KIT, [email protected]

KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

Searching Linked Data

From Finding Relevant Sources to Computing Answers

Invited Presentation @ International Workshop on Scalable Semantic Computing, Hangzhou, China, November 2010.

Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao Ma, Andreas Wagner, Rudi Studer from AIFB Institute, KIT

Page 2: Searching Linked Data

Agenda

- Searching Linked Data
  - Opportunities & challenges
- Keyword Query Routing
  - Problem Definition
  - Summary Models
  - Experiments
- Linked Data Query Processing
  - Combining Top-down & Bottom-up
  - Stream-based Query Processing
  - Corrective Source Ranking
- Conclusions

Page 3: Searching Linked Data

Linked Data

- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
- As of 09-2010, plus other linked data not covered by the LOD cloud

More Data

More Links

Page 4: Searching Linked Data

Opportunities

“Articles from awarded researchers at Stanford”

- Freebase contains data about people
- DBpedia contains information about awards
- DBLP contains bibliographic data

More Data

More Links

- More complex information needs
- More precise results
- More integrated results

Page 5: Searching Linked Data

Problems

“Articles from awarded researchers at Stanford”

$(x, y, z).\ \mathit{prizes}(x, \text{Turing Award}) \wedge \mathit{worksAt}(x, y) \wedge \mathit{name}(y, \text{Stanford}) \wedge \mathit{publication}(x, z)$

Formulating queries is a hard task!
- Which data sources?
- Which schema elements?

Processing queries is expensive!
- Process against all data sources?
- Explore all links to other sources?

Large number of unknown, unexplored & irrelevant sources!
- What is in there?
- What is out there?
- What is relevant?

USABILITY SCALABILITY

Page 6: Searching Linked Data

Searching Linked Data

Given the needs (expressed as sets of keywords):
- Are there answers in linked data?
- What combinations of data sources produce them?
- How to incorporate related, unexplored linked sources?

Identify valid combination of sources

Identify schema elements

Let user choose combination of sources

Focus on this combination of sources and explore related linked sources

Keyword Query Routing to Relevant Linked Data Sources

Focused, Adaptive and Stream-based Linked Data Query Processing (cf. LarKC)

Page 8: Searching Linked Data

LOD Data Graph

[Figure: example LOD data graph over three sources (Freebase, DBLP, DBpedia). Nodes: per1 to per4, uni1, prize1, prize2, pub1 to pub3. Literal values: Stanford University, John McCarthy, John Smith, Turing Award, Music Award. Edge labels: name, label, title, author, employ, prizes, sameAs.]

- Web data is modeled as a set of interlinked data graphs
- Each data graph represents a source
- Data graph vs. schema graph vs. source graph

Page 9: Searching Linked Data

LOD Schema Graph

[Figure: LOD schema graph over the sources Freebase, DBLP and DBpedia. Classes: University, Person, Author, Prize, Written Work, Article. Edge labels: employ, author, prizes, sameAs.]

- Web data is modeled as a set of interlinked data graphs
- Each data graph represents a source
- Data graph vs. schema graph vs. source graph

Page 10: Searching Linked Data

LOD Source Graph

[Figure: LOD source graph. Nodes: the sources Freebase, DBLP and DBpedia. Edge labels: sameAs, sameAs, author.]

- Web data is modeled as a set of interlinked data graphs
- Each data graph represents a source
- Data graph vs. schema graph vs. source graph

Page 11: Searching Linked Data

Keyword Query Answers

), dD,Q,F,R(q ji

User information need: “stanford article award”

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3; edge labels name, label, title, author, employ, prizes, sameAs), here with an additional Article node and a type edge label.]

Page 12: Searching Linked Data

Problem Definition

- A keyword query result (also called a Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called a keyword element), such that these elements are pairwise connected by a path.
- A d-max Steiner graph is a Steiner graph in which the paths connecting keyword elements have length d-max or less.
- Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if the union of its sources produces non-empty keyword query results.
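The definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the toy graph, the assignment of nodes to sources, and the names `NODES`, `EDGES`, `distances` and `is_valid_plan` are hypothetical and not taken from the slides. It checks validity by brute force, which is what the summary models on the following slides are meant to avoid.

```python
from collections import deque
from itertools import product

# Hypothetical toy data graph: node -> (source, keywords mentioned by the node).
NODES = {
    "uni1": ("Freebase", {"stanford", "university"}),
    "per1": ("Freebase", {"john", "mccarthy"}),
    "per2": ("DBLP", {"john", "mccarthy"}),
    "pub1": ("DBLP", {"article"}),
    "per3": ("DBpedia", {"john", "mccarthy"}),
    "prize1": ("DBpedia", {"turing", "award"}),
}
# Undirected edges; per1-per2 and per2-per3 play the role of sameAs links.
EDGES = [("uni1", "per1"), ("per1", "per2"), ("pub1", "per2"),
         ("per2", "per3"), ("per3", "prize1")]


def distances(start, allowed):
    """Breadth-first hop counts from start, visiting only nodes in `allowed`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for a, b in EDGES:
            if node in (a, b):
                other = b if node == a else a
                if other in allowed and other not in dist:
                    dist[other] = dist[node] + 1
                    queue.append(other)
    return dist


def is_valid_plan(sources, keywords, d_max):
    """True if the union of `sources` contains a d-max Steiner graph for `keywords`."""
    allowed = {n for n, (src, _) in NODES.items() if src in sources}
    # Candidate keyword elements for each keyword.
    candidates = [[n for n in allowed if k in NODES[n][1]] for k in keywords]
    if any(not c for c in candidates):
        return False
    dist = {n: distances(n, allowed) for n in allowed}
    # Try every choice of one keyword element per keyword.
    for choice in product(*candidates):
        if all(dist[a].get(b, d_max + 1) <= d_max for a in choice for b in choice):
            return True
    return False


query = ["stanford", "article", "award"]
all_sources = {"Freebase", "DBLP", "DBpedia"}
print(is_valid_plan(all_sources, query, 4))
print(is_valid_plan({"Freebase", "DBLP"}, query, 4))
```

In this toy graph the plan with all three sources is valid for d-max = 4 but not for d-max = 3, because uni1 and prize1 are four hops apart.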

Page 13: Searching Linked Data

A Valid Keyword Routing Plan

), dD,Q,F,R(q ji

User information need: “stanford article award”

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia (nodes per1 to per4, uni1, prize1, prize2, pub1 to pub3; edge labels name, label, title, author, employ, prizes, sameAs), here with an additional Article node and a type edge label.]

Page 15: Searching Linked Data

Keyword Sets

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia, with one keyword set per source. Keywords shown: Stanford, University, John, McCarthy, Turing, Award, Smith, Music.]

- One keyword set for every data source
- Elements stand for distinct keywords mentioned in a source
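A minimal sketch of the keyword-set (KS) summary, assuming each source is available as a list of literal values. The literals per source, the tokenizer, and the names `SOURCE_LITERALS`, `keyword_set` and `sources_mentioning` are hypothetical choices for illustration.

```python
import re

# Hypothetical literals per source, loosely modeled on the running example.
SOURCE_LITERALS = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP": ["John McCarthy", "John Smith"],
    "DBpedia": ["John McCarthy", "Turing Award", "Music Award"],
}


def keyword_set(literals):
    """Distinct lower-cased word tokens mentioned in a source's literals."""
    return {tok for text in literals for tok in re.findall(r"\w+", text.lower())}


# One keyword set for every data source.
KEYWORD_SETS = {src: keyword_set(lits) for src, lits in SOURCE_LITERALS.items()}


def sources_mentioning(keyword):
    """Sources whose keyword set contains the given keyword."""
    return sorted(src for src, ks in KEYWORD_SETS.items() if keyword.lower() in ks)


print(KEYWORD_SETS["Freebase"])
print(sources_mentioning("award"))
```

A keyword set records which keywords a source mentions but nothing about how the matching elements are connected, so it cannot by itself tell whether a combination of sources yields a connected answer.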

Page 16: Searching Linked Data

Element-level Keyword-Element Relationship Graph (E-KERG)

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia with the keywords of each source (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), and the derived E-KERG, whose keyword-elements pair keywords such as John and Award with data elements such as uni1, per1 to per4, prize1, prize2 and pub4.]

- A keyword-element captures a keyword k and the data element mentioning k
- A relationship between two keyword-elements exists iff there is a path between their associated data elements
- In a d-max KERG, the paths to be considered have length d-max or less
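The E-KERG definition on this slide can be sketched as follows. This is an illustration only: the three-node toy graph and the names `ELEMENT_KEYWORDS`, `ADJACENT`, `reachable` and `build_e_kerg` are assumptions, and a real implementation would not enumerate all pairs of keyword-elements.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy data: data element -> keywords it mentions.
ELEMENT_KEYWORDS = {
    "uni1": {"stanford"},
    "per1": {"mccarthy"},
    "prize1": {"award"},
}
# Undirected adjacency between data elements.
ADJACENT = {
    "uni1": ["per1"],
    "per1": ["uni1", "prize1"],
    "prize1": ["per1"],
}


def reachable(start, d_max):
    """Elements at most d_max hops away from start (start included)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the d-max radius
        for nxt in ADJACENT[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def build_e_kerg(d_max):
    """Return (keyword-elements, relationships) of the d-max E-KERG."""
    nodes = [(k, e) for e, kws in ELEMENT_KEYWORDS.items() for k in sorted(kws)]
    near = {e: reachable(e, d_max) for e in ELEMENT_KEYWORDS}
    edges = {frozenset((u, v)) for u, v in combinations(nodes, 2)
             if v[1] in near[u[1]]}
    return nodes, edges


nodes, edges = build_e_kerg(1)
print(len(nodes), len(edges))
```

With d-max = 1 the keyword-elements (stanford, uni1) and (award, prize1) are unrelated; raising d-max to 2 adds that relationship, which illustrates why larger d-max values make the summary larger.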

Page 17: Searching Linked Data

Schema-level Keyword-Element Relationship Graph (S-KERG)

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia with the keywords of each source (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), the element-level keyword-elements (uni1, per1 to per4, prize1, prize2, pub4), and their grouping by class: University, Person, Author, Article, Prize.]

- A keyword-element captures a keyword k and the schema element that has some instances (data elements) mentioning k
- A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements
- Keyword-elements are grouped when they capture the same keyword, and relationships are grouped when they hold between the same classes

Page 18: Searching Linked Data

Data-Source-level Keyword-Element Relationship Graph (D-KERG)

[Figure: the example LOD data graph over Freebase, DBLP and DBpedia with the keywords of each source (Stanford, University, John, McCarthy, Turing, Award, Smith, Music), the element-level keyword-elements (uni1, per1 to per4, prize1, prize2, pub4), and their grouping; the class labels University, Person, Author, Article and Prize also appear.]

- A keyword-element captures a keyword k and the source that contains some data elements mentioning k
- A relationship between two keyword-elements exists if there is a path between some data elements of their associated sources
- Keyword-elements are grouped when they capture the same keyword, and relationships are grouped when they hold between the same sources
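The schema-level and source-level summaries can both be read as the same grouping step applied to an element-level graph, with data elements mapped to their class or to their source. The sketch below assumes exactly that reading; the toy E-KERG, the two mappings and the function name `summarize` are hypothetical.

```python
# Hypothetical E-KERG: keyword-elements are (keyword, data element) pairs.
E_NODES = [("john", "per1"), ("john", "per4"),
           ("award", "prize1"), ("award", "prize2")]
E_EDGES = [(("john", "per1"), ("award", "prize1")),
           (("john", "per4"), ("award", "prize2"))]

# Hypothetical mappings from data elements to their class and to their source.
CLASS_OF = {"per1": "Person", "per4": "Person",
            "prize1": "Prize", "prize2": "Prize"}
SOURCE_OF = {"per1": "Freebase", "per4": "DBpedia",
             "prize1": "DBpedia", "prize2": "DBpedia"}


def summarize(e_nodes, e_edges, group_of):
    """Group keyword-elements by (keyword, group) and merge their relationships."""
    nodes = {(k, group_of[e]) for k, e in e_nodes}
    edges = set()
    for (k1, e1), (k2, e2) in e_edges:
        u, v = (k1, group_of[e1]), (k2, group_of[e2])
        if u != v:  # simplification: drop relationships that collapse onto one node
            edges.add(frozenset((u, v)))
    return nodes, edges


s_nodes, s_edges = summarize(E_NODES, E_EDGES, CLASS_OF)   # schema level
d_nodes, d_edges = summarize(E_NODES, E_EDGES, SOURCE_OF)  # source level
print(len(s_nodes), len(s_edges))
print(len(d_nodes), len(d_edges))
```

Grouping shrinks the graph (here from four keyword-elements to two or three), at the price of relationships that hold only for some instances of a class or source.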

Page 20: Searching Linked Data

Experiments

- Chunk of the BTC (Billion Triple Challenge) dataset containing 10M RDF triples from 154 sources, linked via 500K mappings
- 30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources
- Example query keywords: Town River America Beijing Conference Database 2007

Page 21: Searching Linked Data

Validity

- P@k measures the percentage of plans that are valid out of the top-k plans
- P@5 was only 6% for KS, and up to 100% for E-KERG (dmax = 4)
- More valid plans were computed when a higher value was used for dmax
  - dmax = 3 seems to be a good trade-off
- Queries with a larger number of keywords resulted in lower precision

[Figure: two plots of P@5 (0.0 to 1.0) for E-KERG, D-KERG, S-KERG and KS: one against the number of keywords |K| (2 to 5), one against dmax (0 to 4).]
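The P@k measure used on this slide can be sketched as below. The function name `precision_at_k`, the example plans and the set of valid plans are hypothetical; dividing by the number of plans actually returned when fewer than k are available is an assumption, since the slides do not cover that case.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of the top-k ranked routing plans that are valid."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    # Assumption: normalize by the number of plans returned if fewer than k.
    return sum(1 for plan in top if is_valid(plan)) / len(top)


# Hypothetical ranked plans (each plan is a set of sources) and ground truth.
ranked = [
    frozenset({"Freebase", "DBLP", "DBpedia"}),
    frozenset({"Freebase", "DBLP"}),
    frozenset({"DBLP", "DBpedia"}),
    frozenset({"Freebase"}),
    frozenset({"Freebase", "DBpedia"}),
]
valid = {ranked[0], ranked[2], ranked[4]}

p_at_5 = precision_at_k(ranked, lambda plan: plan in valid, 5)
print(p_at_5)
```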

Page 22: Searching Linked Data

Performance

- Times increased with higher values for dmax
  - Sharply for E-KERG and S-KERG
  - Relatively stable for D-KERG
- Times increased with the number of keywords
  - All models except D-KERG performed poorly on complex queries
  - E-KERG needed more than 100s for queries with more than 2 keywords
  - Time for D-KERG was no more than 10ms on average

[Figure: two plots of query processing time (ms, logarithmic axis from 1 to 1,000,000) for S-KERG, D-KERG, KS and E-KERG: one against dmax (0 to 4), one against the number of keywords |K| (2 to 5).]

Page 24: Searching Linked Data

Mixed Query Processing Strategy

- Combination of top-down and bottom-up strategies
  - Top-down: partial local index of sources, not assumed to be complete
  - Bottom-up: new sources are discovered at run-time
- Corrective Source Ranking
  - Deal with heterogeneous source descriptions
  - Adaptive re-ranking
- Stream-based Query Processing
  - Deal with unpredictable nature of Linked Data access

ISWC 2010, Shanghai, China

Page 26: Searching Linked Data

Stream-based Query Processing

- Compile-time
  - Construct query plan
  - Probe local index for sources
- Network latency
  - Do not block!
  - Evaluation driven by incoming data
- Run-time
  - Retrieve sources
  - Push data into query plan
  - Discover new sources
  - Rank sources

[Figure: architecture diagram. Query Plan: two Join operators over the triple patterns worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n), producing Results. Source Retrieval: Source Retriever 1, Source Retriever 2, ... Source Ranker with a ranked list (Source 1, score 1.0; Source 2, score 0.7; ...). Further components: Local source index, Linked Data. Arrow labels: Push, Retrieve source, Source discovered, Samples.]
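The "do not block, evaluation driven by incoming data" idea can be illustrated with a push-based join over the three triple patterns shown on this slide. The slides do not name the join operator; the symmetric hash join used here is one common non-blocking choice and is an assumption, as are the class name `SymmetricHashJoin`, the helper `push_triple` and the example triples.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Non-blocking join: each arriving binding is stored and probed immediately."""

    def __init__(self, join_var, emit):
        self.join_var = join_var
        self.emit = emit  # callback receiving each joined binding
        self.tables = (defaultdict(list), defaultdict(list))

    def push(self, side, binding):
        key = binding[self.join_var]
        self.tables[side][key].append(binding)
        # Probe the other input's table; emit matches right away.
        for other in self.tables[1 - side].get(key, []):
            self.emit({**other, **binding})


results = []
# Plan: (worksAt(?x, dbpedia:KIT) JOIN knows(?x, ?y)) JOIN name(?y, ?n)
join2 = SymmetricHashJoin("y", results.append)
join1 = SymmetricHashJoin("x", lambda b: join2.push(0, b))


def push_triple(s, p, o):
    """Route a retrieved triple to the operator input whose pattern it matches."""
    if p == "worksAt" and o == "dbpedia:KIT":
        join1.push(0, {"x": s})
    elif p == "knows":
        join1.push(1, {"x": s, "y": o})
    elif p == "name":
        join2.push(1, {"y": s, "n": o})


# Triples may arrive in any order, e.g. as different sources finish loading.
push_triple("ex:bob", "name", "Bob")
push_triple("ex:alice", "knows", "ex:bob")
push_triple("ex:alice", "worksAt", "dbpedia:KIT")
print(results)
```

Because every input is hashed and probed as it arrives, a slow source delays only the results that depend on it, not the whole plan.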

Page 28: Searching Linked Data

Corrective Source Ranking

- Prefer more relevant sources
- Relevancy of a source is based on
  - Current query
  - Any available intermediate results
  - Overall optimization goal
- Define a set of source features and derive concrete source metrics
  - Not all metrics are available for all sources (heterogeneity)
- Refine previously computed metrics using newly discovered information (intermediate results, samples)
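A rough sketch of how such a ranker might cope with missing metrics and re-rank as new information arrives. Nothing here is taken from the slides beyond the general idea: the metric names, the weights in `WEIGHTS`, the weighted-average scoring and the `SourceRanker` class are all hypothetical.

```python
# Hypothetical metrics and weights; the slides do not list concrete ones.
WEIGHTS = {
    "triple_pattern_matches": 0.5,
    "links_from_results": 0.3,
    "sample_hits": 0.2,
}


def score(metrics):
    """Weighted average over the metrics that are available for a source."""
    available = {m: v for m, v in metrics.items() if m in WEIGHTS and v is not None}
    if not available:
        return 0.0
    total_weight = sum(WEIGHTS[m] for m in available)
    return sum(WEIGHTS[m] * v for m, v in available.items()) / total_weight


class SourceRanker:
    def __init__(self):
        self.metrics = {}

    def observe(self, source, **updates):
        """Add or refine metrics for a source (e.g. from intermediate results)."""
        self.metrics.setdefault(source, {}).update(updates)

    def ranking(self):
        """Sources ordered by their current score, best first."""
        return sorted(self.metrics, key=lambda s: score(self.metrics[s]), reverse=True)


ranker = SourceRanker()
ranker.observe("src1", triple_pattern_matches=0.9)
ranker.observe("src2", triple_pattern_matches=0.4)
first = ranker.ranking()

# New information arrives at run-time and corrects the earlier ranking.
ranker.observe("src1", sample_hits=0.0)
ranker.observe("src2", links_from_results=1.0, sample_hits=1.0)
second = ranker.ranking()
print(first, second)
```

Normalizing by the weights of the available metrics keeps sources with sparse descriptions comparable to richly described ones, which is one simple way to handle the heterogeneity mentioned above.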

Page 29: Searching Linked Data

Evaluation

- Three systems: top-down (TD), bottom-up (BU), mixed (MI)
- 8 queries over various datasets (DBpedia, Geonames, NYT)
- To make the approaches comparable, sources were restricted to those discoverable by the BU approach
  - ~6200 sources, containing ~500k triples
- Sources hosted on local proxy server with artificial delay of 2 seconds
- 25% of sources were randomly chosen to construct index for MI

Page 30: Searching Linked Data

Results

Overall early result reporting
- 25% results: MI 8.7s, BU 15.1s
- 50% results: MI 12.8s, BU 22.0s
- Improvement of ~42%

Detailed results for two queries:

                  Query 1                          Query 6
                  BU        MI        TD           BU        MI        TD
25% Results       24810.5   10300.0   11038.0      8222.5    4743.5    5545.0
50% Results       43464.5   40782.0   15787.0      10961.5   7650.5    5634.0
Total             84066.5   86895.5   44323.5      24086.0   20711.0   16469.0
Src. Selection    0.0       853.0     1444.5       0.0       1331.0    1863.5
Ranking           25.5      2404.0    411.5        23.5      292.5     335.0
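As a quick check of the ~42% figure, the sketch below assumes it is the relative reduction in time of MI compared with BU; the function name `improvement` is hypothetical.

```python
def improvement(baseline_s, improved_s):
    """Relative time reduction of the improved system versus the baseline."""
    return (baseline_s - improved_s) / baseline_s


at_25 = improvement(15.1, 8.7)   # time to 25% of results, BU vs. MI
at_50 = improvement(22.0, 12.8)  # time to 50% of results, BU vs. MI
print(f"{at_25:.1%} {at_50:.1%}")
```

Both values come out close to 42%, consistent with the reported improvement under this reading.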

Page 31: Searching Linked Data

Result Arrival Times

Page 33: Searching Linked Data

Conclusions

- Keyword query routing
  - Helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs
  - Focus on relevant combinations
  - Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid
- Stream-based query processing helps to deal with the unpredictable nature of Linked Data
- A corrective, mixed strategy that incorporates new sources and knowledge at run-time for optimization (source ranking) helped to report early results 42% faster on average

Page 34: Searching Linked Data

Thanks for Your Attention!

Institute AIFB, KIT

[email protected]
