Graphinder Semantic Search Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com
Agenda
• Introduction
• Graphinder: Overview
• Keyword Query Translation
• Keyword Query Result Ranking
• Keyword Query Rewriting
– Suggesting correct and meaningful queries
– Auto-complete as user types
INTRODUCTION
Motivation: lots of structured data
Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces
<x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, type, Artist> <Freddie Mercury, member, Queen> <Queen, type, Band>
<x, type, Single> <x, writtenBy, Freddy>
MusicBrainz
DBpedia
<Freddy, same-as, Freddy Mercury> Links
“single written by freddie queen”
“singles written by freddie, who is member of the band queen”
[Figure: example data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer]
Entity Semantic Search: find relevant entity, return structured data summary, facts, related entities
Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries…
Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers
“single written by freddie queen”
“singles written by freddie, who is member of the band queen”
Query Translation: What are possible connections (schema-level) between recognized entities and relationships?
1) <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, member, Queen>
2) …
Query Answering: What are actual connections (data-level) between recognized entities and relationships?
1) <Liar Liar, type, Single> <Freddie Mercury, writer, Liar Liar> <Freddie Mercury, member, Queen>
2) …
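The difference between a translated query (schema-level, with a variable x) and its answers (data-level bindings for x) can be illustrated with a minimal sketch. The triple store, the `?x` variable syntax and the function names below are assumptions made for illustration; they are not Graphinder's API.

```python
# Toy triple store; the facts mirror the example triples on the slide.
TRIPLES = [
    ("Liar Liar", "type", "Single"),
    ("Freddie Mercury", "writer", "Liar Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
]


def is_var(term):
    """Variables are written with a leading '?' (an assumption of this sketch)."""
    return term.startswith("?")


def answer(patterns, triples, binding=None):
    """Return every variable binding under which all triple patterns match."""
    binding = binding or {}
    if not patterns:
        return [binding]
    first, rest = patterns[0], patterns[1:]
    results = []
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for pattern_term, value in zip(first, triple):
            if is_var(pattern_term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(pattern_term, value) != value:
                    matched = False
                    break
            elif pattern_term != value:
                matched = False
                break
        if matched:
            results.extend(answer(rest, triples, candidate))
    return results


# Translated query 1) from the slide, with x written as ?x.
QUERY = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]
print(answer(QUERY, TRIPLES))
```

Query translation produces `QUERY`; query answering is the evaluation step that binds `?x` to data elements.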
[Figure: example data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer]
Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses manually specified template (grammar) to find possible connections between them and computes answers via resulting translated queries
“my friends, who is member of queen”
[Figure: parse tree; each node is shown as [rule] matched text → semantics]
[start] my friends, who is member of [id:Queen1] → friends(x,me), member(x,Queen1)
[user-head] my friends → friends(x,me)
[user-filter] who is member of [id:1] → member(x,Queen1)
[who] who → -
[member-vp] is member of [id:1] → member(x,Queen1)
[member-of-v] is member of → member()
{band}[id:Queen1] queen → Queen1
Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees
[start] → [users]
[users] → my friends : friends(x, me)
[…] → is member of [bands] : member(x, $1)
[bands] → {band} : $1
…
Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships?
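To make the idea of production rules carrying semantics concrete, here is a minimal hand-written recursive-descent sketch for the single example sentence. The rule functions, the `BANDS` lexicon and the comma-based split are assumptions for illustration only; they do not reproduce the actual Facebook grammar.

```python
import re

# Hypothetical entity lexicon: surface form -> entity id.
BANDS = {"queen": "Queen1"}


def parse_band(text):
    """[bands] -> {band} : $1"""
    return BANDS.get(text.strip().lower())


def parse_user_filter(text):
    """[user-filter] -> who is member of [bands] : member(x, $1)"""
    match = re.fullmatch(r"who is member of (.+)", text.strip())
    if match is None:
        return None
    band = parse_band(match.group(1))
    return None if band is None else f"member(x,{band})"


def parse_user_head(text):
    """[user-head] -> my friends : friends(x, me)"""
    return "friends(x,me)" if text.strip() == "my friends" else None


def parse_start(text):
    """[start] -> [user-head] or [user-head], [user-filter]"""
    head, separator, rest = text.partition(",")
    head_semantics = parse_user_head(head)
    if head_semantics is None:
        return None
    if not separator:
        return head_semantics
    filter_semantics = parse_user_filter(rest)
    if filter_semantics is None:
        return None
    return f"{head_semantics}, {filter_semantics}"


print(parse_start("my friends, who is member of queen"))
```

Each function corresponds to one production rule and returns the semantics of the subtree it parsed, so a successful parse directly yields the translated query.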
OVERVIEW
Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs
- Entity + Relationships
- Multi-source
- Domain-independent
- Low manual effort
[Figure: Graphinder overview with the components Sem. Auto-completion and Query Translation operating over the example data graph (Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer)]
Graphinder: selected publications
• On-demand, domain-independent, relational keyword search over data graphs
– Structure index for data graphs (TKDE13b)
– Top-k exploration of translation candidates (ICDE09)
– Index-based materialization of graphs (CIKM11a)
– Ranking results using structured relevance model (SRM) (CIKM11b)
• Multi-source
– Deduplication using inferred type information: TYPifier (ICDE13), TYPiMatch (WSDM13)
– On-the-fly deduplication using SRM (WWW11)
– Ranking with deduplication (ISWC13)
– Routing keyword queries to relevant data graphs (TKDE13a)
– Hermes: keyword search over heterogeneous data graphs (SIGMOD09)
• Semantic auto-completion
– Computing valid query rewrites for given keywords (VLDB14)
QUERY TRANSLATION
0) Query Translation: constructing pseudo schema graph representing all possible connections between data elements
• Structure index for data graph: nodes are groups of data elements that share the same structure pattern
• Parameters: structure pattern with edge labels L and paths of maximum length n
• Pseudo schema
– Node groups all instances that have the same set of properties
– Structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels
• Algorithm:
– Start with one single partition/node representing all instances
– Split until all nodes are “stable”, i.e., all contained instances share the same structure pattern
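For the pseudo schema case (n = 1, L = all edge labels), the stable partition is the one in which every group shares the same set of outgoing edge labels, so the split-until-stable loop can be collapsed into a single grouping pass. The following is a minimal sketch under that assumption; the function name and the toy triples are illustrative and not taken from the cited structure index work.

```python
from collections import defaultdict


def pseudo_schema(triples):
    """Group nodes by their set of outgoing edge labels and lift edges to groups."""
    out_labels = defaultdict(set)
    nodes = set()
    for subject, predicate, obj in triples:
        out_labels[subject].add(predicate)
        nodes.update((subject, obj))

    # Structure pattern of a node = frozenset of its outgoing edge labels.
    groups = defaultdict(set)
    for node in nodes:
        groups[frozenset(out_labels.get(node, ()))].add(node)

    group_of = {node: pattern for pattern, members in groups.items() for node in members}
    schema_edges = {(group_of[s], p, group_of[o]) for s, p, o in triples}
    return dict(groups), schema_edges


# Toy data graph (illustrative facts only).
TOY_TRIPLES = [
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "writer", "Liar"),
    ("Brian May", "member", "Queen"),
    ("Queen", "producer", "Liar"),
]
groups, schema_edges = pseudo_schema(TOY_TRIPLES)
for pattern, members in groups.items():
    print(sorted(pattern), sorted(members))
```

With longer paths (n > 1) a single pass is no longer enough, which is where the iterative refinement described above becomes necessary.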
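[Figure: example data graph (Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, single; classes Person, Artist, Single; edge labels member, producer, marital status, writer) and the pseudo schema derived from it, with nodes PersonArtist, Thing12, Single, Value2 and edges member, producer, writer, marital status]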
1) Query Translation: constructing search space representing all possible interpretations of query keywords
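[Figure: search space for the query “written by freddie queen single”: pseudo schema nodes PersonArtist, Band, Single, Literal with edges member, producer, writer, marital status, augmented with the keyword matching elements Freddie Mercury, Queen, Queen Elizabeth 1, single, Single, writer, which are retrieved from the Data Index and the Schema Index]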
Keyword Interpretation: use inverted index and LM-based ranking function to return relevant schema and data elements
Search Space Construction: augment pseudo schema with query-specific keyword matching elements
• All possible connections of predicates applicable to recognized query keywords
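A minimal sketch of keyword interpretation with two inverted indexes follows. It assumes exact token matching and ranks matches by a simple token-coverage score, which is only a stand-in for the LM-based ranking function mentioned above; the labels and function names are illustrative.

```python
from collections import defaultdict


def build_index(labels):
    """Inverted index: lower-cased token -> set of element labels containing it."""
    index = defaultdict(set)
    for label in labels:
        for token in label.lower().split():
            index[token].add(label)
    return index


def interpret(keyword, index):
    """Return matching elements, best first (stand-in score: 1 / number of label tokens)."""
    matches = index.get(keyword.lower(), set())
    return sorted(matches, key=lambda label: (-1 / len(label.split()), label))


def keyword_elements(query, data_index, schema_index):
    """Look up every query keyword in both the data index and the schema index."""
    return {
        keyword: {
            "data": interpret(keyword, data_index),
            "schema": interpret(keyword, schema_index),
        }
        for keyword in query.lower().split()
    }


DATA_INDEX = build_index(["Freddie Mercury", "Brian May", "Queen", "Queen Elizabeth 1", "single"])
SCHEMA_INDEX = build_index(["Person", "Artist", "Band", "Single", "writer", "member", "producer", "marital status"])

for keyword, hits in keyword_elements("written by freddie queen single", DATA_INDEX, SCHEMA_INDEX).items():
    print(keyword, hits)
```

The matched elements are the ones that would be attached to the pseudo schema to form the query-specific search space. Note that exact matching finds nothing for "written"; mapping it to the predicate writer needs fuzzy or semantic matching.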
Top-k Subgraph Exploration
Result Retrieval & Ranking
2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements
[Figure: search space with pseudo schema nodes PersonArtist, Band, Single, Literal, edges member, producer, writer, marital status, and keyword matching elements Freddie Mercury, Queen, Queen Elizabeth 1, single]
“written by freddie queen single”
<x, type, Single> <Queen, producer, x> <Freddie Mercury, writer, x> <Queen, type, Band> <Freddie Mercury, type, Artist>
• Algorithm: score-directed top-k Steiner graph search
• Start: explore all distinct paths starting from keyword elements
• Every iteration
– One-step expansion of the current path with the highest score
– When a connecting element is found, merge paths and add the resulting graph to the candidate list
• Top-k termination: lowest score of the candidate list > highest possible score that can be achieved with paths in the queues yet to be explored
• Termination: all paths of maximum length d have been explored
• Final step: mapping rules to translate Steiner graph to structured query
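The exploration loop above can be sketched as follows. This is a simplified illustration, not the ICDE09 algorithm itself: it assumes one matching element per keyword, an undirected unweighted graph, and cost = sum of path lengths (so the highest score corresponds to the lowest cost). All names are hypothetical.

```python
import heapq
from itertools import count, product


def top_k_subgraphs(graph, keyword_elements, k=1, d=3):
    """Cost-directed exploration of subgraphs connecting all keyword elements."""
    tie_breaker = count()
    queue = []  # entries: (cost, tie, origin index, path)
    for origin, element in enumerate(keyword_elements):
        heapq.heappush(queue, (0, next(tie_breaker), origin, (element,)))
    reached = {}     # node -> {origin index: [paths ending at node]}
    candidates = {}  # frozenset of edges -> best cost found

    while queue:
        # Top-k termination: no unexplored path can still beat the k-th candidate.
        if len(candidates) >= k:
            kth_best = sorted(candidates.values())[k - 1]
            if kth_best <= queue[0][0]:
                break
        cost, _, origin, path = heapq.heappop(queue)
        node = path[-1]
        by_origin = reached.setdefault(node, {})
        by_origin.setdefault(origin, []).append(path)

        # Connecting element: reached from every keyword element, so merge paths.
        if len(by_origin) == len(keyword_elements):
            others = [by_origin[o] for o in range(len(keyword_elements)) if o != origin]
            for combo in product(*others):
                paths = (path,) + combo
                edges = frozenset(frozenset(e) for p in paths for e in zip(p, p[1:]))
                total = sum(len(p) - 1 for p in paths)
                candidates[edges] = min(total, candidates.get(edges, total))

        # One-step expansion, bounded by the maximum path length d.
        if len(path) - 1 < d:
            for neighbour in graph[node]:
                if neighbour not in path:
                    heapq.heappush(queue, (cost + 1, next(tie_breaker), origin, path + (neighbour,)))

    ranked = sorted(((c, e) for e, c in candidates.items()), key=lambda item: item[0])
    return ranked[:k]


# Toy search space (illustrative): schema nodes plus two keyword matching elements.
GRAPH = {
    "Freddie Mercury": {"Artist"},
    "Artist": {"Freddie Mercury", "Single", "Band"},
    "Single": {"Artist", "Band"},
    "Band": {"Artist", "Single", "Queen"},
    "Queen": {"Band"},
}
best_cost, best_edges = top_k_subgraphs(GRAPH, ["Freddie Mercury", "Queen", "Single"], k=1)[0]
print(best_cost, sorted(sorted(edge) for edge in best_edges))
```

The early exit is safe in this sketch because every future candidate contains at least one path that is still in the queue, so its cost cannot be lower than the cheapest queued path.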
RESULT RANKING
Ranking Using Structured LMs: Keyword query is short and ambiguous, while structured data provide rich structure information: ranking based on LMs capturing both content and structure
• Structured LMs for structured results r: $RM_r(v) = P(v \mid r)$
• Structured LM for queries using structured pseudo-relevant feedback results FR (relevance model): $RM_{F_R}(v) = P(v \mid F_R)$
• Compute distance between query and result LMs: $Score(r) = \sum_{v \in V} RM_{F_R}(v) \log RM_r(v)$
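As a minimal numeric sketch of the distance computation, the score can be read as a negative cross-entropy between the query model and a result model: the sum over terms v of the query model probability times the log of the result model probability. The small constant `epsilon` used to avoid log(0) and the toy distributions are assumptions of this sketch; the slides do not specify a smoothing method.

```python
import math


def score(query_model, result_model, epsilon=1e-9):
    """Sum over terms v of RM_FR(v) * log RM_r(v); higher means closer to the query model."""
    return sum(
        probability * math.log(result_model.get(term, 0.0) + epsilon)
        for term, probability in query_model.items()
    )


# Toy term distributions (illustrative values only).
QUERY_MODEL = {"mercury": 0.5, "brian": 0.5}
CLOSE_RESULT = {"mercury": 0.5, "brian": 0.5}
FAR_RESULT = {"protest": 0.9, "raid": 0.1}

print(score(QUERY_MODEL, CLOSE_RESULT), score(QUERY_MODEL, FAR_RESULT))
```

Results whose term distribution resembles that of the pseudo-relevant feedback results receive a higher score, which is the similarity-search behaviour of relevance models.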
Relevance Models
• Term probabilities of the query model are based on the pseudo-relevant feedback documents F
• Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents
[Figure: query “freddie queen”, feedback documents F and candidate documents, each shown with terms such as Mercury, Brian, May, Protest, Raid, Clash, Bank, West]
Structured Relevance Models
• Term probabilities of the query model are based on pseudo-relevant structured data
• Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates
[Figure: query “queen single”, feedback results F and candidate results over structured data, each shown with terms such as Mercury, Brian, May, Protest, Raid, Clash, Bank, West]
Importance of resource r w.r.t. query
Prob of observing term v in value of property e of resource r

v         RMname  RMcomment  RMx
Mercury   .091    .01        …
Brian     .082    .01        …
Champion  .081    .02        …
Protest   .001    .042       …
Raid      .006    .014       …
…         …       …          …

Ranking: construct edge-specific query model for each unique e from feedback resources FR, edge-specific model for every candidate r, and finally, compute distance

v         RMname  RMcomment  RMx
Mercury   .073    .01        …
Brian     .052    .01        …
…         …       …          …

For all resources r in FR
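The edge-specific construction can be sketched as follows, reading the annotations above as: the query model for edge e is the sum, over all resources r in FR, of the importance of r times the probability of term v in the value of property e of r. The maximum-likelihood term estimates, the equal weighting of edges in the final score and the constant `eps` are assumptions of this sketch, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict


def edge_models(resource):
    """P(v | r, e): maximum-likelihood term distribution per property e of a resource."""
    models = {}
    for edge, text in resource.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        models[edge] = {term: c / total for term, c in counts.items()}
    return models


def query_models(feedback, importance):
    """RM_e(v) = sum over r in FR of importance(r) * P(v | r, e)."""
    rm = defaultdict(lambda: defaultdict(float))
    for name, resource in feedback.items():
        for edge, distribution in edge_models(resource).items():
            for term, probability in distribution.items():
                rm[edge][term] += importance[name] * probability
    return rm


def score(rm, candidate, eps=1e-6):
    """Sum over edges e and terms v of RM_e(v) * log P(v | candidate, e)."""
    candidate_models = edge_models(candidate)
    return sum(
        probability * math.log(candidate_models.get(edge, {}).get(term, 0.0) + eps)
        for edge, distribution in rm.items()
        for term, probability in distribution.items()
    )


# Toy feedback resources with hypothetical importance weights.
FEEDBACK = {
    "r1": {"name": "Freddie Mercury", "comment": "singer of Queen"},
    "r2": {"name": "Brian May", "comment": "guitarist of Queen"},
}
IMPORTANCE = {"r1": 0.6, "r2": 0.4}
RM = query_models(FEEDBACK, IMPORTANCE)

print(score(RM, {"name": "Freddie Mercury", "comment": "Queen singer"}))
print(score(RM, {"name": "Protest Raid", "comment": "clash bank west"}))
```

Keeping one model per edge lets a match on name count differently from a match on comment, which is the structural information a flat document model would lose.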
QUERY REWRITING
Query Rewriting: find syntactically and semantically valid rewrites to suggest as user types
[Figure: query rewriting pipeline for the partially typed query “single from freddy mercury que”]
Keyword Interpretation (Data Index, Schema Index):
- Imprecise / fuzzy matching
- Match every keyword
- Matching elements: Freddie Mercury, Queen, Queen Elizabeth 1, single, Single, writer
Search Space Construction
Token rewriting via syntactic distance
1) single from freddie mercury queen
…
Token rewriting via semantic distance
1) single writer freddie mercury queen
…
Query segmentation
1) single writer “freddie mercury” queen
…
Keyword / Key Phrase Interpretation (Data Index, Schema Index):
- Precise matching
- Match keyword and key phrases
- Matching elements: Freddie Mercury, Queen, Single, writer
Search Space Construction
Result Retrieval & Ranking
Benefits:
- Higher selectivity of query terms (quality)
- Reduced number of query terms (efficiency)
- Better search experience
…
Challenges: many rewrite candidates, some are semantically not “valid” in the relational setting
single (marital status) writer “freddie mercury” queen (the queen of UK)
Token Rewriting: S is ranked high when the probability that query Q can be observed in S is high
Query Segmentation: S is ranked high when the probability that S can be observed in the data D is high
Probability that users write spelling errors / semantically related queries is independent of data D
Constant given query Q and data D
Based on Bayes' Theorem
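The formula these annotations refer to did not survive in the transcript. A plausible reconstruction, assuming S is a rewrite, Q the query and D the data, and using only the statements above (Bayes' theorem, independence of P(Q|S) from D, and P(Q|D) being constant), is:

```latex
P(S \mid Q, D)
  = \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)}
  \propto \underbrace{P(Q \mid S)}_{\text{token rewriting}} \cdot \underbrace{P(S \mid D)}_{\text{query segmentation}}
```

The first step is Bayes' theorem; the second drops the denominator, which is constant for a given Q and D, and replaces P(Q|S,D) by P(Q|S) under the stated independence from D.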
[Figure: example data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer]
single writer freddy mercury que
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state
Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query
Token Rewriting
• Modeling token rewriting P(Q|S)
• Independence assumption
• Modeling syntactic and semantic differences
single writer freddy mercury que
1) single writer “freddie mercury” queen
2) single writer “freddrick mercury” monarch
3) single writer “freddrick mercury” head of state
Split: | Concatenate: +
single | writer | freddie + mercury | queen
P(q|t) is high when q is syntactically and semantically close to t
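Under the independence assumption, P(Q|S) factorizes into a product of per-token terms P(q|t). The sketch below illustrates that factorization with unnormalized similarity scores standing in for P(q|t): syntactic closeness comes from `difflib.SequenceMatcher`, and semantic closeness from a tiny hand-made table. Both choices, the table values and the one-to-one token alignment are assumptions of this sketch rather than the VLDB14 model.

```python
from difflib import SequenceMatcher
from math import prod

# Hypothetical semantic relatedness table: (query token, rewrite token) -> score.
SEMANTIC = {("from", "writer"): 0.6, ("que", "monarch"): 0.0}


def p_token(q, t, semantic=SEMANTIC):
    """Stand-in for P(q|t): high when q is syntactically or semantically close to t."""
    syntactic = SequenceMatcher(None, q, t).ratio()
    return max(syntactic, semantic.get((q, t), 0.0))


def p_query_given_rewrite(query_tokens, rewrite_tokens):
    """Independence assumption: P(Q|S) = product over i of P(q_i | t_i)."""
    return prod(p_token(q, t) for q, t in zip(query_tokens, rewrite_tokens))


QUERY = "single from freddy mercury que".split()
REWRITE_1 = "single writer freddie mercury queen".split()
REWRITE_2 = "single writer freddrick mercury monarch".split()

print(p_query_given_rewrite(QUERY, REWRITE_1))
print(p_query_given_rewrite(QUERY, REWRITE_2))
```

Because the prefix "que" shares no characters with "monarch" and has no entry above zero in the toy table, the second rewrite scores zero here. A real model would need a richer semantic resource to rank such rewrites sensibly.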
Query Segmentation
• Modeling query segmentation P(S|D)
• Nth order Markov assumption
where $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i)$ stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i, D)$.
single writer freddie mercury que
[Figure: example data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer]
single writer freddie
α = concatenate? α = split?
Estimating Probability of Segmentation
• Maximum likelihood estimation (MLE)
where C(ti…tj) denotes the count of occurrences of the token sequence ti…tj
Segmentation in structured data setting
• Concatenate two segments si and sj when they co-occur in the data
• Split when si and sj are connected (si ↭ sj), i.e., when the two data elements ni and nj mentioning si and sj are connected in the data
single writer freddie mercury queen
[Figure: example data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single; classes Person, Artist, Single; edge labels member, producer, formed in, marital status, writer]
single writer freddie
α = concatenate? α = split?
• Two cases: (1) l(si) ≥ N; (2) l(si) < N
• (1) When the previously induced segment si has length equal to or more than N, i.e. l(si) ≥ N, it suffices to focus on si(N) to predict the next action αi on ti+1
• Estimation of probability
where C(st) denotes the count of co-occurrences of the sequence st in D and C(s ↭ t) is the count of all occurrences of token t connected to segment s
Estimating Probability of Segmentation Case 1: previous segment si has length equal or more than context N
freddie j. mercury queen freddie j. mercury queen
• (2) When the previous segment si has length less than N, i.e. l(si) < N, the action αi on the next token ti+1 depends on si and Pi(N), the set of segments preceding si that, together with si, contain at most N tokens in total, i.e.,
• Estimation of probability
where C(P ↭ s) denotes the count of all occurrences of the segment s connected to all segments in P
Estimating Probability of Segmentation Case 2: previous segment si has length less than context N
single writer freddie mercury single writer freddie mercury
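The counts C(st) and C(s ↭ t) can be illustrated on a toy data graph. The estimation formulas themselves are missing from this transcript, so the sketch below assumes the simplest normalization: each action's probability is its count divided by the sum of both counts. That normalization, the data structures and the function names are assumptions of this sketch; it covers only the basic decision for one segment s and next token t, not the context set P of case 2.

```python
def contains(label, tokens):
    """True if the token list occurs contiguously in the label."""
    words = label.lower().split()
    n = len(tokens)
    return any(words[i:i + n] == tokens for i in range(len(words) - n + 1))


def count_cooccur(labels, s, t):
    """C(st): number of data elements mentioning the sequence s followed by t."""
    return sum(contains(label, s + [t]) for label in labels.values())


def count_connected(labels, edges, s, t):
    """C(s <-> t): connected element pairs where one mentions s and the other t."""
    total = 0
    for a, b in edges:
        for x, y in ((a, b), (b, a)):
            if contains(labels[x], s) and contains(labels[y], [t]):
                total += 1
    return total


def action_probabilities(labels, edges, s, t):
    """Assumed normalization: count of each action over the sum of both counts."""
    c_concat = count_cooccur(labels, s, t)
    c_split = count_connected(labels, edges, s, t)
    total = c_concat + c_split
    if total == 0:
        return {"concatenate": 0.0, "split": 0.0}
    return {"concatenate": c_concat / total, "split": c_split / total}


# Toy data graph: node id -> label, plus undirected edges (illustrative).
LABELS = {1: "Freddie Mercury", 2: "Queen", 3: "Liar", 4: "Single", 5: "Queen Elizabeth 1"}
EDGES = [(1, 2), (1, 3), (3, 4)]

print(action_probabilities(LABELS, EDGES, ["freddie"], "mercury"))
print(action_probabilities(LABELS, EDGES, ["freddie", "mercury"], "queen"))
```

On this toy graph, "freddie" and "mercury" co-occur in one label and are therefore concatenated, while "freddie mercury" and "queen" are mentioned by two connected elements and are therefore split.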
EXPERIMENTAL RESULTS & CONCLUSIONS
• Graphinder, a relational keyword search approach for suggesting query completions, translating queries and ranking results
• Keyword translation performance
– Query translation and index-based approaches at least one order of magnitude faster than online in-memory search (bidirectional)
– Query translation comparable with index-based approaches, but with less space
• Keyword translation result quality
– According to recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement)
• Effect of query rewriting
– Better user experience
– Improves efficiency by reducing number of query terms
– Improves quality / selectivity of query terms
– …depends on complexity of queries and underlying keyword search engine
• Tight integration of query suggestion and translation
• From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system
References
– [VLDB14] Yongtao Ma, Thanh Tran. Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data. In International Conference on Very Large Data Bases (VLDB'14). Hangzhou, China, September 2014.
– [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika, Thanh Tran. Federated Entity Search Using On-the-Fly Consolidation. In International Semantic Web Conference (ISWC'13). Sydney, Australia, October 2013.
– [ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April 2013.
– [WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February 2013.
– [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.
– [TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.
– [WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April 2012.
– [CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011.
– [CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October 2011.
– [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11). Beijing, China, July 2011.
– [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009.
– [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July 2009.
BACKUP