Transcript of PhD Defense, May 16th 2006, Martin Theobald, Max Planck Institute for Informatics
TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data
PhD Defense, May 16th 2006
Martin Theobald
Max Planck Institute for Informatics
An XML-IR Scenario (INEX IEEE) [VLDB '05]

Example query:
//article[.//bib[about(.//item, "W3C")]] //sec[about(.//, "XML retrieval")] //par[about(.//, "native XML databases")]

[Figure: two sample INEX IEEE articles shown as XML trees with article, sec, par, title, bib, and item elements; text snippets include "Native XML data base systems can store schemaless data ...", "Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files ...", "Current Approaches to XML Data Management", "Native XML Data Bases", "XML-QL: A Query Language for XML" (Proc. Query Languages Workshop, W3C, 1998), "XML queries with an expressive power similar to that of Datalog ...", "What does XML add for retrieval? It adds formal ways ...", "Sophisticated technologies developed by smart people.", "The XML Files", "The Ontology Game", "The Dirty Little Secret", "There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news ...", and a bibliography item with the URL "w3c.org/xml".]

Key challenges: RANKING, VAGUENESS, PRUNING
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Data Model
XML tree model: pre/postorder labels for all tags and merged tag-term pairs (XPath Accelerator [Grust, SIGMOD '02])
Redundant full-content text nodes; full-content term frequencies ftf(ti, e)
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>
[Figure: element tree of the example article with pre/postorder labels; every element carries a redundant full-content text node, e.g., "xml data manage" for the title, "xml manage system vary wide expressive power" for the abs, "native xml data base" for the nested title, "native xml data base native xml data base system store schemaless data" for the sec, and "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data" for the article root.]

ftf("xml", article1) = 4
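To make the pre/postorder labels and the redundant full-content term frequencies concrete, here is a minimal Python sketch over the example article; the Element class and the traversal are illustrative stand-ins, not TopX's actual indexer (whose label numbering may differ).

```python
from collections import Counter

class Element:
    """One XML element with its own text and children."""
    def __init__(self, tag, text="", children=None):
        self.tag, self.text, self.children = tag, text, children or []
        self.pre = self.post = None
        self.full_content = Counter()  # full-content term frequencies ftf(t, e)

def label_and_collect(root):
    """Assign pre/postorder labels and build redundant full-content term frequencies."""
    counters = {"pre": 0, "post": 0}

    def visit(e):
        counters["pre"] += 1
        e.pre = counters["pre"]
        e.full_content.update(e.text.lower().split())
        for child in e.children:
            visit(child)
            e.full_content.update(child.full_content)  # subtree terms count for the ancestor
        counters["post"] += 1
        e.post = counters["post"]

    visit(root)

# Miniature of the example article above
article = Element("article", children=[
    Element("title", "xml data manage"),
    Element("abs", "xml manage system vary wide expressive power"),
    Element("sec", children=[
        Element("title", "native xml data base"),
        Element("par", "native xml data base system store schemaless data"),
    ]),
])
label_and_collect(article)
print(article.pre, article.post)        # pre/postorder labels of the root
print(article.full_content["xml"])      # ftf("xml", article) = 4, as on the slide
```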
Full-Content Scoring Model
Extended Okapi-BM25 probabilistic model for XML with element-specific parameterization [VLDB '05 & INEX '05]
Basic scoring idea within the IR-style family of TF*IDF ranking functions

Individual element statistics:
tag      N          avg. length  k1    b
article  12,223     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75
Additional static score mass c for relaxable structural conditions
and non-conjunctive (“andish”) XPath evaluations
bib["transactions"] vs. par["transactions"] (element-specific statistics give the same term different weights in different element types)
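As an illustration of how the per-tag statistics (N, avg. length, k1, b) from the table enter an element-level BM25-style score, here is a minimal sketch; the exact normalization and the element-frequency statistics used by TopX may differ, so treat the formula and the ef parameter as assumptions of this sketch.

```python
import math

# Per-tag statistics as in the table above (N = number of elements with this tag)
TAG_STATS = {
    "article": {"N": 12_223,    "avg_len": 2_903, "k1": 10.5, "b": 0.75},
    "sec":     {"N": 96_709,    "avg_len": 413,   "k1": 10.5, "b": 0.75},
    "par":     {"N": 1_024_907, "avg_len": 32,    "k1": 10.5, "b": 0.75},
    "fig":     {"N": 109_230,   "avg_len": 13,    "k1": 10.5, "b": 0.75},
}

def bm25_element_score(tag, ftf, elem_len, ef):
    """BM25-style score of one term for one element.

    ftf      -- full-content term frequency of the term in the element's subtree
    elem_len -- full-content length of the element
    ef       -- element frequency: how many elements with this tag contain the term
    """
    s = TAG_STATS[tag]
    k1, b = s["k1"], s["b"]
    tf_part = ((k1 + 1) * ftf) / (k1 * ((1 - b) + b * elem_len / s["avg_len"]) + ftf)
    idf_part = math.log((s["N"] - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# Hypothetical statistics for one term in two different element types
print(bm25_element_score("par", ftf=2, elem_len=40, ef=1_500))
print(bm25_element_score("sec", ftf=2, elem_len=400, ef=9_000))
```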
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Inverted Block-Index for Content & Structure
Example query: sec["xml"], title["native"], par["retrieval"]

Combined inverted index over merged tag-term pairs (on redundant element full-contents)
Sequential block-scans: elements grouped in descending order of (maxscore, docid) per list; block-scan of all elements per doc for a given (tag, term) key
Stored as inverted files or database tables (two B+-tree indexes over the full range of attributes)

Inverted list for sec["xml"]:
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

Inverted list for title["native"]:
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

Inverted list for par["retrieval"]:
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Lists are read by sorted access (SA) in descending (maxscore, docid) order; individual entries can also be fetched by random access (RA).
Navigational Index

eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

Additional element directory (here for tag sec): random accesses on a B+-tree index using (docid, tag) as key; carefully scheduled probes

Schema-oblivious indexing & querying: non-schematic, heterogeneous data sources (no DTD required); supports full NEXI syntax; supports all 13 XPath axes (+ level)
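A minimal sketch of the block-ordered sorted access described above: entries of one (tag, term) list are grouped per document, document blocks are visited in descending order of their maxscore (ties broken by docid), and all elements of a document are scanned together. The data reuses the sec["xml"] list above; the layout is illustrative, not TopX's actual storage format.

```python
from collections import defaultdict

# Entries of the sec["xml"] list from the table above: (eid, docid, score, pre, post)
SEC_XML = [
    (46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
    (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12),
]

def block_scan(entries):
    """Yield per-document blocks in descending (maxscore, docid) order."""
    blocks = defaultdict(list)
    for entry in entries:
        blocks[entry[1]].append(entry)          # group all elements of one document
    for docid, block in sorted(blocks.items(),
                               key=lambda kv: (-max(e[2] for e in kv[1]), kv[0])):
        yield docid, sorted(block, key=lambda e: -e[2])

for docid, block in block_scan(SEC_XML):
    print(docid, block)   # doc 2 first (maxscore 0.9), then doc 5 (0.85), then doc 3 (0.1)
```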
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
TopX Query Processor
Adapts the Threshold Algorithm (TA) paradigm [Fagin et al., PODS '01]
Focus on inexpensive SA & postpone expensive RA (NRA & CA)
Keep intermediate top-k & enqueue partially evaluated candidates
Lower/upper score guarantees for each candidate d; remember the set of evaluated query dimensions E(d)
worstscore(d) = Σ i∈E(d) score(ti, ed)
bestscore(d) = worstscore(d) + Σ i∉E(d) highi
Early min-k threshold termination: return the current top-k, iff bestscore(d) ≤ min-k for all remaining candidates d, where min-k is the worstscore of the current rank-k result
TopX core engine [VLDB '04]
  SA batching & efficient queue management
  Multi-threaded SA & query processing
  Probabilistic cost model for RA scheduling
  Probabilistic candidate pruning for approximate top-k results
XML engine [VLDB '05]
  Efficiently deals with uncertainty in structure & content ("andish XPath")
  Controlled amount of RA (unique among current XML top-k engines)
  Dynamically switches between document & element granularity
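The candidate bookkeeping can be sketched as a small NRA loop over per-document score lists: round-robin sorted accesses, [worstscore(d), bestscore(d)] bounds from the evaluated dimensions E(d) and the current high_i values, and min-k termination. This is a simplified, document-granularity illustration with made-up lists, not the actual TopX element-block implementation.

```python
import heapq

def nra_topk(lists, k):
    """NRA sketch: lists holds one (docid, score) sequence per query condition,
    each sorted by descending score; returns the top-k docids with their worstscores."""
    high = [lst[0][1] if lst else 0.0 for lst in lists]   # high_i: last score seen per list
    pos = [0] * len(lists)
    seen = {}                                             # docid -> {dimension i: score} = E(d)
    while True:
        progressed = False
        for i, lst in enumerate(lists):                   # one round of sorted accesses
            if pos[i] < len(lst):
                docid, score = lst[pos[i]]
                pos[i] += 1
                high[i] = score
                seen.setdefault(docid, {})[i] = score
                progressed = True
        worst = {d: sum(s.values()) for d, s in seen.items()}
        topk = heapq.nlargest(k, worst, key=worst.get)
        if not progressed:                                # all lists exhausted: scores are final
            return [(d, worst[d]) for d in topk]
        best = {d: worst[d] + sum(h for i, h in enumerate(high) if i not in seen[d])
                for d in seen}
        min_k = worst[topk[-1]] if len(topk) == k else 0.0
        # min-k termination: neither a seen candidate outside the top-k nor the virtual
        # pseudo-document of unseen documents can still exceed the rank-k worstscore
        others_done = all(best[d] <= min_k for d in seen if d not in topk)
        if len(topk) == k and others_done and sum(high) <= min_k:
            return [(d, worst[d]) for d in topk]

# Hypothetical per-document score lists for three query conditions
lists = [[("d2", 0.9), ("d5", 0.85), ("d3", 0.1)],
         [("d17", 0.9), ("d3", 0.8), ("d2", 0.5)],
         [("d1", 1.0), ("d2", 0.8), ("d5", 0.75)]]
print(nra_topk(lists, k=2))                               # [('d2', 2.2), ('d5', 1.6)]
```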
TopX Query Processing By Example (NRA)
Inverted list for sec["xml"]:
eid  docid  score  pre  post
46   2      0.9    2    15
9    2      0.5    10   8
171  5      0.85   1    20
84   3      0.1    1    12

Inverted list for title["native"]:
eid  docid  score  pre  post
216  17     0.9    2    15
72   3      0.8    14   10
51   2      0.5    4    12
671  31     0.4    12   23

Inverted list for par["retrieval"]:
eid  docid  score  pre  post
3    1      1.0    1    21
28   2      0.8    8    14
182  5      0.75   3    7
96   4      0.75   6    4
[Figure: step-by-step NRA run over the three lists for the query sec["xml"], title["native"], par["retrieval"]. Candidates are grouped per document (doc 1, doc 2, doc 3, doc 5, doc 17) and kept in a candidate queue with [worstscore, bestscore] bounds that are tightened after every round of sorted accesses; a virtual pseudo-document bounds the scores of not-yet-seen documents. The min-2 threshold rises from 0.0 over 0.5, 0.9, and 1.0 to 1.6, candidates whose bestscore drops below min-2 are pruned, and the run ends with the top-2 results at worstscore 2.2 (elements 46, 28, 51 of doc 2) and 1.6 (elements 171, 182 of doc 5).]
“Andish” XPath over Element Blocks
Incremental & non-conjunctive XPath evaluations using hash joins on the content conditions
Staircase joins [Grust, VLDB '03] on the structure
Tight & accurate [worstscore(d), bestscore(d)] bounds for early pruning (ensuring monotonic updates)
Virtual support elements for navigation
[Figure: element-block evaluation of the example query //article[.//bib[about(.//item, "W3C")]]//sec[...]//par[...]. Sorted accesses (SA) deliver scored element blocks for the tag-term conditions (sec=xml, sec=retrieve, par=native, par=xml, par=database, item=w3c); random accesses (RA) against the navigational index resolve the article and bib structure; virtual support elements with getSubtreeScore() and getParentScore() propagate score mass along the twig; structural conditions carry a static score mass C (here C = 1.0 or C = 0.2), and worstscore(d) grows monotonically as conditions are resolved.]
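On the structural side, the containment tests behind the staircase-join evaluation reduce to comparisons of pre/postorder labels. Below is a small sketch (with the element labels of doc 2 from the index tables) of the ancestor/descendant test and a naive nested-loop combination of two element blocks of the same document; the real operators are considerably more involved.

```python
def is_descendant(anc, desc):
    """anc, desc: (pre, post) labels; descendant iff pre(anc) < pre(desc) and post(desc) < post(anc)."""
    return anc[0] < desc[0] and desc[1] < anc[1]

def join_blocks(outer_block, inner_block):
    """Combine two element blocks of the same document: for every outer element,
    add the best score of an inner element contained in its subtree (0 if none)."""
    results = []
    for o_eid, o_score, o_label in outer_block:
        inner_scores = [i_score for _, i_score, i_label in inner_block
                        if is_descendant(o_label, i_label)]
        results.append((o_eid, o_score + (max(inner_scores) if inner_scores else 0.0)))
    return results

# Blocks of doc 2: sec["xml"] elements and par["retrieval"] elements, as (eid, score, (pre, post))
sec_block = [(46, 0.9, (2, 15)), (9, 0.5, (10, 8))]
par_block = [(28, 0.8, (8, 14))]
print(join_blocks(sec_block, par_block))   # element 46 contains element 28; element 9 does not
```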
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Random Access Scheduling – Minimal Probing

MinProbe: schedule RAs only for the most promising candidates
Extends "Expensive Predicates & Minimal Probing" [Chang & Hwang, SIGMOD '02]
Schedule a batch of RAs on d, only iff
  worstscore(d) + rd · c > min-k
where worstscore(d) is the evaluated content- & structure-related score, rd · c is the unresolved, static structural score mass (rd unresolved structural conditions, each worth the static score mass c), and min-k is the rank-k worstscore of the current top-k.
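The MinProbe decision itself is a one-line predicate. The sketch below uses hypothetical candidate records; c is the static score mass per structural condition and min-k the rank-k worstscore of the current top-k.

```python
def schedule_min_probe(candidate, c, min_k):
    """MinProbe: probe d only if its evaluated score plus the static score mass
    of its r_d unresolved structural conditions could still lift it above min-k."""
    return candidate["worstscore"] + candidate["unresolved_conditions"] * c > min_k

# Hypothetical candidates; min-k is the rank-k worstscore of the current top-k
min_k, c = 1.6, 0.2
cands = [
    {"id": "d3",  "worstscore": 1.0, "unresolved_conditions": 2},
    {"id": "d17", "worstscore": 1.5, "unresolved_conditions": 1},
]
for d in cands:
    print(d["id"], schedule_min_probe(d, c, min_k))   # d3: False (1.4), d17: True (1.7)
```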
Cost-based Scheduling (CA) – Ben Probing

Goal: minimize the overall execution cost #SA + cR/cS · #RA
Access costs on d are wasted if d does not make it into the final top-k (considering both structural selectivities & content scores)
Probabilistic cost model comparing different types of Expected Wasted Costs:
  EWC-RAs(d) of looking up d in the remaining structure
  EWC-RAc(d) of looking up d in the remaining content
  EWC-SA(d) of not seeing d in the next batch of b SAs
BenProbe: schedule a batch of RAs on d, iff
  #EWC-RAs|c(d) · cR/cS < #EWC-SA
Bounds the ratio between #RA and #SA
Schedule RAs late & last
Schedule RAs in ascending order of EWC-RAs|c(d)
Selectivity Estimator [VLDB '05]

Split the query into a set of basic, characteristic XML patterns: twigs, paths & tag-term pairs (for both conjunctive and "andish" evaluations)

Example: //sec[//figure="java"] [//par="xml"] [//bib="vldb"] is decomposed into
  //sec[//figure]//par   p1 = 0.682
  //sec[//figure]//bib   p2 = 0.001
  //sec[//par]//bib      p3 = 0.002
  //sec//figure          p4 = 0.688
  //sec//par             p5 = 0.968
  //sec//bib             p6 = 0.002
  //bib="vldb"           p7 = 0.023
  //par="xml"            p8 = 0.067
  //figure="java"        p9 = 0.011
PS[d satisfies all structural conditions Y] is estimated from the structural selectivities of the unresolved & non-redundant patterns in Y; PS[d satisfies a subset Y' of the structural conditions Y] additionally considers binary correlations between structural patterns and/or tag-term pairs (estimated via data sampling, query logs, etc.).
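As a simplified illustration (assuming independence and ignoring the binary correlations mentioned above), the structural probability for the still-unresolved patterns can be approximated by a product of their selectivities; the patterns and p values are taken from the example above.

```python
# Estimated selectivities of the example patterns (from the decomposition above)
SELECTIVITY = {
    '//sec[//figure]//par': 0.682,
    '//sec[//figure]//bib': 0.001,
    '//sec[//par]//bib':    0.002,
    '//sec//figure':        0.688,
    '//sec//par':           0.968,
    '//sec//bib':           0.002,
    '//bib="vldb"':         0.023,
    '//par="xml"':          0.067,
    '//figure="java"':      0.011,
}

def p_satisfies_all(unresolved_patterns):
    """P_S[d satisfies all unresolved structural conditions], assuming independence."""
    p = 1.0
    for pattern in unresolved_patterns:
        p *= SELECTIVITY[pattern]
    return p

# e.g., a candidate that still has to prove the two tag-term conditions
print(p_satisfies_all(['//par="xml"', '//bib="vldb"']))   # 0.067 * 0.023 ≈ 0.0015
```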
Score Predictor [VLDB ’04]
Consider score distributions of the content-related inverted lists
[Figure: score histograms f1 and f2 over the remaining entries of the title["native"] and par["retrieval"] lists (bounded by the current high1 and high2 values); their convolution is used to estimate PC[d gets into the final top-k], i.e., the probability that d's missing score mass exceeds the gap δ(d) between worstscore(d) and min-k.]
Convolutions of score histograms (assuming independence)
Closed-form convolutions, e.g., truncated Poisson
Moment-generating functions & Chernoff-Hoeffding bounds
Combined score predictor & selectivity estimator
Probabilistic candidate pruning: drop d from the candidate queue, iff PC[d gets into the final top-k] < ε (with probabilistic guarantees for relative precision & recall)
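A minimal sketch of the pruning step, assuming independence between the remaining lists: score histograms are convolved and the candidate is dropped when the estimated probability of closing the gap δ(d) to min-k falls below ε. Bucket granularity and the closed-form predictors (truncated Poisson, Chernoff-Hoeffding bounds) are not modeled here, and the score distributions are made up.

```python
import numpy as np

def histogram(scores, buckets=10, hi=1.0):
    """Normalized score histogram of the remaining entries of one inverted list."""
    h, _ = np.histogram(scores, bins=buckets, range=(0.0, hi))
    return h / h.sum()

def p_in_topk(remaining_lists, delta, buckets=10, hi=1.0):
    """Estimate PC[d gets into the final top-k] = P[sum of missing dimension scores > delta]."""
    dist = np.array([1.0])                              # distribution of the sum, starts at 0
    for scores in remaining_lists:
        dist = np.convolve(dist, histogram(scores, buckets, hi))   # independence assumption
    support = np.arange(len(dist)) * (hi / buckets)     # approximate value of each bucket sum
    return float(dist[support > delta].sum())

# Hypothetical remaining score distributions of two unresolved content conditions
rng = np.random.default_rng(0)
rest_title = rng.uniform(0.0, 0.8, 200)
rest_par = rng.uniform(0.0, 0.7, 200)

min_k, worstscore_d, eps = 1.6, 0.5, 0.1
delta = min_k - worstscore_d
print(p_in_topk([rest_title, rest_par], delta) < eps)   # True -> drop the candidate
```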
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Dynamic and Self-tuning Query Expansion [SIGIR ’05]
Incrementally merge inverted lists for a set of active expansions exp(t1)..exp(tm) in descending order of scores s(ti, d)
Max-score aggregation for fending off topic drifts
Dynamically expand set of active expansions only when beneficial for finding the final top-k results
Specialized expansion operators: Incremental Merge operator; Nested Top-k operator (phrase matching)
Supports text, structured records & XML
Boolean (but ranked) retrieval mode
[Figure: incremental merge for TREC Robust topic no. 363, Top-k(transport, tunnel, ~disaster). The expansion ~disaster incrementally merges the inverted lists for disaster, accident, and fire; the merged list (d42, d11, d92, d37, ...) is consumed by the outer top-k operator via sorted accesses (SA), alongside the lists for transport and tunnel.]
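A sketch of the Incremental Merge idea: the inverted lists of the active expansions are merged lazily in descending order of scores (weighted here by a per-expansion similarity, which is an assumption of this sketch), and a document keeps only the maximum expansion score it has been seen with (max-score aggregation against topic drift). The terms, similarity weights, and lists are hypothetical; the real operator feeds its output into the enclosing top-k operator via sorted accesses.

```python
import heapq

def incremental_merge(expansion_lists, sims):
    """Lazily yield (docid, score) in descending score order across all expansion lists.

    expansion_lists: term -> list of (docid, score), each sorted by descending score
    sims:            term -> similarity weight of the expansion to the original term
    """
    heap, positions = [], {}
    for term, lst in expansion_lists.items():
        if lst:
            docid, score = lst[0]
            heapq.heappush(heap, (-score * sims[term], term, docid))
            positions[term] = 1
    best = {}                                       # max-score aggregation per document
    while heap:
        neg_score, term, docid = heapq.heappop(heap)
        score = -neg_score
        if score > best.get(docid, 0.0):
            best[docid] = score
            yield docid, score
        lst = expansion_lists[term]
        if positions[term] < len(lst):
            nxt_doc, nxt_score = lst[positions[term]]
            positions[term] += 1
            heapq.heappush(heap, (-nxt_score * sims[term], term, nxt_doc))

# Hypothetical expansions of ~disaster with similarity weights
lists = {"disaster": [("d42", 0.9), ("d11", 0.8)],
         "accident": [("d21", 0.7), ("d11", 0.6)],
         "fire":     [("d37", 0.9), ("d42", 0.5)]}
sims = {"disaster": 1.0, "accident": 0.8, "fire": 0.6}
for doc, score in incremental_merge(lists, sims):
    print(doc, score)   # d42 0.9, d11 0.8, d21 0.56, d37 0.54
```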
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Data Collections & Competitors
INEX '04 Ad-hoc Track setting
  IEEE collection with 12,223 docs & 12M elements in 534 MB of XML data
  46 NEXI queries with official relevance judgments and a strict quantization
  e.g., //article[.//bib="QBIC" and .//par="image retrieval"]
TREC '04 Robust Track setting
  Aquaint news collection with 528,155 docs in 1,904 MB of text data
  50 "hard" queries from TREC Robust Track '04 with official relevance judgments
  e.g., "transportation tunnel disasters" or "Hubble telescope achievements"
Competitors for the XML setup
  DBMS-style Join&Sort: index full scans on the TopX index (Holistic Twig Joins)
  StructIndex [Kaushik et al., SIGMOD '04]: top-k with separate indexes for content & structure, DataGuide-like structural index, eager RAs (Fagin's TA)
  StructIndex+: extent chaining technique for DataGuide-based extent identifiers (skip scans on the content index)
INEX: TopX vs. Join&Sort & StructIndex
[Chart: #SA + #RA (in millions) as a function of k (1 to 1,000) for Join&Sort, StructIndex, StructIndex+, and TopX with BenProbe and MinProbe scheduling.]
46 NEXI queries:

Method            k      ε    #SA          #RA        CPU sec  P@k   MAP@k  rel. Prec
TopX – MinProbe   10     0.0  635,507      64,807     1.38     –     –      –
TopX – BenProbe   10     0.0  723,169      84,424     3.22     –     0.09   –
Join&Sort         10     n/a  109,122,318  –          12.0     –     –      –
StructIndex       10     n/a  761,970      3,25,068   17.02    –     –      –
StructIndex+      10     n/a  77,482       5,074,384  80.02    0.34  –      1.00
TopX – BenProbe   1,000  0.0  882,929      1,902,427  16.10    0.03  0.17   1.00
INEX: TopX with Probabilistic Pruning
46 NEXI queries, TopX – MinProbe, k = 10:

ε     #SA      #RA     CPU sec  P@10  rel. Prec
0.00  635,507  64,807  1.38     0.34  1.00
0.25  392,395  56,952  2.31     0.34  0.77
0.50  231,109  48,963  0.92     0.31  0.65
0.75  102,118  42,174  0.46     0.33  0.51
1.00  36,936   35,327  0.46     0.30  0.38

MAP@10 stays between 0.07 and 0.09 across these runs.
[Charts: relative precision, P@10, and MAP as functions of ε, and #SA + #RA as a function of ε, for TopX – MinProbe.]
TREC Robust: Dynamic vs. Static Query Expansion
Careful WordNet expansions using automatic word sense disambiguation & phrase detection [WebDB '03 & PKDD '05] with m < 118
MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
Incremental Merge + Nested Top-k (mtop < 22) vs. static expansions (mtop < 118)
[Charts: relative precision, P@10, and MAP, as well as #SA and #RA (in millions), as functions of ε for Incremental Merge vs. static expansion; 50 keyword + phrase queries.]
Conclusions
Efficient and versatile TopX query processor
  Extensible framework for XML-IR & full-text search
  Very good precision/runtime ratio for probabilistic candidate pruning
  Self-tuning solution for robust query expansions & IR-style vague search
  Combined SA and RA scheduling close to the lower bound for CA access cost [submitted to VLDB '06]
Scalability
  Optimized for query-processing I/O
  Exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE)
  Extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)
INEX 2006
  New Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~6 GB raw XML)
  Official host for the Topic Development and Interactive Track (69 groups registered worldwide)
  TopX WebService available (SOAP connector)
That's it. Thank you!
TREC Terabyte: Comparison of Scheduling Strategies
Thanks to Holger Bast & Deb Majumdar!