The Role of Document Structure in Querying, Scoring and Evaluating XML Full-Text Search
Sihem Amer-Yahia
AT&T Labs Research - USA, Database Department
Talk at the Universities of Toronto and Waterloo, Nov. 9th and 10th, 2005
Outline
• Introduction
• Querying
• Scoring
• Evaluation
• Open Issues
Outline
• Introduction
  • IR vs. Structured Document Retrieval (SDR)
  • XML vs. IR Search
• Querying
• Scoring
• Evaluation
• Open Issues
IR vs SDR
Traditional IR finds documents relevant to a user’s information need, e.g., an entire book.
SDR lets users retrieve document components that are more focused on their information need, e.g., a chapter or a page.
• Improve precision
• Exploit visual memory
Conceptual Model for IR (Van Rijsbergen 1979)
[Figure: Documents are turned into a Document representation by Indexing; the Query is turned into a Query representation by Formulation; the Retrieval function matches the two and produces Retrieval results, with Relevance feedback.]
Conceptual Model for SDR
[Figure: the conceptual IR model (Documents, Query, Indexing, Formulation, Document representation, Query representation, Retrieval function, Retrieval results, Relevance feedback), annotated for structured documents with the labels below.]
• Structured documents
• Content + structure
• Inverted file + structure index
• tf, idf, …
• Matching content + structure
• Presentation of related components
Conceptual Model for SDR (XML)
[Figure: the same conceptual model, annotated for XML with the labels and notes below.]
• Structured documents: XML is adopted to represent a mix of structure and text (e.g., Library of Congress bills, the IEEE INEX data collection).
• Content + structure: query languages referring to both content and structure are being developed for accessing XML documents, e.g., XIRQL, NEXI, XQuery FT.
• Inverted file + structure index: the structure index captures in which document component a term occurs (e.g., title, section), as well as the type of document component (e.g., XML tags).
• tf, idf, …: scoring may capture document structure.
• Matching content + structure: additional constraints are imposed on structure.
• Presentation of related components: e.g., a chapter and its sections may be retrieved.
XML Document Example
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
<bill bill-stage="Introduced-in-House">
  <congress>109th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 2739</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action>
    <action-date date="20050526">May 26, 2005</action-date>
    <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name></action-desc>
  </action>
  …
THOMAS: Library of Congress
Outline
• Introduction
• Querying
  • search context: XML nodes vs. entire document.
  • search result: XML nodes or newly constructed answers vs. entire document.
  • search expression: keyword search, Boolean operators, proximity distance, scoping, thesaurus, stop words, stemming.
  • document structure: explicitly specified in the query or used in query semantics.
• Scoring
• Evaluation
• Open Issues
Languages for XML Search
• Keyword search (CO Queries)
    “xml”
• Tag + Keyword search
    book: xml
• Path Expression + Keyword search (CAS Queries)
    /book[./title about “xml db”]
• XQuery + Complex full-text search
    for $b in /book
    let score $s := $b ftcontains “xml” && “db” distance 5
XRank
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
(Guo et al, SIGMOD 2003)
XIRQL
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <em> The XQL language </em>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
[Callout on the figure: index nodes]
(Fuhr & Großjohann, SIGIR 2001)
Similar Notion of Results
• Nearest Concept Queries (Schmidt et al, ICDE 2002)
• XKSearch (Xu & Papakonstantinou, SIGMOD 2005)
XSearch
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
      …
    </paper>
    <paper id="2">
      <title> XML Indexing </title>
      …
[Callout on the figure: not a “meaningful” result]
(Cohen et al, VLDB 2003)
XPath 2.0
fn:contains($e, string) returns true iff $e contains string
//section[fn:contains(./title, “XML Indexing”)]
(W3C 2005)
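The predicate above can be emulated outside an XPath 2.0 engine. The following is a minimal Python sketch, assuming a toy document; the sample XML and the helper name `sections_with_title_containing` are hypothetical and not from the talk. Like fn:contains, it tests plain substring containment, with no tokenization, stemming, or scoring.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the talk.
DOC = """
<paper>
  <section><title>XML Indexing Techniques</title></section>
  <section><title>Related Work</title></section>
</paper>
"""

def sections_with_title_containing(root, needle):
    """Emulate //section[fn:contains(./title, needle)] with a substring test."""
    hits = []
    for section in root.iter("section"):
        title = section.find("title")
        # itertext() concatenates all text below <title>, similar to the
        # string value that fn:contains would see.
        if title is not None and needle in "".join(title.itertext()):
            hits.append(section)
    return hits

root = ET.fromstring(DOC)
matches = sections_with_title_containing(root, "XML Indexing")
titles = [s.find("title").text for s in matches]
```

The test is case-sensitive and exact, which is the limitation that motivates the full-text extensions discussed in the rest of the talk.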
XIRQL
Weighted extension to XQL (precursor to XPath)
//section[0.6 · .//* $cw$ “XQL” + 0.4 · .//section $cw$ “syntax”]
(Fuhr & Großjohann, SIGIR 2001)
XXL
Introduces a similarity operator ~
Select Z
From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z
  and Z.animals.(animal)?.specimen as A
  and A.species ~ “lion”
  and A.birthplace.#.country as B
  and A.region ~ B.content
(Theobald & Weikum, EDBT 2002)
NEXI
• Narrowed Extended XPath I
• INEX Content-and-Structure (CAS) Queries
• Specifically targeted for content-oriented XML search (i.e. “aboutness”)
//article[about(.//title, apple) and about(.//sec, computer)]
(Trotman & Sigurbjörnsson, INEX 2004)
Schema-Free XQuery
Meaningful least common ancestor (mlcas)
for $a in doc(“bib.xml”)//author,
    $b in doc(“bib.xml”)//title,
    $c in doc(“bib.xml”)//year
where $a/text() = “Mary” and exists mlcas($a, $b, $c)
return <result> {$b, $c} </result>
(Li, Yu, Jagadish, VLDB 2004)
TeXQuery and XQuery FT
• Fully composable FT primitives.
• Composable with XPath/XQuery.
• Based on a formal model.
• Scoring and ranking on all predicates.
(Amer-Yahia, Botev, Shanmugasundaram, WWW 2004)
(http://www.w3.org/TR/xquery-full-text/, W3C 2005)
[Figure: timeline labeled 2003 and 2004 + 2005, showing TeXQuery (Cornell U., AT&T Labs) and the IBM, Microsoft and Oracle proposals leading to the XQuery Full-Text drafts.]
FTSelections and FTMatchOptions
FTWord | FTAnd | FTOr | FTNot | FTMildNot | FTOrder | FTWindow | FTDistance | FTScope | FTTimes | FTSelection (FTMatchOptions)*
books//title [. ftcontains “usability” case sensitive with thesaurus “synonyms” ]
books//abstract [. ftcontains (“usability” || “web-testing”) ]
books//content ftcontains (“usability” && “software”) window at most 3 ordered with stopwords
books//abstract [. ftcontains (“Utilisation” language “French” with stemming && “.?site” with wildcards) same sentence ]
books//title ftcontains “usability” occurs 4 times && “web-testing” with special characters
books//book/section [. ftcontains books/book/title ]/title
FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN
Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains “Usability” && “testing” and ./price < 10.00]
ORDER BY $s
RETURN $b
[Callout on the figure: in any order]
GalaTex Architecture
(http://www.galaxquery.org/galatex)
[Figure: architecture diagram with these components: .xml documents (<doc> Text Text Text Text </doc>); Preprocessing & Inverted Lists Generation, producing inverted lists stored as .xml (<xml> <doc> Text Text Text Text </doc> </xml>); the GalaTex Parser, which takes an XQFT query and produces an equivalent XQuery query; the Galax XQuery Engine; and evaluation of the Full-Text primitives (FTWord, FTWindow, FTTimes, etc.) through a positions API.]
Scoring
• Keyword queries and Tag + Keyword queries
  • initial term weights per element.
  • elements with the same tag may have the same score.
  • score propagation along document structure.
  • overlapping elements.
• Path Expression + Keyword queries
  • initial term weights based on paths.
• XQuery + Complex full-text queries
  • compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
Term Weights
[Figure: an Article node queried with ?XML, ?search, ?retrieval, with children Title, Section 1 and Section 2. Term weights on the children: 0.9 XML, 0.5 XML, 0.2 XML, 0.4 search, 0.7 retrieval. Edge weights: 0.5, 0.8, 0.2.]
• How to obtain document and collection statistics (e.g., tf, idf)?
• How to estimate element scores (frequency, user studies, size)?
• Which components contribute best to the content of Article?
• Do we need edge weights (e.g., size, number of children)?
• Is element size an issue?
Score Propagation (XXL)
[Figure: an Article node queried with ?XML, ?search, ?retrieval, with children Title, Section 1 and Section 2 and term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 search, 0.7 retrieval.]
• Compute similar terms with relevance score r1 using an ontology (weighted distance in the ontology graph).
• Compute the TF-IDF of each term for a given element content, giving relevance score r2.
• The relevance of an element content for a term is r1*r2.
• Probabilities of conjunctions are multiplied (independence assumption) along elements of the same path to compute the path score.
(Theobald & Weikum, EDBT 2002)
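The combination rule on this slide can be illustrated with a few lines of Python. This is only a sketch of the arithmetic (r1 times r2 per condition, then a product along the path); the numeric values and the function names are hypothetical, and the ontology lookup and TF-IDF computation themselves are not modeled.

```python
from math import prod

def element_relevance(r1, r2):
    """Relevance of an element's content for a term: ontology similarity r1
    times the tf-idf-based relevance r2."""
    return r1 * r2

def path_score(condition_scores):
    """Multiply per-condition relevance values along one path
    (independence assumption)."""
    return prod(condition_scores)

# Hypothetical values: the first condition matched a similar term with
# r1 = 0.8 and r2 = 0.5; the second matched exactly (r1 = 1.0) with r2 = 0.9.
scores = [element_relevance(0.8, 0.5), element_relevance(1.0, 0.9)]
total = path_score(scores)
```

Because the scores are multiplied, a single weak condition pulls down the score of the whole path, which is the practical consequence of the independence assumption.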
Overlapping elements
[Figure: an Article node queried with ?XML, ?search, ?retrieval, with children Title, Section 1 and Section 2 and term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 search, 0.7 retrieval.]
• Section 1 and Article are both relevant to “XML retrieval”. Which one should be returned in order to reduce overlap?
• Should the decision be based on user studies, size, types, etc.?
Controlling Overlap
• Start with a component ranking; elements are then re-ranked to control overlap.
• Retrieval status values (RSV) of those components containing or contained within higher ranking components are iteratively adjusted:
1. Select the highest ranking component.
2. Adjust the RSV of the other components.
3. Repeat steps 1 and 2 until the top m components have been selected.
(Clarke, SIGIR 2005)
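The three steps above can be sketched as a greedy loop. The sketch below is an illustration only: elements are identified by hypothetical path tuples, and the adjustment in step 2 is assumed to be a fixed multiplicative `penalty`, which is a stand-in and not the adjustment formula from the cited paper.

```python
def overlaps(a, b):
    """True if one element path is a proper ancestor (prefix) of the other."""
    shorter, longer = sorted((a, b), key=len)
    return a != b and longer[:len(shorter)] == shorter

def rerank(rsv, m, penalty=0.5):
    """Greedy re-ranking: repeatedly pick the best remaining element, then
    down-weight every remaining element that contains it or is contained in it."""
    remaining = dict(rsv)
    selected = []
    while remaining and len(selected) < m:
        best = max(remaining, key=remaining.get)   # step 1
        selected.append(best)
        del remaining[best]
        for elem in remaining:                     # step 2 (assumed penalty)
            if overlaps(elem, best):
                remaining[elem] *= penalty
    return selected                                # step 3: loop until m picked

# Hypothetical initial ranking: element path -> retrieval status value.
rsv = {
    ("article",): 0.8,
    ("article", "sec1"): 0.9,
    ("article", "sec2"): 0.5,
    ("other",): 0.45,
}
top = rerank(rsv, 3)
```

Without the adjustment, the whole article would be ranked second even though its best section was already returned; with it, the non-overlapping elements move ahead.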
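ElemRank
[Figure: element graph with hyperlink edges (three outgoing hyperlink edges, each weighted d1/3) and containment edges (two subelement edges, each weighted d2/2, and a parent edge weighted d3).]
• d1: probability of following a hyperlink
• d2: probability of visiting a subelement
• d3: probability of visiting the parent
• 1-d1-d2-d3: probability of a random jump
$$e(v) = \frac{1 - d_1 - d_2 - d_3}{N(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$$
(Guo et al, SIGMOD 2003)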
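A small power-iteration sketch may help in reading the ElemRank recurrence: each element passes a share d1 of its score along hyperlinks, d2 to its subelements, d3 to its parent, and the rest is a random jump. The code below assumes the random-jump mass is spread uniformly over all elements (a simplification), uses illustrative default values for d1, d2 and d3, and runs on a hypothetical three-element graph.

```python
def elemrank(elements, hyperlinks, containment, d1=0.35, d2=0.25, d3=0.25, iters=50):
    """Iterate the ElemRank recurrence on a small graph.

    hyperlinks:  list of (source, target) hyperlink edges (HE)
    containment: list of (parent, child) containment edges (CE)
    """
    n = len(elements)
    children = {u: [c for (p, c) in containment if p == u] for u in elements}
    out_links = {u: [t for (s, t) in hyperlinks if s == u] for u in elements}
    parent = {c: p for (p, c) in containment}
    e = {v: 1.0 / n for v in elements}
    for _ in range(iters):
        # Random-jump term, assumed uniform over all elements.
        new = {v: (1 - d1 - d2 - d3) / n for v in elements}
        for u in elements:
            for v in out_links[u]:          # follow a hyperlink
                new[v] += d1 * e[u] / len(out_links[u])
            for v in children[u]:           # visit a subelement
                new[v] += d2 * e[u] / len(children[u])
            if u in parent:                 # visit the parent
                new[parent[u]] += d3 * e[u]
        e = new
    return e

# Hypothetical graph: root r contains a and b; a links to b.
ranks = elemrank(["r", "a", "b"], hyperlinks=[("a", "b")],
                 containment=[("r", "a"), ("r", "b")])
```

In this toy graph the two siblings receive the same share from their parent, so the hyperlink from a to b is the only thing that separates them.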
Scoring
• Keyword queries
  • compute possibly different scores.
• Tag + Keyword queries
  • compute scores based on tags and keywords.
• Path Expression + Keyword queries
  • compute scores based on paths and keywords.
• XQuery + Complex full-text queries
  • compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
Vector-based Scoring (JuruXML)
• Transform the query into (term, path) conditions:
    article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
• (term, path)-pairs:
    hypercube, article/bm/bib/bibl/bb
    mesh, article/bm/bib/bibl/bb
    torus, article/bm/bib/bibl/bb
    nonnumerical, article/bm/bib/bibl/bb
    database, article/bm/bib/bibl/bb
• Modified cosine similarity as retrieval function for vague matching of path conditions.
(Mass et al, INEX 2002)
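JuruXML Vague Path Matching
Modified vector-based cosine similarity:
$$\mathrm{sim}(Q, D) = \frac{1}{|Q|\,|D|} \sum_{(t_i, c_i^Q) \in Q} \; \sum_{(t_i, c_j^D) \in D} w_Q(t_i, c_i^Q)\, w_D(t_i, c_j^D)\, cr(c_i^Q, c_j^D)$$
$$cr(c_i^Q, c_j^D) = \begin{cases} \dfrac{1 + |c_i^Q|}{1 + |c_j^D|} & \text{if } c_i^Q \text{ is a subsequence of } c_j^D \\ 0 & \text{otherwise} \end{cases}$$
Example of length normalization: cr(article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5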
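The length-normalization example can be checked with a short sketch of the context resemblance function. It assumes that contexts are slash-separated paths and that the matching condition is a subsequence test on path steps; the function names are hypothetical.

```python
def is_subsequence(query_path, doc_path):
    """True if the query steps occur in the document path in the same order,
    possibly with other steps in between."""
    remaining = iter(doc_path)
    # Each membership test consumes the iterator up to the match,
    # so the steps must be found in order.
    return all(step in remaining for step in query_path)

def context_resemblance(query_ctx, doc_ctx):
    """cr = (1 + |query path|) / (1 + |document path|) on a match, else 0."""
    q = query_ctx.strip("/").split("/")
    d = doc_ctx.strip("/").split("/")
    if is_subsequence(q, d):
        return (1 + len(q)) / (1 + len(d))
    return 0.0

value = context_resemblance("article/bibl", "article/bm/bib/bibl/bb")
```

Here the query path has 2 steps and the document path has 5, giving (1 + 2) / (1 + 5) = 0.5, the value shown on the slide. A query path whose steps appear in the wrong order scores 0.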
XML Query Relaxation
Tree pattern relaxations:
• Leaf node deletion
• Edge generalization
• Subtree promotion
[Figure: a Query tree pattern rooted at book, with edition (paperback), info and author (Dickens), shown with relaxed variants (including an optional “edition?”) next to Data trees whose authors are “C. Dickens” and “Charles Dickens”.]
(Schlieder, EDBT 2002)
(Delobel & Rousset, 2002)
(Amer-Yahia, Lakshmanan, Pandit, SIGMOD 2004)
A Family of Scoring Methods
• Twig scoring: high quality, expensive computation
• Path scoring
• Binary scoring: low quality, fast computation
[Figure: the Query tree pattern over book, info, edition (paperback) and author (Dickens), shown whole for twig scoring and decomposed into a sum (+) of smaller patterns rooted at book for path scoring and binary scoring.]
(Amer-Yahia, Koudas, Marian, Srivastava, Toman, VLDB 2005)
Scoring
• Keyword queries
  • compute possibly different scores.
• Tag + Keyword queries
  • compute scores based on tags and keywords.
• Path Expression + Keyword queries
  • compute scores based on paths and keywords.
  • evaluate effectiveness of scoring methods.
• XQuery + Complex full-text queries
  • compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
  • compose approximation on structure and on text.
Outline
• Introduction
• Querying
• Scoring
• Evaluation
  • Formalization of existing XML search languages
  • Structure-aware evaluation algorithms
  • Implementation in GalaTex
• Open Issues
LOC document fragment
[Figure: tree view of a bill document. Elements: <bill>, <congress>, <session>, <action>, <legis_body>, <action-date>, <action-desc>, <sponsor>, <co-sponsor>, <committee-name>, <committee-desc>. Text nodes: “109th”, “1st session”, “… Committee on Education …”, “… and the Workforce”, “… Mr. Jefferson”.]
Sample Query on LOC
Find action descriptions of bills introduced by “Jefferson” with a committee name containing the words “education” and “workforce” at a distance of no more than 5 words in the text
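Data model
• Relation R with columns Node and tokPos.
• Word position list: workforce {1, 3}; education {2}
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2 and 1.2.1, containing the tokens Workforce (position 1), Education (position 2) and Workforce (position 3).]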
Data model instantiation
One relation per keyword in the document.
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at position 1 (node 1.1.1); k1, k2, k1 at positions 2, 3, 4 (node 1.1.2); k2 at position 5 (node 1.2.1); k1 at position 6 (node 1.2.2).]
Instance 1: Rk1 (redundant storage; each tuple is self-contained)
Node     tokPos
1.2.2    k1 ; {6}
1.2      k1 ; {6}
1.1.2    k1 ; {2, 4}
1.1.1    k1 ; {1}
1.1      k1 ; {1, 2, 4}
1        k1 ; {1, 2, 4, 6}
Instance 2: scuRk1 (no redundant positions; smallest number of nodes)
Node     tokPos
1.2.2    k1 ; {6}
1.1.2    k1 ; {2, 4}
1.1.1    k1 ; {1}
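The relationship between the two instances can be made concrete: Instance 1 is what results from copying each position in Instance 2 to every ancestor of the node that directly contains it. The sketch below assumes Dewey-style node identifiers as on the slide; the function names are hypothetical.

```python
def ancestors_and_self(node):
    """Yield a Dewey identifier and all of its ancestors,
    e.g. 1.1.2 -> 1.1.2, 1.1, 1."""
    parts = node.split(".")
    for i in range(len(parts), 0, -1):
        yield ".".join(parts[:i])

def all_nodes_instance(scu):
    """Build Instance 1 from Instance 2 by propagating positions to every ancestor."""
    full = {}
    for node, positions in scu.items():
        for anc in ancestors_and_self(node):
            full.setdefault(anc, set()).update(positions)
    return {node: sorted(pos) for node, pos in full.items()}

# Instance 2 for k1, as on the slide.
scu_r_k1 = {"1.1.1": [1], "1.1.2": [2, 4], "1.2.2": [6]}
r_k1 = all_nodes_instance(scu_r_k1)
```

The three SCU tuples expand to the six tuples of Instance 1, which is the redundancy the SCU representation avoids.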
FT-Algebra and Query Plan
[Figure: query plan tree. Leaves: R“education”, R“workforce”, R“Jefferson” and EC, combined by joins (×). Selections: σordered({“education”,“workforce”}) and σdistance({“education”},{“workforce”};≤5). Root: ∏node.]
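A selection such as σdistance filters joined tuples by looking only at token positions. The following sketch assumes that distance is measured as the absolute difference between two token positions, which is a simplifying assumption rather than the definition used in the talk; the rows and function names are hypothetical.

```python
def within_distance(positions_a, positions_b, max_distance):
    """True if some occurrence of a and some occurrence of b are at most
    max_distance positions apart (assumed distance measure)."""
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

def select_distance(rows, term_a, term_b, max_distance):
    """Keep rows (node -> {term: positions}) that satisfy the distance predicate."""
    return {
        node: toks for node, toks in rows.items()
        if within_distance(toks[term_a], toks[term_b], max_distance)
    }

# Hypothetical joined rows: node id -> token positions per keyword.
rows = {
    "1.1": {"education": [2], "workforce": [1, 3]},
    "1.2": {"education": [40], "workforce": [3]},
}
kept = select_distance(rows, "education", "workforce", 5)
```

Only node 1.1 survives, since its two keywords occur one position apart, while in node 1.2 they are 37 positions apart.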
Join Evaluation
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at positions 1, 2, 4, 6 and k2 at positions 3, 5.]
Rk1
Node     tokPos
1.2.2    k1 ; {6}
1.2      k1 ; {6}
1.1.2    k1 ; {2, 4}
1.1.1    k1 ; {1}
1.1      k1 ; {1, 2, 4}
1        k1 ; {1, 2, 4, 6}
Rk2
Node     tokPos
1.2.1    k2 ; {5}
1.2      k2 ; {5}
1.1.2    k2 ; {3}
1.1      k2 ; {3}
1        k2 ; {3, 5}
Rk1 × Rk2
Node     tokPos
1.2      k2 ; {5}       k1 ; {6}
1.1.2    k2 ; {3}       k1 ; {2, 4}
1.1      k2 ; {3}       k1 ; {1, 2, 4}
1        k2 ; {3, 5}    k1 ; {1, 2, 4, 6}
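With the all-nodes instance, the join is an equality join on the Node column. A minimal sketch using the relations from this slide follows; the dictionary representation and the name `join_on_node` are assumptions made for illustration.

```python
def join_on_node(r_a, r_b, name_a, name_b):
    """Join two keyword relations on node id, keeping both position lists."""
    return {
        node: {name_a: r_a[node], name_b: r_b[node]}
        for node in r_a if node in r_b
    }

# Relations for k1 and k2 as shown on the slide (node -> token positions).
r_k1 = {"1.2.2": [6], "1.2": [6], "1.1.2": [2, 4],
        "1.1.1": [1], "1.1": [1, 2, 4], "1": [1, 2, 4, 6]}
r_k2 = {"1.2.1": [5], "1.2": [5], "1.1.2": [3], "1.1": [3], "1": [3, 5]}

joined = join_on_node(r_k1, r_k2, "k1", "k2")
```

The result contains exactly the four nodes that hold both keywords somewhere in their subtree, because every ancestor already carries the propagated positions.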
Join Evaluation on SCU
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at positions 1, 2, 4, 6 and k2 at positions 3, 5.]
scuRk1
Node     tokPos
1.2.2    k1 ; {6}
1.1.2    k1 ; {2, 4}
1.1.1    k1 ; {1}
scuRk2
Node     tokPos
1.2.1    k2 ; {5}
1.1.2    k2 ; {3}
scuRk1 × scuRk2
Node     tokPos
1.1.2    k2 ; {3}    k1 ; {2, 4}
Need for LCAs
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at positions 1, 2, 4, 6 and k2 at positions 3, 5.]
scuRk1
Node     tokPos
1.2.2    k1 ; {6}
1.1.2    k1 ; {2, 4}
1.1.1    k1 ; {1}
scuRk2
Node     tokPos
1.2.1    k2 ; {5}
1.1.2    k2 ; {3}
scuRk1 × scuRk2
Node     tokPos
1.2      k2 ; {5}       k1 ; {6}
1.1.2    k2 ; {3}       k1 ; {2, 4}
1.1      k2 ; {3}       k1 ; {1}
1        k2 ; {3, 5}    k1 ; {1, 2, 4, 6}
(Schmidt et al, ICDE 2002)
(Li, Yu, Jagadish, VLDB 2004)
(Guo et al, SIGMOD 2003)
(Xu & Papakonstantinou, SIGMOD 2005)
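The nodes in the result table are the lowest common ancestors (LCAs) of pairs of nodes drawn from the two SCU relations. With Dewey identifiers, the LCA is the longest common prefix. The sketch below enumerates all pairs for clarity, which is quadratic and is not how an efficient implementation would proceed; the function names are hypothetical.

```python
def lca(a, b):
    """Lowest common ancestor of two Dewey identifiers,
    or None if they share no prefix."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common) if common else None

def scu_join_nodes(scu_a, scu_b):
    """Set of LCAs over all pairs of nodes from two SCU relations."""
    return {lca(a, b) for a in scu_a for b in scu_b} - {None}

# SCU relations for k1 and k2 as shown on the slide.
scu_r_k1 = {"1.2.2": [6], "1.1.2": [2, 4], "1.1.1": [1]}
scu_r_k2 = {"1.2.1": [5], "1.1.2": [3]}
nodes = scu_join_nodes(scu_r_k1, scu_r_k2)
```

The four LCAs are the same four nodes produced by the join over the all-nodes instance, obtained here from three and two input tuples instead of six and five.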
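SCU: is LCA enough?
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at positions 1, 2, 4, 6 and k2 at positions 3, 5.]
scuRk1 × scuRk2
Node     tokPos
1.2      k2 ; {5}       k1 ; {6}
1.1.2    k2 ; {3}       k1 ; {2, 4}
1.1      k2 ; {3}       k1 ; {1}
1        k2 ; {3, 5}    k1 ; {1, 2, 4, 6}
Selections applied on top of the join: σdistance({“k1”},{“k2”};=2) and σordered({“k2”,“k1”})
[Figure labels on the selections: pass, fail]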
SCU: is LCA enough? (continued)
[Figure: the same document tree and scuRk1 × scuRk2 result (nodes 1.2, 1.1.2, 1.1 and 1), with the selections σdistance({“k1”},{“k2”};=2) and σordered({“k2”,“k1”}) applied on top; σordered is marked “fail”.]
Does not satisfy ‘ordered’ alone, but it should be an answer!
SCU: position propagation
[Figure: document tree with nodes 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1 and 1.2.2; k1 at positions 1, 2, 4, 6 and k2 at positions 3, 5.]
Join result before propagation:
Node     tokPos
1.2      k2 ; {5}       k1 ; {6}
1.1.2    k2 ; {3}       k1 ; {2, 4}
1.1      k2 ; {3}       k1 ; {1}
1        k2 ; {3, 5}    k1 ; {1, 2, 4, 6}
Selections: σdistance({“k1”},{“k2”};=2) and σordered({“k2”,“k1”})
After position propagation:
Node     tokPos
1.1      k2 ; {3}       k1 ; {1, 2, 4}       pass
1        k2 ; {3, 5}    k1 ; {1, 2, 4, 6}    pass
SCU Summary
• Key ideas
  • R1 ×SCU R2 → find LCA
  • σSCU(R) → propagation along document structure: if a node satisfies the σ predicate, output the node; otherwise propagate its tokPos to its first ancestor in R
• Benefit: reduces the size of intermediate results
• Challenge: minimize computation overhead
  • selections: additional column in R for direct access to ancestors; TRIE structures
  • joins: record the highest ancestor in EC of each node in scuR and use sort-merge
GalaTex Architecture: in progress
[Figure: architecture diagram with these components: .xml documents (<doc> Text Text Text Text </doc>); Preprocessing & Inverted Lists Generation, producing inverted lists (<xml> <doc> Text Text Text Text </doc> </xml>) stored in BerkeleyDB (instance 1 / 2); a Parser that translates the Full-Text query to an FT-Algebra plan; Code generation (AllNodes / SCU) producing executable code; Query Execution using the FT-Algebra operators implementation; EC; and the Galax XQuery Engine with the positions API.]
Open Issues (in no particular order)
• Difficult research issues in XML retrieval are not ‘just’ about the effective retrieval of XML documents, but also about what and how to evaluate!
• System architecture: DB on top of IR, IR on top of DB, true merging?
• Experimental evaluation of scoring methods (INEX).
• Score-aware algebra for XML for the joint optimization of queries on both structure and text.
More details: http://www.research.att.com/~sihem