Chapter 4 Query Languages
Introduction
- Cover different kinds of queries posed to text retrieval systems
- Keyword-based query languages
  - include simple words and phrases as well as Boolean operators
- Pattern matching
  - complements keyword searching with data retrieval capabilities
- Structural queries
  - querying on structure of text

Keyword-Based Querying
- Query is formulation of user information need
- Keyword-based queries are popular
  - intuitive
  - easy to express
  - allow for fast ranking
- Query can be
  - simply a word
  - in general, a more complex combination of operations involving several words

Single-Word Queries
- Most elementary query is a word
- Word is sequence of letters surrounded by separators
  - some characters are not letters but do not split a word, e.g. hyphen in ‘on-line’
- Result of word queries is
  - set of documents containing at least one of the words in query
  - resulting documents are ranked by
    - term frequency (count of word in document)
    - inverse document frequency (decreases with the number of documents in which word appears, so rarer words weigh more)

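The ranking idea above can be sketched as follows. This is a minimal illustration, not the chapter's exact formula: the weighting `tf * log(N / df)`, the whitespace tokenizer, and the name `rank_word_query` are all assumptions made for the example.

```python
import math
from collections import Counter

def rank_word_query(query_words, docs):
    """Rank documents that contain at least one query word by a tf-idf score."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)  # term frequency inside this document
        matched = [w for w in query_words if tf[w] > 0]
        if matched:
            # Assumed weighting: term frequency times log-scaled inverse document frequency.
            scores[doc_id] = sum(tf[w] * math.log(n_docs / df[w]) for w in matched)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = [
    "query languages for text retrieval",
    "text retrieval and text ranking",
    "boolean operators",
]
ranking = rank_word_query(["text"], docs)
```

Here the second document ranks first because ‘text’ occurs twice in it; the third document is not retrieved at all.
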
Context Queries
- Complement single-word queries with ability to search words in given context, i.e. near other words
- words near each other signal higher likelihood of relevance than if they appear apart
- form phrases of words or find words which are proximal in text

Phrase
- Sequence of single-word queries
  - for instance, possible to search for word ‘enhance’ and then word ‘retrieval’
- uninteresting words in text are not considered at all
  - e.g. above example query could match text such as ‘…enhance the retrieval…’

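A small sketch of phrase matching that skips uninteresting words. The stopword list and the helper names are hypothetical, chosen only to reproduce the ‘enhance the retrieval’ example.

```python
# Assumed stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def matches_phrase(phrase, text):
    """True if the phrase words occur consecutively once stopwords are ignored."""
    p = content_words(phrase)
    t = content_words(text)
    if not p:
        return False
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

hit = matches_phrase("enhance retrieval", "we enhance the retrieval quality")
miss = matches_phrase("enhance retrieval", "retrieval can enhance results")
```
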
Proximity
- Sequence of single words or phrases is given together with maximum allowed distance between them
- For instance, above query stated as ‘enhance’ and ‘retrieval’ should occur within four words
  - a possible match could be ‘… enhance the power of retrieval…’
- Distance can be measured in characters or words

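A proximity check can be sketched like this. It assumes that distance is the difference between word positions and that the first word must precede the second; the slides do not fix either choice, and `within_distance` is a name invented for the example.

```python
def within_distance(first, second, text, max_words):
    """True if `second` occurs after `first`, at most `max_words` positions later."""
    tokens = text.lower().split()
    first_pos = [i for i, w in enumerate(tokens) if w == first]
    second_pos = [i for i, w in enumerate(tokens) if w == second]
    return any(0 < j - i <= max_words for i in first_pos for j in second_pos)

text = "enhance the power of retrieval"
near = within_distance("enhance", "retrieval", text, 4)
far = within_distance("enhance", "retrieval", text, 3)
```
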
Boolean Queries
- Oldest form of combining keyword queries is to use Boolean operators
- Boolean query is composed of
  - atoms (i.e. basic queries) that retrieve documents, and
  - Boolean operators which work on their operands (sets of documents)
- query syntax tree can be defined
  - leaves are basic queries
  - internal nodes are operators

Boolean Queries (Cont.)
- Example query syntax tree: AND(‘translation’, OR(‘syntax’, ‘syntactic’))
- Retrieve all documents that contain the word ‘translation’ as well as either the word ‘syntax’ or the word ‘syntactic’

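The example tree can be evaluated bottom-up over sets of document identifiers. The sketch below assumes a toy inverted index with made-up document numbers; the tuple encoding of the tree and the function name `evaluate` are illustrative choices.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree: leaves are words, internal nodes are (operator, left, right)."""
    if isinstance(node, str):
        # Leaf: a basic query that retrieves the set of documents containing the word.
        return index.get(node, set())
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "AND":
        return left_docs & right_docs
    if operator == "OR":
        return left_docs | right_docs
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical inverted index: word -> set of document ids.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index)
```

With this index the result is the set of documents 2 and 4. The answer is an unranked set, which is the limitation discussed next.
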
Boolean Queries (Cont.)
- No ranking of retrieved documents provided
- document either satisfies query (retrieved) or does not (not retrieved)
  - does not allow partial matching between document and user query
- to overcome this limitation, idea of ‘fuzzy Boolean’ set of operators proposed
  - instead of requiring elements in all the operands (AND) or in at least one of the operands (OR), retrieve elements appearing in some of the operands

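One way to read ‘elements appearing in some of the operands’ is a threshold operator: keep documents that occur in at least m of the n operand sets. This interpretation and the name `at_least` are assumptions for illustration, not a definition taken from the slides.

```python
from collections import Counter

def at_least(m, operand_sets):
    """Documents that appear in at least m of the operand sets."""
    counts = Counter(doc for docs in operand_sets for doc in docs)
    return {doc for doc, count in counts.items() if count >= m}

operands = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
some = at_least(2, operands)
```

Setting m to the number of operands behaves like AND, and m = 1 behaves like OR, so the threshold interpolates between the two strict operators.
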
Natural Language
- Distinction between AND and OR completely blurred
- query is simply an enumeration of words and context queries
- all documents matching a portion of user query are retrieved
  - higher ranking assigned to documents matching more parts of query
- eliminates any reference to Boolean operators

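A minimal sketch of this behaviour: every document matching any query word is retrieved, and documents matching more distinct words rank higher. Counting distinct matched words is an assumed scoring rule, and `rank_by_matched_parts` is a hypothetical name.

```python
def rank_by_matched_parts(query, docs):
    """Return (doc_id, matched_word_count) pairs, best match first."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, doc in enumerate(docs):
        matched = len(terms & set(doc.lower().split()))
        if matched:  # partial matches are kept, non-matching documents are dropped
            scored.append((doc_id, matched))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

docs = ["boolean query languages", "structural query", "pattern matching"]
ranked = rank_by_matched_parts("boolean query languages", docs)
```
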
Pattern Matching
- Query formulation based on concept of pattern, which allows retrieval of pieces of text that have some property
- Pattern is set of syntactic features that occur in text segment
- Segments satisfying pattern specification said to ‘match’ the pattern
- We are interested in documents containing segments that match given search pattern

Pattern Matching (Cont.)
Most used types of pattern are:
- words
  - string (sequence of characters) that is a word in text
- prefixes
  - string that forms beginning of text word
  - prefix ‘comput’ retrieves documents with words such as ‘computer’, ‘computation’
- suffixes
  - string that forms termination of word
  - suffix ‘ters’ retrieves documents with words such as ‘testers’, ‘computers’

- substrings
  - string which can appear within word
  - substring ‘tal’ retrieves documents with words such as ‘coastal’, ‘talk’, ‘metallic’
- ranges
  - pair of strings that match any word lying between them in lexicographical order
  - alphabet is sorted, which orders strings lexicographically (dictionary order)
  - range between words ‘held’ and ‘hold’ retrieves strings such as ‘hoax’, ‘hissing’

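The prefix, suffix, substring and range patterns described so far map directly onto basic string operations. The sketch below runs them over a small made-up vocabulary; treating the range endpoints as inclusive is an assumption.

```python
# Hypothetical vocabulary of text words, kept in dictionary order.
vocabulary = [
    "coastal", "computation", "computer", "computers", "held",
    "hissing", "hoax", "hold", "metallic", "talk", "testers",
]

def by_prefix(prefix, words):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(suffix, words):
    return [w for w in words if w.endswith(suffix)]

def by_substring(fragment, words):
    return [w for w in words if fragment in w]

def by_range(low, high, words):
    # Python compares strings lexicographically; endpoints are included here.
    return [w for w in words if low <= w <= high]

prefix_hits = by_prefix("comput", vocabulary)
suffix_hits = by_suffix("ters", vocabulary)
substring_hits = by_substring("tal", vocabulary)
range_hits = by_range("held", "hold", vocabulary)
```
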
- allowing errors
  - word together with error threshold
  - retrieves all text words ‘similar’ to given word
  - pattern may have errors (typing, spelling), and documents with words with erroneous variants are retrieved (using edit distance)
  - if typing error splits ‘flower’ into ‘flo wer’, still found with one error
- regular expressions (r.e.)
  - r.e. is built up by simple strings and operators like union, concatenation and repetition
  - query like ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ will match words like ‘problem02’, ‘proteins’

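Both ideas can be tried out in a few lines. The edit distance below is the standard Levenshtein dynamic program (insertions, deletions, substitutions, each costing one), which is one common choice of error measure. The regular expression is a translation of the slide's example into Python `re` syntax, with the empty alternative written as `s?`.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]

# The stray space in 'flo wer' counts as a single insertion.
flower_errors = edit_distance("flower", "flo wer")

# pro (blem | tein) (s | empty) (0 | 1 | 2)*
pattern = re.compile(r"pro(blem|tein)s?[012]*")
matches = [w for w in ["problem02", "proteins", "program"] if pattern.fullmatch(w)]
```
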
- extended patterns
  - subset of regular expressions expressed with simpler syntax
  - classes of characters, i.e. some position in pattern matched by any character from pre-defined set (e.g. some character must be a digit, not a letter, a vowel, etc.)
  - conditional expressions, i.e. part of pattern may or may not appear
  - wild characters, i.e. match any sequence in text (e.g. any word that starts with ‘flo’ and ends with ‘ers’, which matches ‘flowers’ as well as ‘flounders’)

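Shell-style patterns are one familiar simplified syntax of this kind. The sketch uses Python's `fnmatch` for wild characters and character classes, and `re` for the optional part, since `fnmatch` has no conditional operator. The sample words other than ‘flowers’ and ‘flounders’ are invented.

```python
import re
from fnmatch import fnmatchcase

words = ["flowers", "flounders", "floor", "page7", "pageX", "color", "colour"]

# Wild characters: '*' matches any sequence of characters.
wild = [w for w in words if fnmatchcase(w, "flo*ers")]

# Class of characters: the last position must be a digit.
digits = [w for w in words if fnmatchcase(w, "page[0-9]")]

# Conditional expression: the 'u' may or may not appear.
optional = [w for w in words if re.fullmatch(r"colou?r", w)]
```
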
Structural Queries
- Allow user to query texts based on structure, and not only content
- mixing contents and structure in queries
  - can pose powerful queries (much more expressive)
- An example
  - select set of documents that satisfy certain constraints on content (using words, phrases, or patterns), and then
  - apply structural constraints expressed using containment, proximity, or structural elements (chapters, sections) present in documents

Types of structures
- fixed structure
- hypertext
- hierarchical structure

Fixed Structure
- Document has fixed set of fields
  - each field has some text inside
  - some fields not present in all documents
  - fields not allowed to nest or overlap
- retrieval activity restricted to specifying that given pattern is to be found only in given fields
- this model reasonable when text collection has fixed structure

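A field-restricted search over such documents might look like the following. The field names, the sample documents and `search_field` are all hypothetical; the point is only that a missing field simply produces no match.

```python
def search_field(docs, field, word):
    """Ids of documents whose given field contains the word; absent fields never match."""
    return [
        doc_id
        for doc_id, fields in docs.items()
        if word in fields.get(field, "").lower().split()
    ]

# Flat fields only: no nesting and no overlap, and not every field is present everywhere.
docs = {
    1: {"title": "Query languages", "abstract": "Keyword and pattern queries"},
    2: {"title": "Indexing", "abstract": "Inverted files for query processing"},
    3: {"title": "Query evaluation"},
}
in_title = search_field(docs, "title", "query")
in_abstract = search_field(docs, "abstract", "query")
```
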
Hypertext
- Retrieval from hypertext began as navigational activity
  - user manually traverses hypertext nodes, following links to search for what is wanted
  - not possible to query hypertext based on its structure
- WebGlimpse: interesting proposal to allow navigation plus ability to search by content in neighborhood of current node

Hierarchical Structure
- Represents recursive decomposition of text
- natural model for many text collections
- Figure 4.3 shows example of hierarchical structure that consists of page of a book, its schematic view and parsed query to retrieve figure

Hierarchical Models: PAT Expressions
- Structure marked in the text by tags (as in HTML)
  - defined in terms of initial and final tags
- each pair of initial and final tags defines a region, a set of contiguous text areas
  - areas of a region cannot nest or overlap
- possible to select areas containing other areas, contained in other areas, or followed by other areas

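The three selections can be sketched with areas modelled as half-open `(start, end)` character offsets. This is an illustrative model rather than the PAT implementation: the offsets are made up, and ‘followed by’ is read here as ‘some area of the other region starts at or after this area ends’, which is an assumption.

```python
def containing(outer, inner):
    """Areas of `outer` that contain at least one area of `inner`."""
    return [o for o in outer if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]

def contained_in(inner, outer):
    """Areas of `inner` that lie inside some area of `outer`."""
    return [i for i in inner if any(o[0] <= i[0] and i[1] <= o[1] for o in outer)]

def followed_by(first, second):
    """Areas of `first` with some area of `second` starting at or after their end."""
    return [a for a in first if any(b[0] >= a[1] for b in second)]

# Hypothetical regions: each is a list of non-overlapping (start, end) areas.
chapters = [(0, 100), (100, 200)]
figures = [(120, 130)]

chapters_with_figure = containing(chapters, figures)
figures_in_chapters = contained_in(figures, chapters)
chapters_before_figure = followed_by(chapters, figures)
```
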
Overlapped Lists
- Allows areas of a region to overlap, but not to nest
- considers use of inverted list where words and also regions are indexed
- allows set union to be performed, and regions to be combined

List of References
- Attempt to make definition and querying of structured text uniform, using common language
- the language allows for querying on ‘path expressions’, which describe paths in structure tree
- answers to queries are lists of ‘references’
  - reference is pointer to region of database

Proximal Nodes
- Tries to find good compromise between expressiveness and efficiency
- specifies fully compositional language where
  - leaves of query syntax tree formed by basic queries on contents or names of structural elements (e.g. all chapters)
  - internal nodes combine results
- for efficiency, operations at internal nodes must relate nodes close in text

Tree Matching
- Relies on single primitive: tree inclusion
- interpret structure of text database and query as trees
- determine embedding of query into database which respects hierarchical relationships between nodes of query
- simple queries return roots of the matches

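A deliberately simplified sketch of the idea follows. It only checks that labels agree and that each query child maps to some descendant of the node its parent maps to. It does not enforce sibling order, nor that sibling query nodes map to distinct database nodes, so it is weaker than full tree inclusion. The `Node` class, the labels and the function names are assumptions for the example.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def descendants(node):
    """Yield every proper descendant of node."""
    for child in node.children:
        yield child
        yield from descendants(child)

def embeds_at(query, data):
    """True if the query tree can be embedded with its root mapped onto `data`."""
    if query.label != data.label:
        return False
    # Each query child must embed at some descendant, preserving ancestorship.
    return all(
        any(embeds_at(query_child, d) for d in descendants(data))
        for query_child in query.children
    )

def match_roots(query, root):
    """Roots of all matches of the query inside the database tree."""
    return [n for n in [root, *descendants(root)] if embeds_at(query, n)]

# Hypothetical database tree: a book with two chapters, only the first has a figure.
database = Node("book", [
    Node("chapter", [Node("section", [Node("figure")]), Node("section", [Node("paragraph")])]),
    Node("chapter", [Node("section", [Node("paragraph")])]),
])
query = Node("chapter", [Node("figure")])
roots = match_roots(query, database)
```

The query ‘chapter containing a figure’ returns only the first chapter, even though the figure sits one level deeper inside a section.
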