Information Retrieval
Introduction/Overview
Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
CSE 5331/7331 F07 2
Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
  Find all documents about “data mining”.
DB vs IR
Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles
DB vs IR (cont’d)
Data retrieval
  which docs contain a set of keywords?
  well defined semantics
  a single erroneous object implies failure!
Information retrieval
  information about a subject or topic
  semantics is frequently loose
  small errors are tolerated
IR system:
  interpret contents of information items
  generate a ranking which reflects relevance
  notion of relevance is most important
Motivation
IR in the last 20 years:
  classification and categorization
  systems and languages
  user interfaces and visualization
Still, area was seen as of narrow interest
Advent of the Web changed this perception once and for all
  universal repository of knowledge
  free (low cost) universal access
  no central editorial board
  many problems though: IR seen as key to finding the solutions!
Basic Concepts
Logical view of the documents
Document representation viewed as a continuum: logical view of docs might shift
[Figure: logical view of the documents. Labels: Docs; structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; full text; index terms]
The Retrieval Process
[Figure: the retrieval process. Labels: User Interface; user need; user feedback; Text Operations; logical view; Query Operations; query; Indexing; inverted file; Index; DB Manager Module; Text Database; Text; Searching; retrieved docs; Ranking; ranked docs]
IR is Fuzzy
[Figure: two panels, Simple and Fuzzy, each labeled Accept and Reject]
Information Retrieval
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
Metrics:
  $\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$
  $\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$
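The two metrics can be sketched as set operations over document ids. This is a minimal illustration, not part of the original slides: the function name `precision_recall` and the ids `d1` ... `d7` are made up, and returning 0.0 for an empty set is an assumption about how to handle the undefined case.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids, chosen only for illustration.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7"}
precision, recall = precision_recall(relevant, retrieved)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, so precision is 2/3 and recall is 1/2.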
Indexing
IR systems usually adopt index terms to process queries
Index term:
  a keyword or group of selected words
  any word (more general)
Stemming might be used:
  connect: connecting, connection, connections
An inverted file is built for the chosen index terms
Indexing
[Figure: indexing. Labels: Docs; Information Need; Index Terms; doc; query; match; Ranking]
Inverted Files
There are two main elements:
  Vocabulary – set of unique terms
  Occurrences – where those terms appear
The occurrences can be recorded as term (word) offsets or byte offsets
Using term offsets is good for retrieving concepts such as proximity, whereas byte offsets allow direct access
Vocabulary    Occurrences (byte offset)
…             …
Inverted Files
The space taken by the indexed terms is often several orders of magnitude smaller than the size of the documents (MBs vs. GBs)
The space consumed by the occurrence list is not trivial. Each time a term appears, its position must be added to that term's list in the inverted file
That may lead to considerable index overhead
Example
Text:
  That house has a garden. The garden has many flowers. The flowers are beautiful
Word start positions:
  1 6 12 16 18 25 29 36 40 45 54 58 66 70
Inverted file:
  Vocabulary    Occurrences
  beautiful     70
  flowers       45, 58
  garden        18, 29
  house         6
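A rough sketch of how such an inverted file could be built from the example text. The function name `build_inverted_file`, the regular-expression tokenizer, and the lowercasing are assumptions made for illustration. Offsets here are 0-based character positions, so the entries from the first sentence come out one lower than the numbers printed above (5 and 17 rather than 6 and 18), which may simply reflect a different counting convention on the slide.

```python
import re
from collections import defaultdict

def build_inverted_file(text, vocabulary=None):
    """Map each lowercased term to the 0-based character offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        # Keep every word unless a restricted vocabulary was given.
        if vocabulary is None or term in vocabulary:
            occurrences[term].append(match.start())
    return dict(sorted(occurrences.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
# Index only the four terms kept in the example's vocabulary.
index = build_inverted_file(text, {"beautiful", "flowers", "garden", "house"})
for term, offsets in index.items():
    print(term, offsets)
```

Passing `vocabulary=None` indexes every word, which corresponds to the full text representation; the occurrence lists then grow with every repeated word.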
Ranking
A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
  common sets of index terms
  sharing of weighted terms
  likelihood of relevance
Each set of premises leads to a distinct IR model
Classic IR Models - Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, search engines assume that all words are index terms (full text representation)
Classic IR Models - Basic Concepts
The importance of the index terms is represented by weights associated with them
  ki - an index term
  dj - a document
  wij - a weight associated with (ki,dj)
The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models - Basic Concepts
t is the total number of index terms
K = {k1, k2, …, kt} is the set of all index terms
wij >= 0 is a weight associated with (ki,dj)
wij = 0 indicates that the term does not belong to the doc
$\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document dj
$g_i(\vec{d}_j) = w_{ij}$ is a function which returns the weight associated with pair (ki,dj)
The Boolean Model
Simple model based on set theory
Queries specified as boolean expressions
  precise semantics and neat formalism
Terms are either present or absent. Thus, $w_{ij} \in \{0,1\}$
Consider
  $q = k_a \wedge (k_b \vee \neg k_c)$
  $q_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  $q_{cc} = (1,1,0)$ is a conjunctive component
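The disjunctive normal form can be checked by enumerating every binary assignment of (ka, kb, kc). The sketch below assumes the query reads ka AND (kb OR NOT kc), which is the reading consistent with the three components listed on the slide; the names `query`, `dnf`, and `matches` are illustrative, and matching is restricted to the three query terms.

```python
from itertools import product

def query(ka, kb, kc):
    # Assumed reading of the slide's query: q = ka AND (kb OR NOT kc).
    return ka and (kb or not kc)

# Disjunctive normal form: every (ka, kb, kc) assignment that satisfies q.
dnf = [bits for bits in product((1, 0), repeat=3) if query(*bits)]

def matches(doc_weights):
    """A document matches if its binary weights equal some conjunctive component."""
    return tuple(doc_weights) in dnf

print(dnf)
```

A document with weights (1,1,0) on (ka, kb, kc) matches because it equals one conjunctive component; a document with (0,1,0) does not, and the model has no notion of a partial match between the two.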
The Vector Model
Use of binary weights is too limiting
Non-binary weights provide consideration for partial matches
These term weights are used to compute a degree of similarity between a query and each document
Ranked set of documents provides for better matching
The Vector Model
wij > 0 whenever ki appears in dj
wiq >= 0 associated with the pair (ki,q)
$\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
$\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$
To each term ki is associated a unit vector $\vec{i}$
The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
The t unit vectors $\vec{i}$ form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
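The slides do not state which similarity measure is used. A common choice in the vector model is the cosine of the angle between the document and query vectors, and the sketch below assumes that choice. The weights and document names are hypothetical.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Hypothetical weights over t = 3 index terms.
query = (1.0, 0.0, 1.0)
docs = {"d1": (0.5, 0.8, 0.3), "d2": (0.0, 1.0, 0.0), "d3": (1.0, 0.0, 1.0)}
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], query),
                 reverse=True)
```

With these weights d3 ranks first (same direction as the query), d1 second (a partial match), and d2 last (no shared terms), which is the kind of graded ordering binary weights cannot express.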
Query Languages
Keyword Based
Boolean
Weighted Boolean
Context Based (Phrasal & Proximity)
Pattern Matching
Structural Queries
Keyword Based Queries
Basic Queries
  Single word
  Multiple words
Context Queries
  Phrase
  Proximity
Boolean Queries
Keywords combined with Boolean operators:
  OR: (e1 OR e2)
  AND: (e1 AND e2)
  BUT: (e1 BUT e2) Satisfy e1 but not e2
Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set.
Naïve users have trouble with Boolean logic.
Boolean Retrieval with Inverted Indices
Primitive keyword: Retrieve containing documents using the inverted index.
OR: Recursively retrieve e1 and e2 and take union of results.
AND: Recursively retrieve e1 and e2 and take intersection of results.
BUT: Recursively retrieve e1 and e2 and take set difference of results.
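The four rules above map directly onto set operations. In this sketch a query is either a keyword string or a nested `(operator, e1, e2)` tuple; that representation, the name `retrieve`, and the toy index contents are assumptions for illustration only.

```python
# Hypothetical inverted index: keyword -> set of ids of documents containing it.
INDEX = {
    "data": {1, 2, 4},
    "mining": {2, 4, 5},
    "web": {3, 4},
}

def retrieve(expr, index=INDEX):
    """Evaluate a keyword or a nested (op, e1, e2) tuple against the index."""
    if isinstance(expr, str):  # primitive keyword: look it up directly
        return set(index.get(expr, set()))
    op, e1, e2 = expr
    left, right = retrieve(e1, index), retrieve(e2, index)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference: satisfy e1 but not e2
    raise ValueError(f"unknown operator: {op}")

result = retrieve(("BUT", ("AND", "data", "mining"), "web"))
```

Because BUT always subtracts from a set that was itself retrieved from the index, the evaluator never has to enumerate all documents that lack a keyword, which is the efficiency argument for not offering a standalone NOT.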
Phrasal Queries
Retrieve documents with a specific phrase (ordered list of contiguous words)
  “information theory”
May allow intervening stop words and/or stemming.
“buy camera” matches:
  “buy a camera”
  “buying the cameras”
  etc.
Phrasal Retrieval with Inverted Indices
Must have an inverted index that also stores positions of each keyword in a document.
Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions.
Best to start contiguity check with the least common word in the phrase.
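The three steps above can be sketched with a small positional index. Everything named here (`build_positional_index`, `phrase_search`, the sample documents) is illustrative; the sketch assumes whitespace tokenization and does no stemming or stop-word handling.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Intersect the document lists of all words in the phrase.
    candidates = set.intersection(*(set(index[w]) for w in words))
    # Start the contiguity check from the least common word.
    anchor = min(range(len(words)),
                 key=lambda i: sum(len(p) for p in index[words[i]].values()))
    hits = set()
    for doc_id in candidates:
        for pos in index[words[anchor]][doc_id]:
            start = pos - anchor  # where the phrase would have to begin
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical documents, chosen only for illustration.
docs = {
    1: "information theory is a theory of information",
    2: "the theory of information retrieval",
    3: "an introduction to information theory",
}
index = build_positional_index(docs)
found = phrase_search(index, "information theory")
```

Document 2 contains both words but never adjacent in that order, so it survives the intersection step and is rejected by the contiguity check.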
Proximity Queries
List of words with specific maximal distance constraints between terms.
Example: “dogs” and “race” within 4 words match “…dogs will begin the race…”
May also perform stemming and/or not count stop words.
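A minimal sketch of a two-word proximity check. It assumes distance is the difference between word positions, counts stop words, and does no stemming; the function name `within` is made up for this example.

```python
def within(text, word_a, word_b, max_distance):
    """True if word_a and word_b occur at most max_distance word positions apart."""
    tokens = text.lower().split()
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    # Compare every pair of positions; fine for short texts.
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

close = within("dogs will begin the race", "dogs", "race", 4)
far = within("dogs will soon begin the long race", "dogs", "race", 4)
```

In the first text the two words are exactly 4 positions apart, so the query matches; in the second they are 6 apart and it does not. Skipping stop words before counting, as the slide suggests, would change both distances.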
Pattern Matching
Allow queries that match strings rather than word tokens.
Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.
Simple Patterns
Prefixes: Pattern that matches start of word.
  “anti” matches “antiquity”, “antibody”, etc.
Suffixes: Pattern that matches end of word.
  “ix” matches “fix”, “matrix”, etc.
Substrings: Pattern that matches arbitrary contiguous sequence of characters.
  “rapt” matches “enrapture”, “velociraptor”, etc.
Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them.
  “tin” to “tix” matches “tip”, “tire”, “title”, etc.
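The four pattern types can be tried against a small sorted vocabulary. This is a sketch under assumptions: the word list and function names are hypothetical, range bounds are treated as inclusive, and the prefix search uses the sentinel character U+FFFF as an upper bound. Prefix and range queries use binary search, while suffix and substring queries fall back to a linear scan, which is where more specialized structures would be needed for efficiency.

```python
from bisect import bisect_left, bisect_right

# Hypothetical vocabulary, kept sorted so prefix and range queries can use binary search.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tin", "tip", "tire", "title", "tix", "velociraptor"])

def prefix_match(prefix):
    lo = bisect_left(VOCAB, prefix)
    hi = bisect_left(VOCAB, prefix + "\uffff")  # just past every word starting with prefix
    return VOCAB[lo:hi]

def suffix_match(suffix):
    return [w for w in VOCAB if w.endswith(suffix)]  # linear scan

def substring_match(pattern):
    return [w for w in VOCAB if pattern in w]  # linear scan

def range_match(low, high):
    # Assumption: both bounds are inclusive.
    return VOCAB[bisect_left(VOCAB, low):bisect_right(VOCAB, high)]

print(prefix_match("anti"))
print(suffix_match("ix"))
print(substring_match("rapt"))
print(range_match("tin", "tix"))
```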