Information Retrieval

Introduction/Overview

Material for these slides obtained from:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/

Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book





CSE 5331/7331 F07 2


Information Retrieval (IR): retrieving desired information from textual data.

Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
  Find all documents about “data mining”.


DB vs IR

Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and need to categorize/group/access books/articles


DB vs IR (cont’d)

Data retrieval
  which docs contain a set of keywords?
  Well defined semantics
  a single erroneous object implies failure!

Information retrieval
  information about a subject or topic
  semantics is frequently loose
  small errors are tolerated

IR system:
  interpret contents of information items
  generate a ranking which reflects relevance
  notion of relevance is most important


Motivation

IR in the last 20 years:
  classification and categorization
  systems and languages
  user interfaces and visualization

Still, area was seen as of narrow interest

Advent of the Web changed this perception once and for all
  universal repository of knowledge
  free (low cost) universal access
  no central editorial board
  many problems though: IR seen as key to finding the solutions!


Basic Concepts

Logical view of the documents

Document representation viewed as a continuum: logical view of docs might shift

[Figure: docs pass through text operations (structure; accents, spacing; stopwords; noun groups; stemming; manual indexing); the resulting logical view ranges from structure and full text to index terms.]


The Retrieval Process

[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flow labels: user need, text, logical view, query, inverted file, retrieved docs, ranked docs, user feedback.]


IR is Fuzzy

[Figure: two panels, Simple and Fuzzy, each showing Accept and Reject regions.]


Indexing

IR systems usually adopt index terms to process queries

Index term:
  a keyword or group of selected words
  any word (more general)

Stemming might be used: connect: connecting, connection, connections

An inverted file is built for the chosen index terms
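The stemming example above (connect: connecting, connection, connections) can be illustrated with a very rough sketch. The suffix list and the name `toy_stem` are assumptions made for this example only; it is not a real stemming algorithm, and practical systems rely on something far more careful, such as Porter's stemmer.

```python
# Toy suffix stripper (an illustrative assumption, not a real stemming algorithm).
SUFFIXES = ("ions", "ion", "ing", "s")  # longer suffixes are tried first

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connecting", "connection", "connections"]
stems = {w: toy_stem(w) for w in words}
print(stems)  # every word maps to the same index term
```

With a shared stem, all four word forms end up under one entry of the inverted file.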


Indexing

[Figure: Docs and Information Need are each mapped to Index Terms (doc, query); matching the two yields the Ranking.]


Inverted Files

There are two main elements:
  Vocabulary: set of unique terms
  Occurrences: where those terms appear

The occurrences can be recorded as term (word) offsets or byte offsets

Term offsets are good for retrieving concepts such as proximity, whereas byte offsets allow direct access to the text

Vocabulary Occurrences (byte offset)

… …


Inverted Files

The vocabulary (the set of indexed terms) is often several orders of magnitude smaller than the document collection (MB vs. GB)

The space consumed by the occurrence lists is not trivial: each time a term appears, its position must be added to that term's list in the inverted file

That may lead to a quite considerable index overhead


Example

Text (the numbers are the starting offsets of the words):

1 6 12 16 18 25 29 36 40 45 54 58 66 70
That house has a garden. The garden has many flowers. The flowers are beautiful

Inverted file:

Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
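A minimal sketch of how such an inverted file could be built is shown below. The stopword list is an assumption, chosen so that the vocabulary comes out as beautiful, flowers, garden, house; the function name `build_inverted_file` is also hypothetical. Offsets are counted from the string exactly as typed here, so values after the first sentence can differ by one from the numbers printed on the slide.

```python
import re
from collections import defaultdict

# Stopword list chosen so the vocabulary matches the example (an assumption).
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_inverted_file(text: str) -> dict[str, list[int]]:
    """Map each non-stopword term to the 1-based offsets at which it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in STOPWORDS:
            # For ASCII text, character offsets and byte offsets coincide.
            index[term].append(match.start() + 1)
    return dict(sorted(index.items()))  # vocabulary in alphabetical order

text = "That house has a garden. The garden has many flowers. The flowers are beautiful"
inverted = build_inverted_file(text)
for term, offsets in inverted.items():
    print(term, offsets)
```

Every occurrence of an indexed term adds one entry to its list, which is where the index overhead comes from.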


Ranking

A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query

A ranking is based on fundamental premises regarding the notion of relevance, such as:
  common sets of index terms
  sharing of weighted terms
  likelihood of relevance

Each set of premises leads to a distinct IR model


Classic IR Models - Basic Concepts

Each document is represented by a set of representative keywords or index terms

An index term is a document word useful for remembering the document's main themes

Usually, index terms are nouns because nouns have meaning by themselves

However, search engines assume that all words are index terms (full text representation)



The importance of the index terms is represented by weights associated with them

  ki - an index term
  dj - a document
  wij - a weight associated with (ki,dj)

The weight wij quantifies the importance of the index term for describing the document contents



t is the total number of index terms
K = {k1, k2, …, kt} is the set of all index terms
wij >= 0 is a weight associated with (ki,dj)
wij = 0 indicates that term does not belong to doc
dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
gi(dj) = wij is a function which returns the weight associated with pair (ki,dj)
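The notation above can be made concrete with a short sketch. The three vocabulary terms are made up, and using raw term counts as the weights wij is an assumption for illustration; the slides do not fix how weights are computed at this point.

```python
# Vocabulary K = {k1, ..., kt}; the terms themselves are made up for the example.
K = ["data", "mining", "retrieval"]

def document_vector(doc_terms: list[str]) -> list[int]:
    """Return dj = (w1j, ..., wtj), using raw term counts as weights (an assumption)."""
    return [doc_terms.count(k) for k in K]

def g(i: int, dj: list[int]) -> int:
    """gi(dj) = wij: the weight of the i-th index term (1-based) in document dj."""
    return dj[i - 1]

dj = document_vector(["data", "mining", "data"])
print(dj)        # weights for (data, mining, retrieval)
print(g(3, dj))  # a weight of 0 means the term does not belong to the doc
```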


The Boolean Model

Simple model based on set theory
Queries specified as Boolean expressions
  precise semantics and neat formalism
Terms are either present or absent. Thus, wij ∈ {0,1}
Consider
  q = ka ∧ (kb ∨ ¬kc)
  qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  qcc = (1,1,0) is a conjunctive component
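The disjunctive normal form can be checked by enumerating all binary weight triples for (ka, kb, kc). The sketch below assumes the query reads ka AND (kb OR NOT kc), which is the form consistent with the three components listed in qdnf; the function names are hypothetical.

```python
from itertools import product

def q(ka: int, kb: int, kc: int) -> bool:
    """The query ka AND (kb OR NOT kc) over binary term weights."""
    return bool(ka and (kb or not kc))

# Each satisfying assignment of (ka, kb, kc) is one conjunctive component.
q_dnf = [bits for bits in product((1, 0), repeat=3) if q(*bits)]
print(q_dnf)

def matches(doc_weights: tuple[int, int, int]) -> bool:
    """A document matches iff its weights for (ka, kb, kc) equal some component."""
    return doc_weights in q_dnf

print(matches((1, 1, 0)), matches((1, 0, 1)))
```

Because a document either equals a conjunctive component or does not, the model gives no partial matches and no ranking.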


The Vector Model

Use of binary weights is too limiting

Non-binary weights provide consideration for partial matches

These term weights are used to compute a degree of similarity between a query and each document

Ranked set of documents provides for better matching


The Vector Model

wij > 0 whenever ki appears in dj

wiq >= 0 associated with the pair (ki,q)

dj = (w1j, w2j, ..., wtj)
q = (w1q, w2q, ..., wtq)

To each term ki is associated a unitary vector i

The unitary vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

The t unitary vectors i form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
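One common way to turn these vectors into a degree of similarity is the cosine of the angle between dj and q. The sketch below assumes that measure, and the weights in `docs` and `query` are invented purely for illustration.

```python
import math

def cosine_similarity(dj: list[float], q: list[float]) -> float:
    """Cosine of the angle between document vector dj and query vector q."""
    dot = sum(wij * wiq for wij, wiq in zip(dj, q))
    norm_d = math.sqrt(sum(w * w for w in dj))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Hypothetical weights over t = 3 index terms.
docs = {"d1": [2.0, 1.0, 0.0], "d2": [0.0, 1.0, 3.0], "d3": [1.0, 1.0, 1.0]}
query = [1.0, 1.0, 0.0]

# Sort documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(ranking)
```

A document that shares only one term with the query still receives a non-zero score, which is the partial matching that binary weights cannot express.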


Query Languages

Keyword Based
Boolean
Weighted Boolean
Context Based (Phrasal & Proximity)
Pattern Matching
Structural Queries


Keyword Based Queries

Basic Queries
  Single word
  Multiple words

Context Queries
  Phrase
  Proximity


Boolean Queries

Keywords combined with Boolean operators:
  OR: (e1 OR e2)
  AND: (e1 AND e2)
  BUT: (e1 BUT e2) Satisfy e1 but not e2

Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set.

Naïve users have trouble with Boolean logic.


Boolean Retrieval with Inverted Indices

Primitive keyword: Retrieve containing documents using the inverted index.

OR: Recursively retrieve e1 and e2 and take union of results.

AND: Recursively retrieve e1 and e2 and take intersection of results.

BUT: Recursively retrieve e1 and e2 and take set difference of results.
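The four cases above map directly onto set operations. The following is a minimal sketch, assuming queries are written as nested `(op, e1, e2)` tuples and that the index maps each keyword to a set of document ids; both the representation and the contents of `INDEX` are made up for the example.

```python
# Toy inverted index: term -> set of document ids (contents are made up).
INDEX = {
    "data": {1, 2, 4},
    "mining": {2, 4, 5},
    "gold": {5},
}

def retrieve(expr) -> set[int]:
    """Evaluate a query given as a keyword string or a nested (op, e1, e2) tuple."""
    if isinstance(expr, str):  # primitive keyword: look it up in the index
        return set(INDEX.get(expr, set()))
    op, e1, e2 = expr
    left, right = retrieve(e1), retrieve(e2)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference
    raise ValueError(f"unknown operator: {op}")

print(retrieve(("AND", "data", "mining")))
print(retrieve(("BUT", ("OR", "data", "mining"), "gold")))
```

BUT only ever filters a set that was already retrieved, so no step has to enumerate every document that lacks a term.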


Phrasal Queries

Retrieve documents with a specific phrase (ordered list of contiguous words)
  “information theory”

May allow intervening stop words and/or stemming.
  “buy camera” matches: “buy a camera”, “buying the cameras”, etc.


Phrasal Retrieval with Inverted Indices

Must have an inverted index that also stores positions of each keyword in a document.

Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions.

Best to start contiguity check with the least common word in the phrase.
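The procedure above can be sketched as follows. This is an illustrative implementation under simple assumptions: whitespace tokenization, no stop word removal or stemming, and a positional index held in nested dictionaries. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs: dict[int, str]):
    """term -> {doc id -> word positions of the term in that document}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase: str) -> set[int]:
    """Ids of documents containing the words of `phrase` contiguously and in order."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Intersect the document lists of all words in the phrase.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    # Start the contiguity check from the least common word.
    rarest = min(range(len(terms)),
                 key=lambda i: sum(len(p) for p in index[terms[i]].values()))
    hits = set()
    for doc_id in candidates:
        for pos in index[terms[rarest]][doc_id]:
            start = pos - rarest  # where the phrase would have to begin
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "information theory is fun",
    2: "theory of information",
    3: "the information theory course",
}
index = build_positional_index(docs)
print(phrase_search(index, "information theory"))
```

Document 2 contains both words and survives the intersection, but it is rejected by the position check.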


Proximity Queries

List of words with specific maximal distance constraints between terms.

Example: “dogs” and “race” within 4 words match “…dogs will begin the race…”

May also perform stemming and/or not count stop words.
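A proximity check needs only word positions. In the sketch below, distance is taken to be the difference between word positions and the order of the two terms is ignored; both are assumptions, since other conventions (for example, counting only intervening words) are equally possible. The function name is hypothetical.

```python
def within_distance(tokens: list[str], term_a: str, term_b: str, max_dist: int) -> bool:
    """True if some occurrences of the two terms are at most `max_dist` positions apart."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(0 < abs(a - b) <= max_dist for a in positions_a for b in positions_b)

tokens = "the dogs will begin the race soon".split()
print(within_distance(tokens, "dogs", "race", 4))  # positions 1 and 5
print(within_distance(tokens, "dogs", "race", 3))
```

To skip stop words, one would remove them from `tokens` before computing positions.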


Pattern Matching

Allow queries that match strings rather than word tokens.

Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.


Simple Patterns

Prefixes: Pattern that matches start of word.
  “anti” matches “antiquity”, “antibody”, etc.

Suffixes: Pattern that matches end of word.
  “ix” matches “fix”, “matrix”, etc.

Substrings: Pattern that matches an arbitrary contiguous sequence of characters.
  “rapt” matches “enrapture”, “velociraptor”, etc.

Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them.
  “tin” to “tix” matches “tip”, “tire”, “title”, etc.
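The four pattern types can be tried out against a small sorted vocabulary. The sketch below uses plain linear scans for prefixes, suffixes, and substrings, which is an assumption made for brevity and not one of the more sophisticated structures needed for efficient retrieval; only the range query exploits the sorted order. The word list comes from the examples above and the function names are hypothetical.

```python
import bisect

# Sorted vocabulary built from the example words.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tip", "tire", "title", "velociraptor"])

def prefix_match(pattern: str) -> list[str]:
    return [w for w in VOCAB if w.startswith(pattern)]

def suffix_match(pattern: str) -> list[str]:
    return [w for w in VOCAB if w.endswith(pattern)]

def substring_match(pattern: str) -> list[str]:
    return [w for w in VOCAB if pattern in w]

def range_match(low: str, high: str) -> list[str]:
    """Words w with low <= w <= high, found by binary search on the sorted list."""
    return VOCAB[bisect.bisect_left(VOCAB, low):bisect.bisect_right(VOCAB, high)]

print(prefix_match("anti"))
print(suffix_match("ix"))
print(substring_match("rapt"))
print(range_match("tin", "tix"))
```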
