Information Retrieval (IR) - TUCpetrakis/courses/multimedia/retrieval.pdf · E.G.M Petrakis...
E.G.M Petrakis Multimedia Information Retrieval
Information Retrieval (IR)
Deals with the representation, storage and retrieval of unstructured data
Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets
Classical IR deals mainly with text
The evolution of multimedia databases and of the web has given new interest to IR
Multimedia IR
A multimedia information system can store and retrieve:
attributes
text
2D grey-scale and color images
1D time series
digitized voice or music
video
Applications
Financial, marketing: stock prices, sales etc.
e.g., find companies whose stock prices move similarly
Scientific databases: sensor data, weather, geological, environmental data
Office automation, electronic encyclopedias, electronic books
Medical databases: X-rays, CT, MRI scans
Criminal investigation: suspects, fingerprints
Personal archives: text and color images
Queries
Formulation of the user’s information need
Free-text query
By example document (e.g., text, image)
e.g., in a collection of color photos find those showing a tree close to a house
Retrieval is based on the understanding of the content of documents and of their components
Goals of Retrieval
Accuracy: retrieve the documents that the user expects in the answer
with as few incorrect answers as possible
all relevant answers are retrieved
Speed: retrieval has to be fast
the system responds in real time
Accuracy of Retrieval
Depends on query criteria, query complexity and specificity
attributes or content criteria?
Depends on what is matched with the query
how document content is represented
typically feature vectors
Depends also on the matching function
e.g., Euclidean distance
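A minimal Python sketch of the matching function named above, assuming documents and queries are already represented as equal-length numeric feature vectors. The function name `euclidean_distance` and the example vectors are illustrative and do not come from the slides.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 2-dimensional feature vectors for a query and a document
query_vector = [0.0, 0.0]
doc_vector = [3.0, 4.0]
print(euclidean_distance(query_vector, doc_vector))
```

The lower the value, the closer the document is to the query in feature space.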
Speed of Retrieval
The query is compared (matched) with all stored documents
The definition of similarity criteria (a similarity or distance function between documents) is an important issue
Matching has to be fast: document matching has to be computationally efficient and sequential searching must be avoided
Indexing: search documents that are likely to match the query
Main Idea
Descriptions are extracted and stored
same or separate storage with documents
Queries address the stored descriptions rather than the documents themselves
Images & video: the majority of stored data
Solution: retrieval by text
text retrieval is a well researched area
most systems work with text, attributes
[Figure: a database storing Doc1-FeatureVector, Doc2-FeatureVector, ..., DocN-FeatureVector]
[Figure: database of descriptions and documents (Doc1, Doc2, ..., DocN), with index and query]
Problem
Text descriptions for images and video contained in documents are not available
e.g., the text in a web site is not always descriptive of every particular image contained in the web site
Two approaches:
human annotations
feature extraction
Human Annotations
Humans insert attributes, captions for images and videos
inconsistent or subjective over time and users (different users do not give the same descriptions)
retrievals will fail if queries are formulated using different keywords or descriptions
time consuming process, expensive or impossible for large databases
Feature Extraction
Features are extracted from audio, image and video
cheaper and replicable approach
consistent descriptions, but:
inexact
low-level features (patterns, colors etc.)
difficult to extract meaning
different techniques for different data
Architecture of an IR System for Images
[Figure: architecture diagram, from Furht et al. 96]
Multimedia IR in Practice
Relies on text descriptions
Non-text IR
there is nothing similar to text-based retrieval systems
automatic IR is sometimes impossible
requires human intervention at some level
significant progress has been made
domain-specific interpretations
Ranking of IR Methods
Ranking in terms of complexity and accuracy: attributes, text, audio, image, video
Combined IR based on two or more data types for more accurate retrievals
Complex data types (e.g., video) are rich sources of information
[Figure: the data types above placed along accuracy and complexity arrows]
Similarity Searching
IR is not an exact process
two documents are never identical!
they can be “similar”
searching must be “approximate”
The effectiveness of IR depends on the
types and correctness of descriptions used
types of queries allowed
user uncertainty as to what they are looking for
efficiency of search techniques
Approximate IR
A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity
Two common types of similarity queries:
range queries: retrieve all documents up to a distance “threshold” T
nearest-neighbor queries: retrieve the k best matches
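The two query types can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are stored as a dictionary from a document id to a feature vector, the distance is Euclidean (`math.dist`, Python 3.8+), and the search is a plain sequential scan without any index. The names `range_query`, `knn_query` and the sample data are hypothetical.

```python
import heapq
import math


def range_query(query, docs, distance, threshold):
    """All documents within distance `threshold` of the query, nearest first."""
    hits = ((distance(query, vec), doc_id) for doc_id, vec in docs.items())
    return sorted((d, doc_id) for d, doc_id in hits if d <= threshold)


def knn_query(query, docs, distance, k):
    """The k documents with the smallest distance from the query, nearest first."""
    hits = ((distance(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, hits)


# Hypothetical collection of 2-dimensional feature vectors
docs = {"d1": (0.0, 0.0), "d2": (3.0, 4.0), "d3": (6.0, 8.0)}
print(range_query((0.0, 0.0), docs, math.dist, threshold=5.0))
print(knn_query((0.0, 0.0), docs, math.dist, k=1))
```

Both functions return (distance, document id) pairs, so the answers are already ordered by similarity to the query.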
Distance - Similarity
Decide whether two documents are similar
distance: the lower the distance, the more similar the documents are
similarity: the higher the similarity, the more similar the documents are
Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is
Retrieval Quality Criteria
Retrieval with as few errors as possible
Two types of errors:
false dismissals or misses: qualifying but not retrieved documents
false positives or false drops: retrieved but not qualifying documents
A good method minimizes both
Ranking quality: retrieve qualifying documents before non-qualifying ones
Evaluation of Retrieval
“Given a collection of documents, a set of queries and human expert’s responses to the above queries, the ideal system will retrieve exactly what the human dictated”
The deviations from the above are measured
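$$\text{recall} = \frac{\text{number of relevant documents retrieved in the answer}}{\text{total number of relevant documents in the collection}}$$
$$\text{precision} = \frac{\text{number of relevant documents retrieved in the answer}}{\text{number of documents retrieved}}$$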
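As a small illustration, precision and recall can be computed from the set of retrieved document ids and the set of ids that the human expert judged relevant. The function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions of this sketch, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents in the answer
    relevant:  ids of the documents judged relevant by the expert
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = hits / len(relevant) if relevant else 0.0       # assumed convention
    return precision, recall


# Hypothetical answer of 4 documents, 2 of which are among the 3 relevant ones
print(precision_recall([1, 2, 3, 4], [2, 4, 5]))
```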
Precision-Recall Diagram
Measure precision-recall for 1, 2, ..., N answers
High precision means few false alarms
High recall means few false dismissals
[Figure: precision-recall diagram; precision and recall axes from 0 to 1 (ticks at 0,5 and 1), with the ideal method marked]
Harmonic Mean
A single measure combining precision and recall
$$F = \frac{2}{\frac{1}{r} + \frac{1}{p}}$$
r: recall, p: precision
F takes values in [0,1]
F → 1 as more retrieved documents are relevant
F → 0 as few retrieved documents are relevant
F is high when both precision and recall are high
F expresses a compromise between precision and recall
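A short Python sketch of the harmonic mean F of recall r and precision p, i.e. 2 divided by (1/r + 1/p). Returning 0.0 when either value is zero is an assumption made here to avoid a division by zero; the function name `harmonic_mean` is illustrative.

```python
def harmonic_mean(r, p):
    """Harmonic mean F of recall r and precision p, both in [0, 1]."""
    if r == 0 or p == 0:
        return 0.0  # assumed convention: F is 0 when either measure is 0
    return 2 / (1 / r + 1 / p)


# F stays low unless both measures are high
print(harmonic_mean(1.0, 0.5))
print(harmonic_mean(0.5, 0.5))
```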
Ranking Quality
The higher the Rnorm, the better the ability of a method to retrieve correct answers before incorrect ones
A human expert evaluates the answer set
Then take answers in pairs: (relevant, irrelevant)
S+: pairs ranked correctly (the relevant entry has been retrieved before the irrelevant one)
S-: pairs ranked incorrectly
S+max: total ranked pairs
$$R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases}$$
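The ranking measure can be sketched in Python as below, under the assumption that the answer set is a strict ranking (no ties) and that the expert's judgments are given as a list of booleans in rank order. In that case every (relevant, irrelevant) pair is ranked either correctly or incorrectly, so the total number of ranked pairs is S+ plus S-. The function name `r_norm` is hypothetical.

```python
def r_norm(judgments):
    """Rnorm of a ranked answer set.

    judgments: list of booleans in rank order, True if the expert
    judged the answer at that rank relevant.
    """
    s_plus = 0       # pairs with the relevant answer ranked first
    s_minus = 0      # pairs with the irrelevant answer ranked first
    relevant_seen = 0
    irrelevant_seen = 0
    for is_relevant in judgments:
        if is_relevant:
            s_minus += irrelevant_seen  # irrelevant answers ranked before this one
            relevant_seen += 1
        else:
            s_plus += relevant_seen     # relevant answers ranked before this one
            irrelevant_seen += 1
    s_max = s_plus + s_minus            # total ranked pairs (assumes no ties)
    if s_max == 0:
        return 1.0
    return 0.5 * (1 + (s_plus - s_minus) / s_max)


print(r_norm([True, True, False]))   # all relevant answers come first
print(r_norm([True, False, True]))   # one pair correct, one pair incorrect
```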
Searching Method
Sequential scanning: the query is matched with all stored documents
slow if the database is large or if matching is slow
The documents must be indexed
hashing, inverted files, B-trees, R-trees
different data types are indexed and searched separately
Goals of IR
Summarizing, there are two general goals common to all IR systems
effectiveness: IR must be accurate (retrieves what the user expects to see in the answer)
efficiency: IR must be fast (faster than sequential scanning)
Query Formulation
Must be flexible and convenient
SQL queries: constraints on all attributes and data types
may become very complex
Queries by example: e.g., by providing an example document or image
Browsing: display headers, summaries, miniatures etc., possibly for refining the retrieved results
Access Methods for Text
Access methods for text are interesting for at least 3 reasons
multimedia documents contain text, e.g., images often have captions
text retrieval has several applications in itself (library automation, web search etc.)
text retrieval research has led to useful ideas like the vector space model, information filtering, relevance feedback etc.
Text Queries
Single or multiple keyword queries
context queries: phrases, word proximity
boolean: keywords with and, or, not
Natural language: free text queries
Structured search takes also text structure into account and can be:
flat or hierarchical for searching in titles, paragraphs, sections, chapters
hypertext: combines content-connectivity
Keyword Matching
The whole collection is searched
no preprocessing
no space overhead
updates are easy
The user specifies a string (regular expression) and the text is parsed using a finite state automaton
KMP, BMH algorithms
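As an illustration of scanning the text without any preprocessing of the collection, here is a minimal Python sketch of the Boyer-Moore-Horspool (BMH) idea for a plain string pattern. It does not handle regular expressions, and the function name `horspool_search` is an assumption of this sketch.

```python
def horspool_search(text, pattern):
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift table: for each character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if text[pos:pos + m] == pattern:
            return pos
        # Slide the window according to the text character aligned with the
        # last pattern position; unseen characters allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("multimedia information retrieval", "information"))
```

Only the pattern is preprocessed (the shift table), which matches the slide's point that the collection itself needs no index and updates stay easy.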
Error Tolerant Methods
Methods that can tolerate errors
Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from the query
Wu, Manber 92; Baeza-Yates, Gonnet 92
Extension: regular expressions built up from strings and operators: “pro (blem | tein) (s | ε) | (0 | 1 | 2)”
Error Counting Methods
Retrieve similar words or phrases
e.g., misspelled words or words with different pronunciation
editing distance: a numerical estimate of the similarity between 2 strings
phonetic coding: search words with similar pronunciation (Soundex, Phonix)
N-grams: count common N-length substrings
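The N-gram idea can be sketched in a few lines of Python: split each word into its N-length substrings and count how many they share. Counting repeated N-grams as a multiset intersection is one possible reading of "count common N-length substrings" and is an assumption here, as are the function names.

```python
from collections import Counter


def ngrams(word, n=2):
    """Multiset of the N-length substrings of word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def common_ngrams(a, b, n=2):
    """Number of N-grams shared by a and b (multiset intersection)."""
    return sum((ngrams(a, n) & ngrams(b, n)).values())


# "cordis" and the misspelling "codis" share the bigrams co, di, is
print(common_ngrams("cordis", "codis"))
```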
Editing Distance
Minimum number of edit operations that are needed to transform the first string to the second
edit operation: insert, delete, substitute and (sometimes) transposition of characters
d(si,tj): distance between characters
usually d(si,tj) = 1 if si ≠ tj and 0 otherwise
edit(cordis, codis) = 1: r is deleted
edit(cordis, codris) = 1: transposition of rd
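Damerau-Levenshtein Algorithm
Basic recurrence relation (DP algorithm)
$$\mathrm{edit}(0,0) = 0, \quad \mathrm{edit}(i,0) = i, \quad \mathrm{edit}(0,j) = j$$
$$\mathrm{edit}(i,j) = \min \begin{cases} \mathrm{edit}(i-1,j) + 1 \\ \mathrm{edit}(i,j-1) + 1 \\ \mathrm{edit}(i-1,j-1) + d(s_i,t_j) \\ \mathrm{edit}(i-2,j-2) + d(s_i,t_{j-1}) + d(s_{i-1},t_j) + 1 \end{cases}$$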
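A direct Python transcription of the recurrence, offered as a sketch rather than a tuned implementation. It assumes unit costs (d is 0 for equal characters and 1 otherwise) and applies the transposition term only when both i and j are at least 2. The function name `edit_distance` is illustrative.

```python
def edit_distance(s, t):
    """Edit distance with insert, delete, substitute and transposition."""
    n, m = len(s), len(t)

    def d(a, b):
        # Assumed unit character distance: 0 if equal, 1 otherwise
        return 0 if a == b else 1

    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        edit[i][0] = i
    for j in range(m + 1):
        edit[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                edit[i - 1][j] + 1,                           # delete s_i
                edit[i][j - 1] + 1,                           # insert t_j
                edit[i - 1][j - 1] + d(s[i - 1], t[j - 1]),   # substitute
            )
            if i >= 2 and j >= 2:
                # transposition of the last two characters
                best = min(
                    best,
                    edit[i - 2][j - 2]
                    + d(s[i - 1], t[j - 2])
                    + d(s[i - 2], t[j - 1])
                    + 1,
                )
            edit[i][j] = best
    return edit[n][m]


print(edit_distance("cordis", "codis"))   # r is deleted
print(edit_distance("cordis", "codris"))  # transposition of rd
```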
Matching Algorithm
Compute D(A,B)
#A, #B: lengths of A, B
0: null symbol
R: cost of an edit operation
D(0,0) = 0
for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i] → 0);
for j = 1 to #B: D(0,j) = D(0,j-1) + R(0 → B[j]);
for i = 1 to #A
  for j = 1 to #B {
    1. m1 = D(i,j-1) + R(0 → B[j]);
    2. m2 = D(i-1,j) + R(A[i] → 0);
    3. m3 = D(i-1,j-1) + R(A[i] → B[j]);
    4. D(i,j) = min{m1, m2, m3};
  }
Example: A = aaabcccd (rows, index i), B = abbcdd (columns, index j)
i     j:  0  1  2  3  4  5  6
             a  b  b  c  d  d
0         0  1  2  3  4  5  6
1  a      1  0  1  2  3  4  5
2  a      2  1  1  2  3  4  5
3  a      3  2  2  2  3  4  5
4  b      4  3  2  2  3  4  5
5  c      5  4  3  3  2  3  4
6  c      6  5  4  4  3  3  4
7  c      7  6  5  5  4  4  4
8  d      8  7  6  6  5  4  4
Row 0 and column 0: initialization cost
D(8,6) = 4: total cost
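The matching algorithm can be written in Python as below. This is a sketch under assumptions: the cost function R defaults to unit costs (0 for equal symbols, 1 for any insertion, deletion or substitution), the null symbol is represented by `None`, and the two strings are the ones read off the rows and columns of the table (aaabcccd and abbcdd). The names `matching_table` and `unit_cost` are hypothetical.

```python
NULL = None  # stands for the null symbol 0 of the slides


def unit_cost(a, b):
    """Assumed cost R(a -> b): 0 if the symbols are equal, 1 otherwise."""
    return 0 if a == b else 1


def matching_table(A, B, R=unit_cost):
    """Return the table D, where D[i][j] is the cost of matching A[:i] with B[:j]."""
    D = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):          # initialization: delete A[1..i]
        D[i][0] = D[i - 1][0] + R(A[i - 1], NULL)
    for j in range(1, len(B) + 1):          # initialization: insert B[1..j]
        D[0][j] = D[0][j - 1] + R(NULL, B[j - 1])
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            m1 = D[i][j - 1] + R(NULL, B[j - 1])            # insertion
            m2 = D[i - 1][j] + R(A[i - 1], NULL)            # deletion
            m3 = D[i - 1][j - 1] + R(A[i - 1], B[j - 1])    # substitution or match
            D[i][j] = min(m1, m2, m3)
    return D


D = matching_table("aaabcccd", "abbcdd")
for row in D:
    print(row)
print("total cost:", D[-1][-1])
```

Passing a different function for `R` gives weighted edit operations without changing the dynamic programming loop.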
Classical IR Models
Boolean model
simple, based on set theory
queries as Boolean expressions
Vector space model
queries and documents as vectors in term space
Probabilistic model
a probabilistic approach
Text Indexing
A document is represented by a set of index terms that summarize document contents
adjectives, adverbs, connectives are less useful
mainly nouns (lexicon look-up)
requires text pre-processing (off-line)
Indexing
[Figure: an index over the data repository]
Text Preprocessing
Extract index terms from text
Processing stages:
word separation, sentence splitting
change terms to a standard form (e.g., lowercase)
eliminate stop-words (e.g., and, is, the, …)
reduce terms to their base form (e.g., eliminate prefixes, suffixes)
construct mapping between terms and documents (indexing)
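The first four stages can be sketched in Python as follows. The stop-word list and the suffix list are tiny illustrative placeholders, and the suffix stripping is a crude stand-in for a real stemmer; none of these names or lists come from the slides.

```python
import re

# Illustrative stop-word list, not a standard one
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}
# Crude suffix stripping that stands in for a real stemming algorithm
SUFFIXES = ("ing", "s")


def strip_suffix(term):
    """Remove the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def index_terms(text):
    """Word separation, lowercasing, stop-word removal, suffix stripping."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]


print(index_terms("The indexing of images and texts"))
```

The resulting term lists are what the last stage (the mapping between terms and documents) is built from.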
Text Preprocessing Chart
[Figure: text preprocessing chart, from Baeza-Yates & Ribeiro-Neto, 1999]
Text Indexing Methods
Main indexing methods:
inverted files
signature files
bitmaps
Size of index may exceed size of actual collection
compress index to reduce storage
typically 40% of size of collection
Inverted Files
Dictionary: terms stored alphabetically
Posting list: pointers to documents containing the term
Posting info: document id, frequency of occurrence, location in document, etc.
Dictionary indexed by B-trees, tries or binary searched
Pros: fast and easy to implement
Cons: large space overhead (up to 300%)
Inverted Index
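[Figure: the index holds the terms in alphabetical order (άγαλμα "statue", αγάπη "love", …, δουλειά "work", …, πρωί "morning", …, ωκεανός "ocean"); each term points to a posting list of pairs, e.g., (1,2)(3,4), (4,3)(7,5), (10,3), which refer to the documents 1, 2, …, 11, …]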
Query Processing
1. Parse query and extract query terms
2. Dictionary lookup
binary search
3. Get postings from inverted file
one term at a time
4. Accumulate postings
record matching document ids, frequencies, etc.
compute weights
5. Rank answers by weights (e.g., frequencies)
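The steps above can be sketched in Python with a small in-memory inverted file. Two simplifications are assumed: the dictionary is a Python `dict` (a hash lookup instead of the binary search on a sorted dictionary mentioned in step 2), and the weight of a document is simply the sum of the frequencies of the query terms it contains. The function names and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: dict doc_id -> list of index terms.

    Returns term -> posting list of (doc_id, frequency) pairs.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id]).items():
            index[term].append((doc_id, freq))
    return dict(index)


def process_query(index, query_terms):
    """Accumulate postings per document and rank by total frequency."""
    weights = Counter()
    for term in query_terms:                      # one term at a time
        for doc_id, freq in index.get(term, []):  # postings of this term
            weights[doc_id] += freq
    # highest weight first, ties broken by document id
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: ["image", "retrieval", "image"],
    2: ["text", "retrieval"],
    3: ["video"],
}
index = build_inverted_index(docs)
print(index["retrieval"])
print(process_query(index, ["image", "retrieval"]))
```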
Signature Files
A filter for eliminating most of the non-qualifying documents
signature: short hash-coded representation of document or query
the signatures are stored and searched sequentially
search returns all qualifying documents plus some false alarms
the answer is searched to eliminate false alarms
Superimposed Coding
Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0
size of F affects the false drop rate
bit patterns are OR-ed to form the doc signature
find documents having 1 in the same bits as the query
word                  signature
data                  001 000 110 010
base                  000 010 101 001
document signature    001 010 111 011
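A small Python sketch of superimposed coding. The first lines reproduce the OR of the two word signatures from the example; the rest shows one possible way to derive word signatures by hashing. Using MD5 as the hash and the parameter values F = 64, m = 3 are assumptions of this sketch, and hash collisions may leave fewer than m bits set in a word signature.

```python
import hashlib

# The example from the slide: OR-ing the word signatures gives the document signature
data_sig = 0b001000110010
base_sig = 0b000010101001
print(format(data_sig | base_sig, "012b"))


def word_signature(word, F=64, m=3):
    """F-bit pattern with up to m bits set, chosen by hashing the word."""
    sig = 0
    for k in range(m):
        digest = hashlib.md5(f"{k}:{word}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % F)
    return sig


def document_signature(words, F=64, m=3):
    """OR of the signatures of all words in the document."""
    sig = 0
    for word in words:
        sig |= word_signature(word, F, m)
    return sig


def may_contain(doc_sig, query_words, F=64, m=3):
    """True if the document has 1 in every bit position set by the query.

    A True result can still be a false drop, so the document itself
    has to be checked afterwards.
    """
    query_sig = document_signature(query_words, F, m)
    return doc_sig & query_sig == query_sig


doc_sig = document_signature(["data", "base"])
print(may_contain(doc_sig, ["data"]))
```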
Bitmaps
For every term, a bit vector is stored
each bit corresponds to a document
set to 1 if term appears in document, 0 otherwise
e.g., “word” 10100: the “word” appears in documents 1 and 3 in a collection of 5 documents
Very efficient for Boolean queries
retrieve bit-vectors for each query term
combine bit vectors with Boolean operators
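A minimal Python sketch of bitmap indexing and Boolean query evaluation. Bit vectors are kept as plain lists of 0/1 for readability (a real system would pack them into machine words); the helper names and the five sample documents are illustrative assumptions.

```python
def build_bitmaps(docs):
    """docs: list of term sets; document k is docs[k - 1].

    Returns term -> bit vector with one bit per document.
    """
    terms = set().union(*docs)
    return {t: [1 if t in d else 0 for d in docs] for t in terms}


def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]


def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bit_not(a):
    return [1 - x for x in a]


def matching_docs(bits):
    """Document numbers (1-based) whose bit is set."""
    return [k + 1 for k, bit in enumerate(bits) if bit]


# Hypothetical collection of 5 documents; "word" appears in documents 1 and 3
docs = [{"word", "text"}, {"text"}, {"word"}, {"image"}, {"text", "image"}]
bitmaps = build_bitmaps(docs)
print(bitmaps["word"])
# Boolean query: text AND NOT word
print(matching_docs(bit_and(bitmaps["text"], bit_not(bitmaps["word"]))))
```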
Comparison of Text Indexing Methods
Bitmaps:
pros: intuitive, easy to implement and to process
cons: space overhead
Signature files:
pros: easy to implement, compact index, no lexicon
cons: many unnecessary accesses to documents
Inverted files:
pros: used by most search engines
cons: large space overhead, index can be compressed
References
“Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996
“Modern Information Retrieval”, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999
“Automatic Text Processing”, G. Salton, Addison Wesley, 1989
Information Retrieval Links: http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html