2 Information Retrieval
Prof. Dr. Knut Hinkelmann 2Information Retrieval and Knowledge Organisation - 2 Information Retrieval
Motivation
Information Retrieval has been an active field for many years
It was long seen as an area of narrow interest
The advent of the Web changed this perception:
universal repository of knowledge
free (low-cost) universal access
no central editorial board
many problems though: IR is seen as key to finding the solutions!
Motivation
Information Retrieval: representation, storage, organization of, and access to information items
Emphasis on the retrieval of information (not data)
Focus is on the user's information need
Example of an information need: find all documents containing information about car accidents which happened in Vienna and had people injured
The information need is expressed as a query
Generic Schema of an Information System
Comparison (Ranking)
Information Retrieval systems do not search through the documents but through the representation (also called index, meta-data or description).
source: (Ferber 2004)
representation of resources (index/meta-data)
representation of information need (query)
user
information resources
Example

Information need: documents containing information about accidents with heavy vehicles in Vienna

Query: accident heavy vehicles vienna

D1: Heavy accident
Because of a heavy car accident 4 people died yesterday morning in Vienna.

D2: More vehicles
In this quarter more cars became registered in Vienna.

D3: Truck causes accident
In Vienna a trucker drove into a crowd of people. Four people were injured.

Expected result: document D3
but: not all terms of the query occur in the document; the occurring query terms („accident", „vienna") also occur in D1
Retrieval System

(Diagram: the indexing and the retrieval (search) phases of a retrieval system)
Each document represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document main themes
Ranking
weighted documents
set of documents
index
assign IDs
store documents and IDs
document resources
indexing
terms
Text
query processing
query
terms
interface
answer: sorted list of IDs
information need
documents
the index is stored in an efficient system or data structure
queries are answered using the index
with the ID the document can be retrieved
Indexing

manual indexing – key words
user specifies key words that he/she assumes useful
Usually, key words are nouns, because nouns have meaning by themselves
there are two possibilities:
1. user can assign any terms
2. user can select from a predefined set of terms (controlled vocabulary)

automatic indexing – full text search
search engines assume that all words are index terms (full text representation)
system generates index terms from the words occurring in the text
Automatic Indexing: 1. Decompose a Document into Terms

Rules determine how texts are decomposed into terms by defining separators like punctuation marks, blanks or hyphens

Additional preprocessing, e.g.
exclude specific strings (stop words, numbers)
generate normal form
stemming
substitute characters (e.g. upper case – lower case, Umlaut)
D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles in this quarter more cars became registered in vienna
D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured
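The decomposition step above can be sketched in a few lines of Python. This is a minimal illustration, assuming the separators named on the slide (punctuation marks, blanks, hyphens) and lower-casing as the normal form; a real indexer would apply further preprocessing.

```python
import re

def decompose(text):
    """Split a text into terms at separators (whitespace, punctuation,
    hyphens) and normalise them to lower case."""
    return [t for t in re.split(r"[\s.,;:!?\-]+", text.lower()) if t]

d3 = "Truck causes accident. In Vienna a trucker drove into a crowd of people."
print(decompose(d3))
```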
Automatic Indexing: 2. Index Represented as an Inverted List

For each term: list of documents in which the term occurs
additional information can be stored with each document, like
frequency of occurrence
positions of occurrence
Term: Document IDs
a: D1, D3
accident: D1, D3
became: D2
because: D1
car: D1
cars: D2
died: D1
heavy: D1
in: D1, D2, D3
more: D2
of: D1
people: D1, D3
quarter: D2
registered: D2
truck: D3
vehicles: D2
…
An inverted list is similar to the index of a book
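The construction of such an inverted list can be sketched as follows; the three example documents are given directly as strings, and terms are obtained by a plain whitespace split for brevity.

```python
from collections import defaultdict

docs = {
    "D1": "heavy accident because of a heavy car accident 4 people died yesterday morning in vienna",
    "D2": "more vehicles in this quarter more cars became registered in vienna",
    "D3": "truck causes accident in vienna a trucker drove into a crowd of people four people were injured",
}

inverted = defaultdict(set)          # term -> set of document IDs
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

print(sorted(inverted["accident"]))  # → ['D1', 'D3']
```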
Index as Inverted List with Frequency
term: (document, frequency)
a: (D1,1) (D3,2)
accident: (D1,2) (D3,1)
became: (D2,1)
because: (D1,1)
car: (D1,1)
cars: (D2,1)
died: (D1,1)
heavy: (D1,2)
in: (D1,1) (D2,1) (D3,1)
more: (D2,1)
of: (D1,1)
people: (D1,1) (D3,2)
quarter: (D2,1)
registered: (D2,1)
truck: (D3,1)
vehicles: (D2,1)
…
In this example the inverted list contains the document identifier and the frequency of the term in the document.
Problems of Information Retrieval

Word form: a word can occur in different forms, e.g. singular or plural.
Example: a query for „car" should also find documents containing the word „cars"

Meaning: a single term can have different meanings; on the other hand, the same meaning can be expressed using different terms.
Example: when searching for „car", documents containing „vehicle" should also be found.

Wording, phrases: the same issue can be expressed in various ways.
Example: searching for „motorcar" should also find documents containing „motorized car"
Word Forms

Flexion: conjugation and declension of a word
car – cars
run – ran – running

Derivations: words having the same stem
form – format – formation

Compositions
information management – management of information
In German, compositions are written as single words, sometimes with hyphen: Informationsmanagement, Informations-Management
Word Meaning and Phrases

Dealing with words having the same or similar meaning:

Synonyms: record – file – dossier; seldom – not often

Variants in spelling (e.g. BE vs. AE): organisation – organization; night – nite

Abbreviations: UN – United Nations

Polysemes: words with multiple meanings, e.g. Bank
2.1 Dealing with Word Forms and Phrases

We distinguish two ways to deal with word forms and phrases:

Indexing without preprocessing
all occurring word forms are included in the index
different word forms are unified at search time
string operations

Indexing with preprocessing
unification of word forms during indexing
terms are normal forms of occurring word forms
index is largely independent of the concrete formulation of the text
computer-linguistic approach
2.1.1 Indexing Without Preprocessing

Index: contains all the word forms occurring in the document

Query: searching for specific word forms is possible (e.g. searching for „cars" but not for „car")

To search for different word forms, string operations can be applied:

Operators for truncation and masking, e.g.
? covers exactly one character
* covers an arbitrary number of characters

Context operators, e.g.
[n] exact distance between terms
<n> maximal distance between terms
Index Without Preprocessing and Query
Query: vehicle? car? people

Term: Document IDs
a: D1, D3
accident: D1, D3
became: D2
because: D1
car: D1
cars: D2
died: D1
heavy: D1
in: D1, D2, D3
more: D2
of: D1
people: D1, D3
quarter: D2
registered: D2
truck: D3
vehicles: D2
…
Truncation and Masking: Searching for Different Word Forms
Truncation: Wildcards cover characters at the beginning and end of words –prefix or suffixschreib* finds schreiben, schreibt, schreibst, schreibe,…
??schreiben finds anschreiben, beschreiben, but not verschreiben
Masking deals with characters in words – in particular in German, declensions and conjugation affect not only suffix and prefixschr??b* can find schreiben, schrieb
h??s* can find Haus, Häuser
Disadvantage: With truncation and masking not only the intended words are foundschr??b* also finds schrauben
h??s* also finds Hans, Hanse, hausen, hassenand also words in other languages like horse
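One common way to implement these operators is to translate them into regular expressions and match them against the word forms in the index. A minimal sketch, assuming exactly the semantics above (? covers one character, * an arbitrary number):

```python
import re

def wildcard_to_regex(pattern):
    """Translate truncation/masking operators into an anchored regex:
    '?' covers exactly one character, '*' an arbitrary number."""
    escaped = re.escape(pattern).replace(r"\?", ".").replace(r"\*", ".*")
    return re.compile("^" + escaped + "$")

terms = ["schreiben", "beschreiben", "verschreiben", "schrauben", "schrieb"]
rx = wildcard_to_regex("schr??b*")
print([t for t in terms if rx.match(t)])  # → ['schreiben', 'schrauben', 'schrieb']
```

Note that the match for "schrauben" reproduces the disadvantage mentioned on the slide: the pattern matches more words than intended.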
Context Operators
Context operators allow searching for variations of text phrases
exact word distance: Bezug [3] Telefonat finds „Bezug nehmend auf unser Telefonat"

maximal word distance: text <2> retrieval finds „text retrieval" and „text and fact retrieval"
For context operators to be applicable, the positions of the words must be stored in the index
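A positional index of this kind can be sketched as follows. The distance semantics is an assumption here: consistent with the slide's example (text <2> retrieval matching „text and fact retrieval"), <n> is interpreted as at most n words between the two terms.

```python
def positions(text):
    """Build a positional index: term -> list of word positions."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def within_distance(index, t1, t2, n):
    """True if some occurrence of t2 follows t1 with at most n words
    in between (interpretation of the <n> operator)."""
    return any(p2 > p1 and p2 - p1 - 1 <= n
               for p1 in index.get(t1, [])
               for p2 in index.get(t2, []))

idx = positions("text and fact retrieval")
print(within_distance(idx, "text", "retrieval", 2))  # → True
```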
Indexing Without Preprocessing

Efficiency: efficient indexing, but overhead at retrieval time to apply string operators

Word forms: the user has to codify all possible word forms and phrases in the query using truncation and masking operators; no support is given by the search engine; the retrieval engine is language-independent

Phrases: variants in text phrases can be coded using context operators
2.1.2 Preprocessing of the Index – Computer-linguistic Approach

Each document is represented by a set of representative keywords or index terms

An index term is a document word useful for remembering the document's main themes

Index contains standard forms of useful terms:
1. Restrict allowed terms
2. Normalisation: map terms to a standard form
Restricting Allowed Index Terms

Objective: increase efficiency and effectivity by neglecting terms that do not contribute to the assessment of a document's relevance

There are two possibilities to restrict allowed index terms:
1. Explicitly specify allowed index terms (controlled vocabulary)
2. Specify terms that are not allowed as index terms (stop words)
Stop Words

Stop words are terms that are not stored in the index. Candidates for stop words are:

words that occur very frequently
a term occurring in every document is useless as an index term, because it does not tell anything about which document the user might be interested in
a word which occurs in only 0.001% of the documents is quite useful, because it narrows down the space of documents which might be of interest for the user

words with no/little meaning

terms that are not words (e.g. numbers)

Examples:
General: articles, conjunctions, prepositions, auxiliary verbs (to be, to have) occur very often and in general have no meaning as search criteria
application-specific stop words are also possible
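Stop word elimination before indexing can be sketched as follows; the stop word list here is a tiny illustrative sample, not a complete one, and numbers are dropped as suggested above.

```python
# Illustrative sample only; real stop word lists are much longer and
# may contain application-specific entries.
STOP_WORDS = {"a", "of", "in", "this", "because", "were", "into"}

def index_terms(text):
    """Keep only terms that are neither stop words nor numbers."""
    return [t for t in text.lower().split()
            if t not in STOP_WORDS and not t.isdigit()]

print(index_terms("Because of a heavy car accident 4 people died yesterday morning in Vienna"))
```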
Normalisation of Terms

There are various possibilities to compute standard forms:
N-grams
stemming: removing suffixes or prefixes
N-Grams

Index: sequences of characters of length N

Example: „persons"
3-grams (N=3): per, ers, rso, son, ons
4-grams (N=4): pers, erso, rson, sons

N-grams can also cross word boundaries
Example: „persons from switzerland"
3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and
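N-gram generation is a simple sliding window over the characters; a minimal sketch, replacing word boundaries by '_' when grams may cross them, as in the example above:

```python
def ngrams(text, n, cross_words=False):
    """Return all character sequences of length n. If cross_words is True,
    spaces are kept (as '_') so that grams cross word boundaries."""
    s = text.replace(" ", "_") if cross_words else text
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("persons", 3))  # → ['per', 'ers', 'rso', 'son', 'ons']
print(ngrams("persons", 4))  # → ['pers', 'erso', 'rson', 'sons']
```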
Stemming

Stemming: remove suffixes and prefixes to find a common stem, e.g.
remove -ing and -ed for verbs
remove plural -s for nouns

There are a number of exceptions, e.g.
-ing and -ed may belong to a stem, as in red or ring
irregular verbs like go – went – gone, run – ran – run

Approaches for stemming:
rule-based approach
lexicon-based approach
Rules for Stemming in English

Kuhlen (1977) derived a rule set for stemming of most English words:

Nr.  Ending  Replacement  Condition
1    ies     y
2    XYes    XY           XY = Co, ch, sh, ss, zz or Xx
3    XYs     XY           XY = XC, Xe, Vy, Vo, oa or ea
4    ies'    y
5    Xes'    X
6    Xs'     X
7    X's     X
8    X'      X
9    XYing   XY           XY = CC, XV, Xx
10   XYing   XYe          XY = VC
11   ied     y
12   XYed    XY           XY = CC, XV, Xx
13   XYed    XYe          XY = VC

X and Y are any letters, C stands for a consonant, V stands for any vowel

Source: (Ferber 2003)
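The flavour of such a rule set can be sketched in a few lines. This is only a fragment: it implements suffix stripping in the spirit of rules 1 and 11 plus simple -ing/-ed/-s removal, omits the full X/Y/C/V condition checks, and uses a minimum stem length as a crude guard against the exceptions mentioned earlier (red, ring).

```python
def stem(word):
    """Very simplified rule-based stemmer: strip the first matching
    suffix, but only if a stem of at least 3 characters remains."""
    for ending, replacement in [("ies", "y"), ("ied", "y"),
                                ("ing", ""), ("ed", ""), ("s", "")]:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: len(word) - len(ending)] + replacement
    return word

print([stem(w) for w in ["cities", "cars", "registered", "ring"]])
# → ['city', 'car', 'register', 'ring']
```

The length guard keeps "ring" intact, illustrating why real stemmers need conditions on the remaining stem, not just on the suffix.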
Problems for Stemming

In English, a small number of rules covers most of the words

In German it is more difficult, because the stem also changes for many words
insertion of Umlauts, e.g. Haus – Häuser
new prefixes, e.g. laufen – gelaufen
separation/retaining of prefix, e.g.
mitbringen – er brachte den Brief mit
überbringen – er überbrachte den Brief
irregular insertion of linking elements (Fugen) when building composita: Schwein-kram, Schwein-s-haxe, Schwein-e-braten

These problems cannot be easily dealt with by general rules operating only on strings
Lexicon-based Approaches for Stemming

Principal idea: a lexicon contains stems for word forms

complete lexicon: for each possible form the stem is stored
persons – person
went – go
running – run
going – go
ran – run
gone – go

word stem lexicon: for each stem all the necessary data are stored to derive all word forms
distinction of different flexion classes
specification of anomalies
Example: to compute the stem of „Flüssen", the last characters are removed successively and the Umlaut is exchanged until a valid stem is found (Lezius 1995)

Case/Ending:  –         n          en         sen        …
normal:       Flüssen-  Flüsse-n   Flüss-en   Flüs-sen   …
Umlaut:       Flussen-  Flusse-n   Fluss-en   Flus-sen   …

Source: (Ferber 2003)
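The complete-lexicon approach amounts to a plain table lookup; a minimal sketch, using the word-form/stem pairs from the slide and falling back to the form itself for unknown words:

```python
# Complete lexicon: every word form is mapped directly to its stem.
LEXICON = {
    "persons": "person", "went": "go", "running": "run",
    "going": "go", "ran": "run", "gone": "go",
}

def lexicon_stem(word):
    """Look up the stem; unknown forms are returned unchanged."""
    return LEXICON.get(word.lower(), word.lower())

print([lexicon_stem(w) for w in ["Persons", "went", "cars"]])
# → ['person', 'go', 'cars']
```

The lookup handles irregular forms (went, gone) that no string rule can cover, at the cost of having to list every form.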
Index with Stemming and Stop Word Elimination
Terms: Document IDs
accident: D1, D3
car: D1, D2
cause: D3
crowd: D3
die: D1
drive: D3
four: D3
heavy: D1
injur: D3
more: D2
morning: D1
people: D1, D3
quarter: D2
register: D2
truck: D3
trucker: D3
vehicle: D2
vienna: D1, D2, D3
yesterday: D1
Index:
D1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles in this quarter more cars became registered in vienna
D3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured
2.2 Classical Information Retrieval Models

Classical models:
Boolean model
Vector space model
Probabilistic model

Alternative models:
user preferences
associative search
social filtering
Classic IR Models – Basic Concepts

Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents

The importance of the index terms is represented by weights associated with them

Let
ti be an index term
dj be a document
wij be a weight associated with (ti, dj)

The weight wij quantifies the importance of the index term for describing the document contents

(Stop words can be regarded as terms where wij = 0 for every document)

(Baeza-Yates & Ribeiro-Neto 1999)
Classic IR Models – Basic Concepts

ti is an index term
dj is a document
n is the total number of documents
T = (t1, t2, …, tk) is the set of all index terms
wij >= 0 is a weight associated with (ti, dj)
wij = 0 indicates that the term does not belong to the document
vec(dj) = (w1j, w2j, …, wkj) is a weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ti, dj)
fi is the number of documents containing term ti

source: teaching material of Ribeiro-Neto
Index Vectors as a Matrix

The vectors vec(dj) = (w1j, w2j, …, wkj) associated with the documents dj can be represented as a matrix

Each column represents a document vector vec(dj) = (w1j, w2j, …, wkj); the document dj contains a term ti if wij > 0

Each row represents a term vector tvec(ti) = (wi1, wi2, …, win); the term ti is in document dj if wij > 0
d1 d2 d3 d4
t1 w1,1 w1,2 w1,3 w1,4
t2 w2,1 w2,2 w2,3 w2,4
t3 w3,1 w3,2 w3,3 w3,4
...
tn wn,1 wn,2 wn,3 wn,4
Boolean Document Vectors
d1 d2 d3
           d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0
d1: heavy accident because of a heavy car accident 4 people died yesterday morning in vienna
d2: more vehicles in this quarter more cars became registered in vienna
d3: Truck causes accident in vienna a trucker drove into a crowd of people four people were injured
2.2.1 The Boolean Model

Simple model based on set theory
precise semantics
neat formalism

Binary index: terms are either present or absent, thus wij ∈ {0,1}

Queries are specified as Boolean expressions using the operators AND (∧), OR (∨), and NOT (¬), e.g.
q = ta ∧ (tb ∨ ¬tc)
(vehicle OR car) AND accident
Boolean Retrieval Function

The retrieval function can be defined recursively:

R(ti, dj) = TRUE, if wij = 1 (i.e. ti is in dj)
R(ti, dj) = FALSE, if wij = 0 (i.e. ti is not in dj)
R(q1 AND q2, dj) = R(q1, dj) AND R(q2, dj)
R(q1 OR q2, dj) = R(q1, dj) OR R(q2, dj)
R(NOT q, dj) = NOT R(q, dj)

The Boolean function computes only the values 0 or 1, i.e. Boolean retrieval classifies documents into two categories:
relevant (R = 1)
irrelevant (R = 0)
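The recursive retrieval function translates directly into code. A minimal sketch, representing each document by its set of index terms (a simplified subset of the example collection) and queries as nested tuples, e.g. ("AND", ("OR", "vehicle", "car"), "accident"):

```python
# Simplified boolean document vectors (sets of present terms).
vectors = {
    "d1": {"accident", "car", "heavy", "vienna"},
    "d2": {"car", "vehicle", "vienna"},
    "d3": {"accident", "truck", "vienna"},
}

def R(query, doc):
    """Recursive boolean retrieval function over nested-tuple queries."""
    if isinstance(query, str):                 # base case: a single term
        return query in vectors[doc]
    op = query[0]
    if op == "AND":
        return R(query[1], doc) and R(query[2], doc)
    if op == "OR":
        return R(query[1], doc) or R(query[2], doc)
    if op == "NOT":
        return not R(query[1], doc)
    raise ValueError("unknown operator: " + op)

q = ("AND", ("OR", "vehicle", "car"), "accident")
print([d for d in ("d1", "d2", "d3") if R(q, d)])  # → ['d1']
```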
Example of Boolean Retrieval

Query: (vehicle OR car) AND accident

R((vehicle OR car) AND accident, d1) = 1
R((vehicle OR car) AND accident, d2) = 0
R((vehicle OR car) AND accident, d3) = 0

           d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

Query: vehicle AND (car OR accident)

R(vehicle AND (car OR accident), d1) = 0
R(vehicle AND (car OR accident), d2) = 1
R(vehicle AND (car OR accident), d3) = 0
Drawbacks of the Boolean Model

Retrieval is based on a binary decision criterion
no notion of partial matching
no ranking of the documents is provided (absence of a grading scale)
the query q = t1 OR t2 OR t3 is satisfied by documents containing one, two or three of the terms t1, t2, t3
no weighting of terms, wij ∈ {0,1}

The information need has to be translated into a Boolean expression, which most users find awkward

The Boolean queries formulated by the users are most often too simplistic

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
2.2.2 Vector Space Model

The index can be regarded as an n-dimensional space
wij > 0 whenever ti occurs in dj
each term corresponds to a dimension
to each term ti a unitary vector vec(i) is associated
the unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)

A document can be regarded as
a vector starting from (0,0,0)
a point in space

Example:
           d1  d2
accident    4   3
car         3   2
vehicle     1   3

d1 = (4,3,1) and d2 = (3,2,3) are points in the space spanned by the terms accident, car and vehicle
2.2.2.1 Coordinate Matching

Documents and query are represented as
document vectors vec(dj) = (w1j, w2j, …, wkj)
query vector vec(q) = (w1q, …, wkq)

Vectors have binary values:
wij = 1 if term ti occurs in document dj
wij = 0 otherwise

Ranking:
return the documents containing at least one query term
rank by the number of occurring query terms

Ranking function: scalar product
R(q, d) = q · d = Σ (i=1..n) qi · di
(multiply the components and sum up)
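The scalar product over binary vectors can be sketched as follows; the term list is restricted to the four query terms for brevity.

```python
TERMS = ["accident", "heavy", "vehicle", "vienna"]

def binary_vector(words):
    """Binary document/query vector over the fixed term list."""
    ws = set(words)
    return [1 if t in ws else 0 for t in TERMS]

def score(q, d):
    """Coordinate matching: scalar product of binary vectors."""
    return sum(qi * di for qi, di in zip(q, d))

q  = binary_vector(["accident", "heavy", "vehicle", "vienna"])
d1 = binary_vector(["accident", "heavy", "car", "vienna"])
print(score(q, d1))  # → 3, three query terms occur in d1
```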
Coordinate Matching: Example

Query: accident heavy vehicles vienna
(the query vector represents the terms of the query, cf. stemming)

Result:
q · d1 = 3
q · d2 = 2
q · d3 = 2

           d1  d2  d3  q
accident    1   0   1  1
car         1   1   0  0
cause       0   0   1  0
crowd       0   0   1  0
die         1   0   0  0
drive       0   0   1  0
four        0   0   1  0
heavy       1   0   0  1
injur       0   0   1  0
more        0   1   0  0
morning     1   0   0  0
people      1   0   1  0
quarter     0   1   0  0
register    0   1   0  0
truck       0   0   1  0
trucker     0   0   1  0
vehicle     0   1   0  1
vienna      1   1   1  1
yesterday   1   0   0  0
Assessment of Coordinate Matching

Advantage compared to the Boolean model: ranking

Three main drawbacks:
the frequency of terms in documents is not considered
no weighting of terms
larger documents are privileged
2.2.2.2 Term Weighting

The use of binary weights is too limiting; non-binary weights allow for partial matches

These term weights are used to compute a degree of similarity between a query and each document

How to compute the weights wij and wiq?

A good weight must take into account two effects:
quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)

(Baeza-Yates & Ribeiro-Neto 1999)
TF - Term Frequency
Let freq(i,j) be the raw frequency of term ti within document dj (i.e. number of occurrences of term ti in document dj)
A simple tf factor can be computed as
f(i,j) = freq(i,j)
A normalized tf factor is given by
f(i,j) = freq(i,j) / max(freq(l,j))
where the maximum is computed over all terms which occur within the document dj
           d1  d2  d3  q
accident    2   0   1  1
car         1   1   0  0
cause       0   0   1  0
crowd       0   0   1  0
die         1   0   0  0
drive       0   0   1  0
four        0   0   1  0
heavy       2   0   0  1
injur       0   0   1  0
more        0   2   0  0
morning     1   0   0  0
people      1   0   2  0
quarter     0   1   0  0
register    0   1   0  0
truck       0   0   1  0
trucker     0   0   1  0
vehicle     0   1   0  1
vienna      1   1   1  1
yesterday   1   0   0  0

For reasons of simplicity, in this example f(i,j) = freq(i,j)

(Baeza-Yates & Ribeiro-Neto 1999)
IDF – Inverse Document Frequency
IDF can also be interpreted as the amount of information associated with the term ti . A term occurring in few documents is more useful as an index term than a term occurring in nearly every document
Let
ni be the number of documents containing term ti
N be the total number of documents
A simple idf factor can be computed as idf(i) = 1/ni
A normalized idf factor is given by idf(i) = log (N/ni)
the log is used to make the values of tf and idf comparable.
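Both idf variants from this slide can be sketched directly:

```python
import math

def idf_simple(n_i):
    """Simple idf factor: 1 / n_i."""
    return 1 / n_i

def idf_normalised(n_i, N):
    """Normalised idf factor: log(N / n_i)."""
    return math.log(N / n_i)

print(idf_simple(2))          # a term occurring in 2 documents → 0.5
print(idf_normalised(1, 3))   # a rare term gets a high idf
```

Note that a term occurring in every document gets idf_normalised = log(1) = 0, matching the observation that such a term is useless as an index term.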
Example with TF and IDF
In this example, a simple tf factor f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used

It is of advantage to store IDF and TF separately

           IDF   d1  d2  d3
accident   0.5    2   0   1
car        0.5    1   1   0
cause      1      0   0   1
crowd      1      0   0   1
die        1      1   0   0
drive      1      0   0   1
four       1      0   0   1
heavy      1      2   0   0
injur      1      0   0   1
more       1      0   2   0
morning    1      1   0   0
people     0.5    1   0   2
quarter    1      0   1   0
register   1      0   1   0
truck      1      0   0   1
trucker    1      0   0   1
vehicle    1      0   1   0
vienna     0.33   1   1   1
yesterday  1      1   0   0
Indexing a New Document

Changes to the indexes when adding a new document d:
a new document vector with tf factors for d is created
idf factors for terms occurring in d are adapted
all other document vectors remain unchanged
Ranking

Scalar product: computes co-occurrences of terms in document and query
Drawback: the scalar product privileges large documents over small ones

Euclidean distance between the endpoints of the vectors
Drawback: the Euclidean distance privileges small documents over large ones

Angle between the vectors
the smaller the angle between query and document vector, the more similar they are
the angle is independent of the size of the document
the cosine is a good measure of the angle
Cosine Ranking Formula

The more the directions of query q and document dj coincide, the more relevant is dj

The cosine formula takes into account the ratio of the terms, not their absolute number

Let α be the angle between q and dj. Because all values wij >= 0, the angle is between 0° and 90°:
the larger α, the smaller cos α
the smaller α, the larger cos α
cos 0° = 1, cos 90° = 0

cos(q, dj) = (q · dj) / (|q| · |dj|)
The Vector Model

The best term-weighting schemes use weights which are given by
wij = f(i,j) * log(N/ni)
this strategy is called a tf-idf weighting scheme

For the query term weights, a suggestion is
wiq = (0.5 + 0.5 * freq(i,q) / max(freq(l,q))) * log(N/ni)

(Baeza-Yates & Ribeiro-Neto 1999)
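The tf-idf weighting wij = f(i,j) * log(N/ni) combined with cosine ranking can be sketched as follows. For simplicity the query is weighted like a document (raw tf), not with the augmented formula above, and the toy collection is hypothetical.

```python
import math

def tfidf(doc_terms, docs):
    """tf-idf weights w_ij = freq(i,j) * log(N / n_i) for one document,
    given the whole collection (list of term lists)."""
    N = len(docs)
    n = lambda t: sum(1 for d in docs if t in d)   # document frequency
    return {t: doc_terms.count(t) * math.log(N / n(t))
            for t in set(doc_terms)}

def cosine(q, d):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(q) * norm(d)) if dot else 0.0

docs = [["accident", "heavy", "car"], ["vehicle", "car"], ["accident", "truck"]]
weights = [tfidf(d, docs) for d in docs]
query = tfidf(["accident", "heavy"], docs)
print([round(cosine(query, w), 3) for w in weights])
```

The first document shares both query terms (including the rare "heavy") and therefore ranks highest; the second shares none and scores 0.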
The Vector Model

The vector model with tf-idf weights is a good ranking strategy for general collections

The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.

Advantages:
term weighting improves the quality of the answer set
partial matching allows retrieval of documents that approximate the query conditions
the cosine ranking formula sorts documents according to their degree of similarity to the query

Disadvantage:
assumes independence of index terms; not clear that this is bad, though

(Baeza-Yates & Ribeiro-Neto 1999)
2.2.3 Extensions of the Classical Models

Combination of
Boolean model
vector model
indexing with and without preprocessing

Extended index with additional information like
document format (.doc, .pdf, …)
language

Using information about links in hypertext
link structure
anchor text
Boolean Operators in the Vector Model

Many search engines allow queries with Boolean operators, e.g.
(vehicle OR car) AND accident

Retrieval:
Boolean operators are used to select relevant documents
in the example, only documents containing „accident" and either „vehicle" or „car" are considered relevant
ranking of the relevant documents is based on the vector model
tf-idf weighting
cosine ranking formula

           d1  d2  d3  q
accident    2   0   1  1
car         1   1   0  0
cause       0   0   1  0
crowd       0   0   1  0
die         1   0   0  0
drive       0   0   1  0
four        0   0   1  0
heavy       2   0   0  1
injur       0   0   1  0
more        0   2   0  0
morning     1   0   0  0
people      1   0   2  0
quarter     0   1   0  0
register    0   1   0  0
truck       0   0   1  0
trucker     0   0   1  0
vehicle     0   1   0  1
vienna      1   1   1  1
yesterday   1   0   0  0
Queries with Wild Cards in the Vector Model

Vector model based on an index without preprocessing:
the index contains all word forms occurring in the documents
queries allow wildcards (masking and truncation), e.g.
accident heavy vehicle* vienna

Principle of query answering:
first, wildcards are expanded to all matching terms (here vehicle* matches „vehicles")
ranking according to the vector model

            d1  d2  d3  q
accident     2   0   1  1
car          1   0   0  0
cars         0   1   0  0
causes       0   0   1  0
crowd        0   0   1  0
died         1   0   0  0
drove        0   0   1  0
four         0   0   1  0
heavy        2   0   0  1
injured      0   0   1  0
more         0   2   0  0
morning      1   0   0  0
people       1   0   2  0
quarter      0   1   0  0
registered   0   1   0  0
truck        0   0   1  0
trucker      0   0   1  0
vehicles     0   1   0  1
vienna       1   1   1  1
yesterday    1   0   0  0
Using Link Information in Hypertext

Ranking: the link structure is used to calculate a quality ranking for each web page
PageRank®
HITS – Hypertext Induced Topic Selection (authority and hub)
Hilltop

Indexing: the text of a link (anchor text) is associated both
with the page the link is on and
with the page the link points to
The PageRank Calculation

PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 1998 1)

PageRank uses the link structure of web pages

Original version of the PageRank calculation:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

with
PR(A) being the PageRank of page A,
PR(Ti) being the PageRank of the pages Ti that contain a link to page A,
C(Ti) being the number of links going out of page Ti,
d being a damping factor with 0 <= d <= 1

1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems, Vol. 30, 1998, pages 107-117. http://www-db.stanford.edu/~backrub/google.html or http://infolab.stanford.edu/pub/papers/google.pdf
The PageRank Calculation - Explanation
The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A
The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: This means that the more outbound links a page Ti has, the less will page A benefit from a link to it on page Ti.
The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.
Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Source: http://pr.efactory.de/e-pagerank-algorithm.shtml
Damping Factor and the Random Surfer Model
The PageRank algorithm and the damping factor are motivated by the model of a random surfer. The random surfer finds a page A by
following a link from a page Ti to page A or by random choice of a web page (e.g. typing the URL).
The probability that the random surfer clicks on a particular link is given by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti)
The justification of the damping factor is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.
d is the probability that the random surfer keeps clicking on links; this is why the sum of PageRanks is multiplied by d
(1-d) is the probability that the surfer jumps to another page at random after he stops clicking links. Regardless of inbound links, the probability that the random surfer jumps to a given page is always (1-d), so every page always has a minimum PageRank
(According to Brin and Page d = 0.85 is a good value)
Source: http://pr.efactory.de/e-pagerank-algorithm.shtml
Calculation of the PageRank - Example
We regard a small web consisting of only three pages A, B and C with the link structure shown in the figure
To keep the calculation simple d is set to 0.5
These are the equations for the PageRank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Solving these equations we get the following PageRank values for the single pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
Source: http://pr.efactory.de/e-pagerank-algorithmus.shtml
Iterative Calculation of the PageRank - Example
Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
Source: http://pr.efactory.de/d-pagerank-algorithmus.shtml
Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of PageRank values
each page is assigned an initial starting value
the PageRanks of all pages are then calculated in several computation cycles
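The iterative computation can be sketched in a few lines of Python. The link structure (A links to B and C, B links to C, C links to A) is the one implied by the example equations, with d = 0.5:

```python
# Iterative PageRank for the three-page example web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # outgoing links
d = 0.5                                            # damping factor
pr = {page: 1.0 for page in links}                 # initial starting value 1

for _ in range(50):                                # converges after ~12 cycles
    new = {}
    for page in links:
        # pages Ti that contain a link to this page
        inbound = [q for q in links if page in links[q]]
        new[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inbound)
    pr = new

print(pr)  # approaches PR(A)=1.0769..., PR(B)=0.7692..., PR(C)=1.1538...
```

After a few iterations the values match the exact solutions 14/13, 10/13 and 15/13 from the previous slide.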
Alternative Link Analysis Algorithms (I): HITS
Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 36, No. 5, pp. 604-632, 1999, http://www.cs.cornell.edu/home/kleinber/auth.pdf
Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999
HITS rates Web pages for their authority and hub values:
The authority value estimates the value of the content of the page; a good authority is a page that is pointed to by many good hubs
the hub value estimates the value of its links to other pages; a good hub is a page that points to many good authorities (examples of hubs are good link collections);
Every page i is assigned a hub weight hi and an authority weight ai, which are updated iteratively:
ai = sum of the hub weights hj of all pages j that link to page i
hi = sum of the authority weights aj of all pages j that page i links to
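The mutual reinforcement of hubs and authorities can be sketched as follows; the four-page link graph is an assumed example (two link-collection pages pointing at two content pages), not from the slides:

```python
import math

# HITS hub/authority iteration on an assumed example graph:
# p1 and p2 link to p3 and p4, which have no outgoing links.
links = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": [], "p4": []}
hub  = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # authority of p: sum of the hub weights of pages linking to p
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # hub of p: sum of the authority weights of the pages p links to
    hub = {p: sum(auth[t] for t in links[p]) for p in links}
    # normalise so the weights stay bounded
    na = math.sqrt(sum(a * a for a in auth.values()))
    nh = math.sqrt(sum(h * h for h in hub.values()))
    auth = {p: a / na for p, a in auth.items()}
    hub  = {p: h / nh for p, h in hub.items()}

print(auth, hub)  # p3/p4 come out as authorities, p1/p2 as hubs
```

The link-collection pages receive high hub weights, the pages they point to high authority weights, exactly as the definitions above describe.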
Alternative Link Analysis Algorithms (II): Hilltop
The Hilltop algorithm1) rates documents based on their incoming links from so-called expert pages.
Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic.
Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.
Websites which have backlinks from many of the best expert pages are authorities and are ranked high.
A good directory page is an example of an expert page (cp. hubs).
Determination of expert pages is a central point of the hilltop algorithm.
1) The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003, Google bought the patent on the algorithm (see also http://pagerank.suchmaschinen-doktor.de/hilltop.html)
Anchor Text
The Google search engine uses the text of links twice:
First, the text of a link is associated with the page that the link is on.
In addition, it is associated with the page the link points to.
Advantages:
Anchors provide additional descriptions of a web page – from a user‘s point of view
Documents without text can be indexed, such as images, programs, and databases.
Disadvantage: Search results can be manipulated
(cf. Google Bombing1))
A Google bomb influences the ranking of the search engine. It is created when a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. In the meantime, Google bombs have been defused by Google.
The polar bear Knut was born in the zoo of Berlin
Natural Language Queries
Natural language queries are treated like any other query:
Stop word elimination
Stemming
but no interpretation of the meaning of the query
i need information about accidents with cars and other vehicles
is equivalent to
information accident car vehicle
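The reduction of a natural language query to index terms can be sketched like this; the stop word list and the suffix-stripping "stemmer" are toy assumptions (a real system would use a full stop word list and e.g. Porter stemming):

```python
# Stop word elimination and (very simplified) stemming of a query.
STOP_WORDS = {"i", "need", "about", "with", "and", "other"}

def stem(word):
    # toy stemmer: strips a plural "s"; stands in for a real stemming algorithm
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(query):
    return [stem(w) for w in query.lower().split() if w not in STOP_WORDS]

print(normalize("i need information about accidents with cars and other vehicles"))
# -> ['information', 'accident', 'car', 'vehicle']
```

No interpretation of the meaning takes place: the sentence and the bare term list produce the same index terms.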
Searching Similar Documents
It is often difficult to express the information need as a query
An alternative search method is to search for documents similar to a given document d
term       IDF   d1  d2  d3
accident   0.5    2   0   1
car        0.5    1   1   0
cause      1      0   0   1
crowd      1      0   0   1
die        1      1   0   0
drive      1      0   0   1
four       1      0   0   1
heavy      1      2   0   0
injur      1      0   0   1
more       1      0   2   0
morning    1      1   0   0
people     0.5    1   0   2
quarter    1      0   1   0
register   1      0   1   0
truck      1      0   0   1
trucker    1      0   0   1
vehicle    1      0   1   0
vienna     0.33   1   1   1
yesterday  1      1   0   0
Finding Similar Documents – Principle and Example
Principle: Use a given document d as a query
Compare all documents di with d
Example (scalar product): Find the documents most similar to d1:
IDF * d1 * d2 = 0.83
IDF * d1 * d3 = 2.33
The approach is the same as for a query: same index, same ranking function
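The two scores can be reproduced with a few lines of Python; only the index entries that are non-zero in the documents involved are listed, and for d3 only the terms it shares with d1 (the others contribute 0 to the score):

```python
# IDF-weighted scalar product between d1 and the other documents.
idf = {"accident": 0.5, "car": 0.5, "die": 1, "heavy": 1, "morning": 1,
       "people": 0.5, "vienna": 0.33, "yesterday": 1,
       "more": 1, "quarter": 1, "register": 1, "vehicle": 1}
d1 = {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1,
      "people": 1, "vienna": 1, "yesterday": 1}
d2 = {"car": 1, "more": 2, "quarter": 1, "register": 1, "vehicle": 1, "vienna": 1}
d3 = {"accident": 1, "people": 2, "vienna": 1}  # terms shared with d1

def sim(a, b):
    # document a is used as the query and compared against document b
    return sum(idf[t] * a[t] * b[t] for t in a if t in b)

print(round(sim(d1, d2), 2), round(sim(d1, d3), 2))  # 0.83 2.33
```

d3 wins because it shares the high-weight terms "accident" and "people" with d1, while d1 and d2 overlap only in "car" and "vienna".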
The Vector Space Model
The vector space model ...
… is relatively simple and clear,
… is efficient,
… ranks documents,
… can be applied to any collection of documents
The model has many heuristic components and parameters, e.g. the determination of index terms, the calculation of tf and idf, and the ranking function
The best parameter setting depends on the document collection
2.3 Implementation of the Index
The vector space model is usually implemented with an inverted index
For each term a pointer references a „posting list“ with an entry for each document containing the term
The posting lists can be implemented as linked lists or
more efficient data structures that reduce the storage requirements (index pruning)
To answer a query, the corresponding posting lists are retrieved and the documents are ranked, i.e. efficient retrieval of the posting lists is essential
Source: D. Grossman, O. Frieder (2004) Information Retrieval, Springer-Verlag
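A minimal inverted index with posting lists can be sketched as follows; the document texts are assumed toy examples:

```python
from collections import defaultdict

# Build the inverted index: term -> posting list of (doc id, term frequency).
docs = {
    "d1": "because of a heavy car accident four people died in vienna",
    "d2": "in vienna a trucker drove into a crowd of people",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        index[term].append((doc_id, tf))   # one posting per document

def retrieve(query):
    # Only the posting lists of the query terms are touched, not the documents.
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf           # simple tf-based ranking
    return sorted(scores.items(), key=lambda item: -item[1])

print(retrieve("people vienna accident"))  # d1 ranked before d2
```

Query answering only reads the posting lists of the query terms, which is what makes the inverted index efficient.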
Implementing the Term Structure as a Trie
Sequentially scanning the index for query terms/posting lists is inefficient
A trie is a tree structure:
each node is an array, with one element for each character
each element contains a link to another node
*) the characters and their order are identical for each node. Therefore they do not need to be stored explicitly.
Example: Structure of a node in a trie*)
Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen – Eine Einführung mit Java. dpunkt Verlag 2004
The Index as a Trie
The leaves of the trie represent the index terms and point to the corresponding posting lists
Searching a term in a trie:
the search starts at the root
subsequently, for each character of the term, the reference to the corresponding subtree is followed until
either a leaf with the term is found, or the search stops without success
(Saake, Sattler 2004)
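A dictionary-based sketch of this structure; the slide's fixed-size character arrays are replaced by Python dicts for brevity:

```python
# Trie whose terminal nodes point to the posting list of the index term.
class TrieNode:
    def __init__(self):
        self.children = {}    # one entry per character
        self.postings = None  # set at the node where an index term ends

def insert(root, term, postings):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings

def search(root, term):
    node = root
    for ch in term:                      # follow one reference per character
        if ch not in node.children:
            return None                  # search stops without success
        node = node.children[ch]
    return node.postings                 # the posting list, or None

root = TrieNode()
insert(root, "accident", ["d1", "d3"])
insert(root, "car", ["d1"])
print(search(root, "accident"), search(root, "truck"))
```

Lookup cost depends only on the length of the term, not on the size of the vocabulary, which is the advantage over sequentially scanning the index.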
Patricia Trees
Idea: Skip irrelevant parts of terms
This is achieved by storing in each node the number of characters to be skipped.
Example:
(Saake, Sattler 2004)Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric
2.4 Evaluating Search Methods
[Figure: Venn diagram showing the set of all documents, the documents found, the relevant documents found, and the relevant documents that are not found]
Performance Measures of Information Retrieval: Recall and Precision
Several different measures for evaluating the performance of information retrieval systems have been proposed; two important ones are:
Recall: fraction of the relevant documents that are successfully retrieved.
answer set DA
relevant documents DR
relevant documents in answer set DRA
Precision: fraction of the documents retrieved that are relevant to the user's information need
R = |DRA| / |DR|
P = |DRA| / |DA|
F-Measure
The F-measure is the harmonic mean of precision and recall:
F = 2 * P * R / (P + R)
In this version, precision and recall are equally weighted.
The more general version, F_β = (1 + β^2) * P * R / (β^2 * P + R), allows giving preference to recall or precision
F2 weights recall twice as much as precision
F0.5 weights precision twice as much as recall
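The measures can be computed directly from the document sets; the relevance judgements below are assumed example data:

```python
# Recall, precision and F-measure for one query result.
relevant = {"d1", "d3", "d5", "d7"}   # DR: assumed relevance judgements
answer   = {"d1", "d2", "d3"}         # DA: documents returned by the system

dra = relevant & answer               # DRA: relevant documents found
recall    = len(dra) / len(relevant)  # R = |DRA| / |DR|
precision = len(dra) / len(answer)    # P = |DRA| / |DA|

def f_measure(p, r, beta=1.0):
    # beta=1 weights P and R equally; beta=2 favours recall, beta=0.5 precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(recall, precision, round(f_measure(precision, recall), 4))
```

Here two of the four relevant documents are found (R = 0.5) and two of the three returned documents are relevant (P = 2/3), giving F1 = 4/7.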
Computing Recall and Precision
Evaluation: Perform a predefined set of queries
The search engine delivers a ranked set of documents
Use the first X documents of the result list as the answer set
Compute recall and precision for the first X documents of the ranked result list.
How do you know, which documents are relevant?
1. A general reference set of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems
2. For companies it is more important to evaluate information retrieval systems using their own documents:
1. Collect a representative set of documents
2. Specify queries and associated relevant documents
3. Evaluate search engines by computing recall and precision for the query results
2.5 User Adaptation
Take into account information about a user to filter documents that are particularly relevant to this user
Relevance Feedback: retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries
Explicit User Profiles: subscription; user-specific weights of terms
Social Filtering: similar users get similar documents
2.5.1 Relevance Feedback given by the User
The user specifies the relevance of each document. Example: for the query "Pisa" only the documents about the education assessment are regarded as relevant
In the next pass, the top ranked documents are only about the education assessment
This example is from the SmartFinder system from empolis. The mindaccess system from Insiders GmbH uses the same technology.
Example:
Relevance Feedback: Probabilistic Model
Assumption: Given a user query, there is an ideal answer set
Idea: An initial answer is iteratively improved based on user feedback
Approach:
An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (usually, only the top 10-20 need to be inspected)
The IR system uses this information to refine the description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
The description of the ideal answer set is modeled in probabilistic terms
(Baeza-Yates & Ribeiro-Neto 1999)
Probabilistic Ranking
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
The model assumes that this probability of relevance depends on the query and the document representations only.
Probabilistic ranking is:
sim(dj, q) = P(R | dj) / P(¬R | dj)
Definitions:
wij ∈ {0,1} (i.e. weights are binary)
sim(dj, q): the similarity of document dj to the query q
dj = (w1j, ..., wtj): the document vector of dj
P(R | dj): the probability that document dj is relevant
P(¬R | dj): the probability that document dj is not relevant
(Baeza-Yates & Ribeiro-Neto 1999)
Computing Probabilistic Ranking
Probabilistic ranking can be computed as:
sim(dj, q) ~ Σi wiq * wij * ( log( P(ki|R) / (1 - P(ki|R)) ) + log( (1 - P(ki|¬R)) / P(ki|¬R) ) )
where
P(ki|R) stands for the probability that the index term ki is present in a document randomly selected from the set R of relevant documents
P(ki|¬R) stands for the probability that the index term ki is present in a document randomly selected from the set of non-relevant documents
wiq is the weight of term ki in the query
wij is the weight of term ki in document dj
(Baeza-Yates & Ribeiro-Neto 1999)
Relevance Feedback: Probabilistic Model
The probabilities that a term ki is present in the set of relevant (non-relevant) documents can be computed as:
P(ki|R) = Vi / V
P(ki|¬R) = (ni - Vi) / (N - V)
with
N: total number of documents
ni: number of documents containing term ki
V: number of relevant documents retrieved by the probabilistic model
Vi: number of relevant documents containing term ki
There are different ways to find the relevant documents V:
Automatically: V can be specified as the top r documents found
By user feedback: The user specifies for each retrieved document whether it is relevant or not
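These estimates can be written down directly. The counts are assumed example values, and the added 0.5 terms are the common smoothing adjustment (described e.g. in Baeza-Yates & Ribeiro-Neto) that avoids zero probabilities when V is small:

```python
import math

# Assumed example counts for one index term ki.
N  = 100   # total number of documents
ni = 20    # documents containing ki
V  = 10    # relevant documents retrieved
Vi = 8     # relevant documents containing ki

p_rel    = (Vi + 0.5) / (V + 1)           # P(ki | R), smoothed
p_nonrel = (ni - Vi + 0.5) / (N - V + 1)  # P(ki | not R), smoothed

# Contribution of ki to the ranking: large when ki is frequent in the
# relevant documents and rare in the non-relevant ones.
weight = math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
print(round(weight, 3))
```

With each feedback round, V and Vi are updated from the user's judgements and the term weights are recomputed.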
2.5.2 Explicit User Profiles
Idea: Use knowledge about the user to provide information that is particularly relevant for him/her
users specify topics of interest as a set of terms
these terms represent the user profile
documents containing the terms of the user profile are preferred
[Figure: profile acquisition derives a user profile from the user's information need and preferences; documents are indexed into a document representation; the ranking function combines the user profile with the index]
User profiles for subscribing to information
user profiles are treated as queries
Example: news feed. As soon as a new document arrives, it is tested for similarity with the user profiles
The vector space model can be applied
A document is regarded as relevant if the ranking reaches a specified threshold
Example: User 1 is interested in any car accident; User 2 is interested in deadly car accidents with trucks
term       IDF   d1  d2  d3  U1  U2
accident   0.5    2   0   1   1   1
car        0.5    1   1   0   1   0
cause      1      0   0   1   0   0
crowd      1      0   0   1   0   0
die        1      1   0   0   0   1
drive      1      0   0   1   0   0
four       1      0   0   1   0   0
heavy      1      2   0   0   0   0
injur      1      0   0   1   0   0
more       1      0   2   0   0   0
morning    1      1   0   0   0   0
people     0.5    1   0   2   0   0
quarter    1      0   1   0   0   0
register   1      0   1   0   0   0
truck      1      0   0   1   1   1
trucker    1      0   0   1   0   0
vehicle    1      0   1   0   1   0
vienna     0.33   1   1   1   0   0
yesterday  1      1   0   0   0   0
User Profiles for Individual Queries
Users specify the importance of terms
User profiles are used as additional term weights
Different ranking for different users
Example: user profiles with term weights
ranking for user 1:
IDF * d1 * U1 * q = 1.4
IDF * d2 * U1 * q = 1.0
IDF * d3 * U1 * q = 0.5
ranking for user 2:
IDF * d1 * U2 * q = 2.2
IDF * d2 * U2 * q = 0.1
IDF * d3 * U2 * q = 0.5
term       IDF   d1  d2  d3  U1   U2   q
accident   0.5    2   0   1  1    1    1
car        0.5    1   1   0  0.8  0.2  0
cause      1      0   0   1  0    0    0
crowd      1      0   0   1  0    0    0
die        1      1   0   0  0    0.8  0
drive      1      0   0   1  0    0    0
four       1      0   0   1  0    0    0
heavy      1      2   0   0  0.2  0.6  1
injur      1      0   0   1  0    0    0
more       1      0   2   0  0    0    0
morning    1      1   0   0  0    0    0
people     0.5    1   0   2  0.5  0.8  0
quarter    1      0   1   0  0    0    0
register   1      0   1   0  0    0    0
truck      1      0   0   1  0.6  1    0
trucker    1      0   0   1  0    0.6  0
vehicle    1      0   1   0  1    0.1  1
vienna     0.33   1   1   1  0    0    1
yesterday  1      1   0   0  0    0    0
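The two user-specific rankings can be recomputed from the table; only the rows with at least one non-zero profile or query weight are kept (the remaining terms contribute nothing to the scores):

```python
# Profile-weighted ranking: score(d, U) = sum over terms of IDF * d * U * q.
rows = {  # term: (IDF, d1, d2, d3, U1, U2, q) -- subset of the slide's table
    "accident": (0.5,  2, 0, 1, 1.0, 1.0, 1),
    "car":      (0.5,  1, 1, 0, 0.8, 0.2, 0),
    "die":      (1.0,  1, 0, 0, 0.0, 0.8, 0),
    "heavy":    (1.0,  2, 0, 0, 0.2, 0.6, 1),
    "people":   (0.5,  1, 0, 2, 0.5, 0.8, 0),
    "truck":    (1.0,  0, 0, 1, 0.6, 1.0, 0),
    "trucker":  (1.0,  0, 0, 1, 0.0, 0.6, 0),
    "vehicle":  (1.0,  0, 1, 0, 1.0, 0.1, 1),
    "vienna":   (0.33, 1, 1, 1, 0.0, 0.0, 1),
}

def rank(doc, user):
    # doc: index 1..3 selects d1..d3; user: index 4 selects U1, 5 selects U2
    return sum(r[0] * r[doc] * r[user] * r[6] for r in rows.values())

user1 = [round(rank(d, 4), 2) for d in (1, 2, 3)]
user2 = [round(rank(d, 5), 2) for d in (1, 2, 3)]
print(user1, user2)  # [1.4, 1.0, 0.5] [2.2, 0.1, 0.5]
```

The same query and index yield different rankings per user: user 1's weight on "vehicle" favours d2, while user 2's weight on "heavy" pushes d1 far ahead.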
Acquisition and Maintenance of User Profiles
There are different ways to specify user profiles
manual: the user specifies topics of interest (and weights) explicitly; selection of predefined terms or a query; Problem: maintenance
user feedback: the user collects relevant documents; terms in the selected documents are regarded as important; Problem: how to motivate the user to give feedback (a similar approach is used by spam filters - classification)
heuristics: observing user behaviour; Example: if a user has kept a document open for a long time, it is assumed that he/she read it and therefore it might be relevant; Problem: heuristics might be wrong
Social Filtering
Idea: Information is relevant if other users who showed similar behaviour regarded the information as relevant. Relevance is specified by the users
User profiles are compared
Example: A simple variant can be found at Amazon: purchases of books and CDs are stored
„people who bought this book also bought …“