1
COMP791A: Statistical Language Processing
Information Retrieval
[M&S] 15.1-15.2
[J&M] 17.3
2
The problem
The standard information retrieval (IR) scenario:
- The user has an information need
- The user types a query that describes the information need
- The IR system retrieves a set of documents from a document collection that it believes to be relevant
- The documents are ranked according to their likelihood of being relevant
Input: a (large) set/collection of documents, and a user query
Output: a (ranked) list of relevant documents
3
Example of IR
4
5
IR within NLP
IR needs to process large volumes of online text, and (traditionally) NLP methods were not robust enough to work on thousands of real-world texts. So IR is:
- not based on NLP tools (ex. syntactic/semantic analysis)
- based (mostly) on simple, shallow techniques that rely mostly on word frequencies
In IR, the meaning of a document is the composition of the meanings of its individual words; the ordering and constituency of words are not taken into account. This is the bag-of-words approach:
"I see what I eat." and "I eat what I see." have the same bag of words, hence the same "meaning" to an IR system.
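As a quick illustration, here is a minimal sketch (my own, not from the slides) showing that the two sentences above receive identical bag-of-words representations:

```python
from collections import Counter

def bag_of_words(text):
    # Count word occurrences; word order is discarded entirely.
    return Counter(text.lower().rstrip(".").split())

s1 = "I see what I eat."
s2 = "I eat what I see."
print(bag_of_words(s1) == bag_of_words(s2))  # True: same bag, same "meaning"
```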
6
2 major topics
Indexing
- representing the document collection using words/terms, for fast access to documents
Retrieval methods
- matching a user query to the indexed documents
- 3 major models:
1. boolean model
2. vector-space model
3. probabilistic model
7
Indexing
Most IR systems use an inverted file to represent the texts in the collection
Inverted file = a table of terms, each with the list of documents that contain that term:
assassination {d1, d4, d95, d5, d90…}
murder {d3, d7, d95…}
Kennedy {d24, d7, d44…}
conspiracy {d3, d55, d90, d98…}
8
Example of an inverted file
For each term:
- DocCnt: how many documents the term occurs in (used to compute IDF)
- FreqCnt: how many times the term occurs across all documents
For each document:
- Freq: how many times the term occurs in this document
- WordPosition: the offsets where these occurrences are found in the document
Word positions are useful:
- to search for terms within n words of each other, to approximate phrases (ex. "car insurance")
  but this is a primitive notion of phrase (just a word/byte position in the document): "car insurance" also matches "insurance for car"
- to generate word-in-context snippets
- to highlight terms in the retrieved document
- ...
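A minimal sketch (my own; the function and variable names are illustrative) of an inverted file that records, for each term, the documents it occurs in and the word positions of each occurrence:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc id -> text.
    Returns: term -> {doc id -> list of word positions}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {"d1": "kennedy assassination conspiracy", "d3": "murder conspiracy"}
index = build_inverted_index(docs)
print(index["conspiracy"])       # {'d1': [2], 'd3': [1]}: Freq and WordPosition per doc
print(len(index["conspiracy"]))  # 2: the DocCnt used for IDF
```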
9
Basic Concept of a Retrieval Model
Documents and queries are represented by vectors of <term, value> pairs
- term: all possible terms that occur in the query/document
- value: presence or absence of the term in the query/document
The value can be binary (0 if the term is absent; 1 if the term is present) or some weight (term frequency, tf.idf, or other):

term:  T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
value: V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
10
Vector-Space Model
Binary values do not tell whether one term is more important than another, so we should weight the terms by importance
The weight of a term (for a document & query) can be its raw frequency or some other measure:

term:   T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
weight: W1 W2 W3 W4 W5 W6 W7 W8 W9 W10
11
Term-by-document matrix
The collection of documents is represented by a matrix of weights called a term-by-document matrix
- 1 column = representation of one document
- 1 row = representation of one term across all documents
- cell wij = weight of term i in document j
- note: the matrix is sparse!

        d1   d2   d3   d4   d5   ...
term1   w11  w12  w13  w14  w15
term2   w21  w22  w23  w24  w25
term3   w31  w32  w33  w34  w35
...
termN   wn1  wn2  wn3  wn4  wn5
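A minimal sketch (my own, illustrative names) of building such a matrix from raw term frequencies, stored as a dict of dicts so the many zero cells take no space:

```python
from collections import Counter

def term_by_document(docs):
    # docs: dict doc_id -> text.
    # Returns matrix[term][doc_id] = raw frequency of the term in that document;
    # absent cells are implicitly zero, keeping the matrix sparse.
    matrix = {}
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            matrix.setdefault(term, {})[doc_id] = freq
    return matrix
```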
12
An example
The collection:
d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary}
d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary}
d3 = {language and complexity the chomsky hierarchy how to tell if a language isn’t regular the pumping lemma are English and other languages regular languages ? is natural language context-free complexity and human processing summary}
The query: Q = {speech language processing}
14
An example (con't)
Using raw term frequencies:

              d1  d2  d3  Q
introduction   …   …   …  …
knowledge      …   …   …  …
…
speech         1   6   0  1
language       2   0   5  1
processing     1   1   1  1
…

The vectors for the documents and the query can be seen as points in a multi-dimensional space, where each dimension is a term from the query.

[Figure: d1 (1,2,1), d2 (6,0,1), d3 (0,5,1) and q (1,1,1) plotted as points in 3-D space, with axes term 1 (speech), term 2 (language) and term 3 (processing).]
15
Document similarity
The longer the document, the more chances it has of being retrieved:
- this makes sense, because it may contain many of the query's terms
- but then again, it may also contain lots of non-pertinent terms…
We want to treat vectors with the same distribution of words alike, ex. vector (1, 2, 1) and vector (2, 4, 2)
So we can normalize raw term frequencies to convert all vectors to a standard length (ex. 1)
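A minimal sketch (my own) of length normalization; note that (1, 2, 1) and (2, 4, 2) map to the same unit vector:

```python
import math

def normalize(vector):
    # Scale a vector to unit Euclidean length.
    length = math.sqrt(sum(w * w for w in vector))
    return [w / length for w in vector]

print(normalize([1, 2, 1]))  # [0.408..., 0.816..., 0.408...]
print(normalize([2, 4, 2]))  # the same unit vector: only the distribution matters
```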
16
Example
Query = speech language

[Figure: the original vectors d1 (1, 2), d2 (6, 0), d3 (0, 5) and q (1, 1) plotted in 2-D space, with "speech" on the x-axis and "language" on the y-axis.]

Normalization:
- the length of a vector does not matter,
- its angle does.
17
The cosine measure
The similarity between two documents (or between a document and a query) is the cosine of the angle (in N dimensions) between the 2 vectors
- if 2 document-vectors are identical, they will have a cosine of 1
- if 2 document-vectors are orthogonal (i.e. share no common term), they will have a cosine of 0

[Figure: three panels, each showing a document vector (D) and a query vector (Q): an orthogonal pair with cos(D, Q) = 0, an intermediate pair with cos(D, Q) = 0.7, and an identical pair with cos(D, Q) = 1.]
18
The cosine measure (con't)
The cosine of 2 vectors (in N dimensions), also known as the normalized inner product:

$\cos(D, Q) = \frac{D \cdot Q}{|D|\,|Q|} = \frac{\sum_{i=1}^{N} d_i q_i}{\sqrt{\sum_{i=1}^{N} d_i^2} \times \sqrt{\sum_{i=1}^{N} q_i^2}}$

where the numerator is the inner product and $|D|$, $|Q|$ are the lengths of the vectors.
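A minimal sketch (my own) of this formula:

```python
import math

def cosine(d, q):
    # Normalized inner product of two equal-length vectors.
    inner = sum(di * qi for di, qi in zip(d, q))
    lengths = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return inner / lengths
```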
19
If you want proof… in 2-D space
To have vectors of length 1 (normalized vectors), divide all components by the length of the vector. In 2-dimensional space:

$L = \sqrt{X^2 + Y^2}$ (length of the vector)

$X' = \frac{X}{L} = \frac{X}{\sqrt{X^2 + Y^2}}$ (normalized X coordinate)

$Y' = \frac{Y}{L} = \frac{Y}{\sqrt{X^2 + Y^2}}$ (normalized Y coordinate)

$L' = \sqrt{X'^2 + Y'^2} = 1$ (normalized length)
can be skipped
20
Normalized vectors
Query = speech language

Q (1,1) → normalized Q' (0.71, 0.71), since $L_Q = \sqrt{1^2 + 1^2} = 1.41$
d1 (1,2) → normalized d1' (0.45, 0.89), since $L_{d1} = \sqrt{1^2 + 2^2} = 2.24$
d2 (6,0) → normalized d2' (1, 0), since $L_{d2} = \sqrt{6^2 + 0^2} = 6$
d3 (0,5) → normalized d3' (0, 1), since $L_{d3} = \sqrt{0^2 + 5^2} = 5$

[Figure: the normalized vectors d1' (0.45, 0.89), d2' (1, 0), d3' (0, 1) and Q' (0.71, 0.71), all of length 1, with "speech" on the x-axis and "language" on the y-axis.]
can be skipped
21
Similarity between 2 vectors (2-D)
In 2-D (i.e. N = 2; nb of terms = 2), with the original vectors Q = (X_q, Y_q) and D = (X_d, Y_d), similarity is the inner product:

$sim(Q, D) = \sum_{i=1}^{n} w_{iq} \times w_{id} = (X_q \times X_d) + (Y_q \times Y_d)$

With the normalized vectors

$Q' = (X'_q, Y'_q) = \left(\frac{X_q}{L_q}, \frac{Y_q}{L_q}\right)$ and $D' = (X'_d, Y'_d) = \left(\frac{X_d}{L_d}, \frac{Y_d}{L_d}\right)$

this becomes

$sim(Q, D) = \frac{X_q}{L_q}\frac{X_d}{L_d} + \frac{Y_q}{L_q}\frac{Y_d}{L_d} = \frac{X_q X_d + Y_q Y_d}{L_q L_d} = \frac{X_q X_d + Y_q Y_d}{\sqrt{X_q^2 + Y_q^2} \times \sqrt{X_d^2 + Y_d^2}}$
can be skipped
22
Similarity in the general case (N-D)
In the general case of N dimensions (N terms):

$sim(Q, D) = \frac{\sum_{i=1}^{N} (w_{iq} \times w_{id})}{\sqrt{\sum_{i=1}^{N} w_{iq}^2} \times \sqrt{\sum_{i=1}^{N} w_{id}^2}}$

which is the cosine of the angle between the vector D and vector Q in N dimensions. For normalized vectors, the denominator is 1, so:

$sim(Q, D) = \sum_{i=1}^{N} (w_{iq} \times w_{id})$
can be skipped
23
The example again
Q = {speech language processing}, query (1,1,1)
d1 (1,2,1), d2 (6,0,1), d3 (0,5,1)

              d1  d2  d3  Q
introduction   1   0   0  0
knowledge      1   0   0  0
…
speech         1   6   0  1
language       2   0   5  1
processing     1   1   1  1
…

$sim(D, Q) = \cos(D, Q) = \frac{D \cdot Q}{|D|\,|Q|} = \frac{\sum_{i=1}^{N} d_i q_i}{\sqrt{\sum_{i=1}^{N} d_i^2} \times \sqrt{\sum_{i=1}^{N} q_i^2}}$

$sim(d_1, Q) = \frac{(1 \times 1) + (2 \times 1) + (1 \times 1)}{\sqrt{1^2 + 2^2 + 1^2} \times \sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{6} \times \sqrt{3}} = 0.943$

$sim(d_2, Q) = \frac{(6 \times 1) + (0 \times 1) + (1 \times 1)}{\sqrt{6^2 + 0^2 + 1^2} \times \sqrt{1^2 + 1^2 + 1^2}} = \frac{7}{\sqrt{37} \times \sqrt{3}} = 0.664$

$sim(d_3, Q) = \frac{(0 \times 1) + (5 \times 1) + (1 \times 1)}{\sqrt{0^2 + 5^2 + 1^2} \times \sqrt{1^2 + 1^2 + 1^2}} = \frac{6}{\sqrt{26} \times \sqrt{3}} = 0.680$

So d1 is ranked first, then d3, then d2.
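A quick check (my own sketch) reproducing these three similarities:

```python
import math

def cosine(d, q):
    inner = sum(di * qi for di, qi in zip(d, q))
    return inner / (math.sqrt(sum(x * x for x in d)) *
                    math.sqrt(sum(x * x for x in q)))

q = (1, 1, 1)
for name, d in [("d1", (1, 2, 1)), ("d2", (6, 0, 1)), ("d3", (0, 5, 1))]:
    print(name, round(cosine(d, q), 3))  # d1 0.943, d2 0.664, d3 0.68
```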
24
Term weights
So far, we have used raw term frequency as the weights. The core of most weighting functions:
- tf_ij (term frequency): the frequency of term i in document j
  if a term appears often in a document, then it describes the document contents well
  → an intra-document characterization
- df_i (document frequency): the number of documents in the collection containing term i
  if a term appears in many documents, then it is not useful for distinguishing a document
  → an inter-document characterization; used to compute idf
25
tf.idf weighting functions
The most widely used family of weighting functions. Let M = the number of documents in the collection. The Inverse Document Frequency of term i measures the weight of term i for the query:

$idf_i = \log\frac{M}{df_i}$

Intuitively, if M = 1000:
- if df_i = 1000 → log(1) = 0 → term i is ignored! (it appears in all docs)
- if df_i = 10 → log(100) = 2 → term i has a weight of 2 in the query
- if df_i = 1 → log(1000) = 3 → term i has a weight of 3 in the query

The weight of term i in document d is then: $w_{id} = tf_{id} \times idf_i$

This yields a family of tf.idf functions, for example:

$w_{id} = \log(tf_{id}) \times \log\frac{M}{df_i}$

$w_{id} = \left(0.5 + 0.5 \times \frac{tf_{id}}{\max_j tf_{jd}}\right) \times \log\frac{M}{df_i}$

where $\max_j tf_{jd}$ is the frequency of the most frequent term j in document d.
26
Evaluation: Precision & Recall

                                   In reality, the document is…
                                   pertinent   non-pertinent
System says it is pertinent            A            B
System says it is not pertinent        C            D

$Precision = \frac{A}{A+B}$ = (pertinent docs that were retrieved) / (all docs that were retrieved)

$Recall = \frac{A}{A+C}$ = (pertinent docs that were retrieved) / (all pertinent docs, i.e. those that should have been retrieved)

Recall and precision measure how good a set of retrieved documents is compared with an ideal set of relevant documents
- Recall: what proportion of the relevant documents is actually retrieved?
- Precision: what proportion of the retrieved documents is really relevant?
27
Evaluation: Example of P&R
Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389

system1 retrieves: d123 d84 d56
Precision: ?? Recall: ??

system2 retrieves: d123 d84 d56 d6 d8 d9
Precision: ?? Recall: ??
28
Evaluation: Example of P&R
Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389

system1: d123 d84 d56
Precision: 66% (2/3) Recall: 20% (2/10)

system2: d123 d84 d56 d6 d8 d9
Precision: 50% (3/6) Recall: 30% (3/10)
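A minimal sketch (my own) that reproduces these numbers with set operations:

```python
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d123", "d389"}

def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & relevant)  # A: pertinent docs that were retrieved
    return hits / len(retrieved), hits / len(relevant)

print(precision_recall(["d123", "d84", "d56"], relevant))                     # (0.67, 0.2)
print(precision_recall(["d123", "d84", "d56", "d6", "d8", "d9"], relevant))  # (0.5, 0.3)
```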
29
Evaluation: Problems with P&R
P&R do not evaluate the ranking: the rankings (d123, d84) and (d84, d123) get the same P&R, even though only one of them puts the relevant document first
So other measures are often used:
- document cutoff levels
- P&R curves
- ...
30
Evaluation: Document cutoff levels
Fix the number of documents retrieved at several levels (ex. top 5, top 10, top 20, top 100, top 500…) and measure precision at each of these levels.
Example (assuming the 5 relevant documents are d1-d5, which matches the precision values below):

rank   system 1   system 2   system 3
 1       d1         d10        d6
 2       d2         d9         d1
 3       d3         d8         d2
 4       d4         d7         d10
 5       d5         d6         d9
 6       d6         d1         d3
 7       d7         d2         d5
 8       d8         d3         d4
 9       d9         d4         d7
10       d10        d5         d8

precision at 5:   1.0   0.0   0.4
precision at 10:  0.5   0.5   0.5
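A minimal sketch (my own) of precision at a cutoff, reproducing system 2's column under the same assumption about the relevant documents:

```python
relevant = {"d1", "d2", "d3", "d4", "d5"}  # assumption: the 5 relevant documents

def precision_at(k, ranking, relevant):
    # Precision computed over only the top-k retrieved documents.
    return len(set(ranking[:k]) & relevant) / k

system2 = ["d10", "d9", "d8", "d7", "d6", "d1", "d2", "d3", "d4", "d5"]
print(precision_at(5, system2, relevant))   # 0.0
print(precision_at(10, system2, relevant))  # 0.5
```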
31
Evaluation: P&R curve
Measure precision at different levels of recall; usually, precision at 11 recall levels (0%, 10%, 20%, …, 100%)

[Figure: a curve of precision (y-axis, 0%-100%) against recall (x-axis, 0%-100%).]
32
Which system performs better?

[Figure: the P&R curves of two systems plotted on the same precision (0%-100%) vs. recall (0%-100%) axes.]
33
Evaluation: A Single Value Measure
We cannot take the (arithmetic) mean of P&R:
- if R = 50% and P = 50%, M = 50%
- if R = 100% and P = 10%, M = 55% (not fair)
Take the harmonic mean instead; HM is high only when both P&R are high:

$HM = \frac{2}{\frac{1}{R} + \frac{1}{P}}$

- if R = 50% and P = 50%, HM = 50%
- if R = 100% and P = 10%, HM = 18.2%
Or take a weighted harmonic mean, with $w_r$ the weight of R, $w_p$ the weight of P, $a = 1/w_r$, $b = 1/w_p$:

$WHM = \frac{a + b}{\frac{a}{R} + \frac{b}{P}}$

Letting $\beta^2 = a/b$ and dividing numerator and denominator by b:

$WHM = \frac{\beta^2 + 1}{\frac{\beta^2}{R} + \frac{1}{P}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$

… which is called the F-measure.
34
Evaluation: the F-measure
A weighted combination of precision and recall:

$F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$

β represents the relative importance of precision and recall:
- when β = 1, precision & recall have the same importance
- when β > 1, recall is favored
- when β < 1, precision is favored
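A small sketch (my own) of the F-measure, checking the harmonic-mean examples from the previous slide:

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic combination of precision p and recall r;
    # beta > 1 favors recall, beta < 1 favors precision.
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(f_measure(0.5, 0.5))            # 0.5
print(round(f_measure(0.1, 1.0), 3))  # 0.182: perfect recall cannot mask poor precision
```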
35
Evaluation: How to evaluate
We need a test collection:
- a document collection (a few thousand to a few million documents)
- a set of queries
- a set of relevance judgements
But must humans check all documents? Instead, use pooling (as in TREC):
- take the top 100 from every submission/system
- remove duplicates
- manually assess only these
36
Evaluation: TREC
Text REtrieval Conference/Competition
- run by NIST (National Institute of Standards and Technology)
- 13th edition in 2004
Collection: about 3 gigabytes, > 1 million documents
- newswire & news text (AP, WSJ, …)
Queries + relevance judgments
- queries devised and judged by annotators
Participants
- various research and commercial groups compete
Tracks
- cross-lingual, filtering, genome, video, Web, QA, ...
37