Speech & NLP (Fall 2014): Information Retrieval
Speech & NLP
www.vkedco.blogspot.com
Information Retrieval
Texts as Feature Vectors, Vector Spaces,
Vocabulary Normalization through Stemming & Stoplisting,
Porter’s Algorithm for Suffix Stripping,
Term Weighting, Query Expansion, Precision & Recall
Vladimir Kulyukin
Outline
● Texts as Feature Vectors
● Vector Space Model
● Vocabulary Normalization through Stemming & Stoplisting
● Porter’s Algorithm for Suffix Stripping (aka Porter’s Stemmer)
● Term Weighting
● Query Expansion
● Precision & Recall
Texts as Feature Vectors
Text as Collection of Words
● Any text can be viewed as a collection of words (collections, unlike sets, allow for duplicates)
● Various techniques can be designed to compute different
properties of texts: most frequent word, least frequent word,
frequency of a word in a text, word n-grams, word co-occurrence
probabilities, part of speech, etc.
● Each such technique is a feature extractor: it extracts from text
specific features (e.g., a single word) and assigns to them
specific weights (e.g., the frequency of that word in the text) or
symbols (part of speech)
● Feature extraction turns a text from a collection of words into a
feature vector
Information Retrieval
● Information Retrieval (IR) is an area of NLP that
deals with storage and retrieval of digital media
● The primary focus of IR has been digital texts
● Other media such as images, videos, and audio files have received increasing attention more recently
Basic IR Terminology
● A document is an indexable and retrievable unit of digital text
● A collection is a set of documents that can be searched by users
● A term is a wordform that occurs in a collection
● A query is a set of terms
Vector Space Model
Background
● The Vector Space Model of IR was invented by G. Salton in the early 1970s
● Document collection is a vector space
● Terms found in texts are dimensions of that vector
space
● Documents are vectors in the vector space
● Term weights are coordinates along specific
dimensions
Example: A 3D Feature Vector Space
● Suppose that all texts in our universe consist of three words w1,
w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
– T1 = “w1 w1 w2”
– T2 = “w3 w2”
– T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and maps it to its frequency in that text
● Since there are three words, each feature vector has 3
dimensions; hence, we have a 3D vector space
Vector Space as Feature Vector Table
|    | w1 | w2 | w3 |
|----|----|----|----|
| T1 | 2  | 1  | 0  |
| T2 | 0  | 1  | 1  |
| T3 | 1  | 0  | 2  |

Each $T_i$ is a text document; its row of weights $\vec{T}_i$ is its feature vector.
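For concreteness, here is a minimal Python sketch (not from the original slides) that computes this table from the three texts:

```python
from collections import Counter

# The three toy texts and the three-word vocabulary from the slides
texts = {"T1": "w1 w1 w2", "T2": "w3 w2", "T3": "w3 w3 w1"}
vocab = ["w1", "w2", "w3"]

def to_vector(text, vocab):
    """Map a text to its term-frequency feature vector."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

for name, text in texts.items():
    print(name, to_vector(text, vocab))
# T1 [2, 1, 0], T2 [0, 1, 1], T3 [1, 0, 2]
```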
3D Vector Space
[Figure: the three documents plotted as vectors along the axes w1, w2, and w3]
T1 = (2, 1, 0), T2 = (0, 1, 1), T3 = (1, 0, 2)
Another Example: A 3D Feature Vector Space
● Suppose that all texts in our universe consist of three words w1,
w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
– T1 = “w1 w1 w2”
– T2 = “w3 w2”
– T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word in a text and simply records its presence (1) or absence (0) in the document
Vector Space as Binary Feature Vector Table
|    | w1 | w2 | w3 |
|----|----|----|----|
| T1 | 1  | 1  | 0  |
| T2 | 0  | 1  | 1  |
| T3 | 1  | 0  | 1  |
Matching Queries Against Vector Tables
● Let twf be a term weighting function that assigns a numerical
weight to a specific term in a specific document
● For example, if the query q = “w1 w3”, i.e., the user enters “w1 w3”, then
$\vec{q} = (twf(q, w_1), twf(q, w_2), twf(q, w_3))$
● If the feature vector table is binary, then $\vec{q} = (1, 0, 1)$
● One similarity measure that can be used to rank documents against a binary query is the dot product:
$sim(\vec{q}, \vec{T}_i) = \sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)$, where n is the dimension of the vector space (e.g., n = 3)
Matching Queries Against Vector Tables
● Suppose the query q = “w1 w3” and the feature vector table is binary; then $\vec{q} = (1, 0, 1)$
● Below are the binary (dot product) similarity coefficients for each document in our 3D document collection (n = 3):

$sim(\vec{q}, \vec{T}_1) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_1, w_k) = 1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 = 1$

$sim(\vec{q}, \vec{T}_2) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_2, w_k) = 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 = 1$

$sim(\vec{q}, \vec{T}_3) = \sum_{k=1}^{3} twf(q, w_k) \cdot twf(T_3, w_k) = 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 = 2$
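A short Python sketch of this ranking (my own illustration of the dot product over the binary table above):

```python
# Binary feature vector table and query q = "w1 w3" from the slides
table = {"T1": [1, 1, 0], "T2": [0, 1, 1], "T3": [1, 0, 1]}
q = [1, 0, 1]

def dot(u, v):
    """Dot-product similarity: sum of coordinate-wise products."""
    return sum(a * b for a, b in zip(u, v))

for name, doc in table.items():
    print(name, dot(q, doc))  # T1 -> 1, T2 -> 1, T3 -> 2
```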
Matching Queries Against Vector Tables
● Another common metric is the cosine, which equals 1 for identical vectors and 0 for orthogonal vectors:

$sim(\vec{q}, \vec{T}_i) = \dfrac{\sum_{k=1}^{n} twf(q, w_k) \cdot twf(T_i, w_k)}{\sqrt{\sum_{k=1}^{n} twf(q, w_k)^2}\ \sqrt{\sum_{k=1}^{n} twf(T_i, w_k)^2}}$
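A corresponding cosine sketch (my own; `math.sqrt` supplies the vector norms):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product normalized by the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

q = [1, 0, 1]
print(cosine(q, [2, 0, 2]))  # 1.0: same direction as q
print(cosine(q, [0, 1, 0]))  # 0.0: orthogonal to q
```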
Two Principal Tasks for Vector Space Model
● If the vector space model is to be used, we have to
– determine how to compute terms (vocabulary
normalization)
– determine how to assign weights to terms in
individual documents (term weighting)
Vocabulary Normalization
through
Stemming & Stoplisting
Vocabulary Normalization
● Texts contain many words that are morphologically related:
CONNECT, CONNECTED, CONNECTING, CONNECTION,
CONNECTIONS
● There are also many words in most texts that do not distinguish
them from other texts: TO, UP, FROM, UNTIL, THE, A, BY, etc.
● Stemming is the operation of conflating different wordforms into a single wordform, called a stem; CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are all conflated to CONNECT
● Stoplisting is the operation of removing wordforms that do not
distinguish texts from each other
● Stemming & stoplisting are vocabulary normalization procedures
Vocabulary Normalization
● Stemming & stoplisting are two most common vocabulary
normalization procedures
● Both procedures are aimed at standardizing the indexing
vocabulary
● Both procedures reduce the size of the indexing
vocabulary, which is a great time and space booster
● After vocabulary normalization is done, the remaining
words are called terms
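Both procedures compose into a single pipeline. A minimal sketch (my own, with a toy stoplist drawn from the examples above and the stemmer left as a plug-in point):

```python
# Toy stoplist taken from the examples above; real stoplists are larger
STOPLIST = {"to", "up", "from", "until", "the", "a", "by"}

def normalize(text, stem=lambda w: w):
    """Lowercase, drop stoplisted wordforms, and stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOPLIST]

# Identity stem for illustration; a Porter stemmer would be plugged in here
print(normalize("The connections from the connected servers"))
# ['connections', 'connected', 'servers']
```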
Porter’s Algorithm
for
Suffix Stripping
Martin Porter’s original paper is at http://tartarus.org/martin/PorterStemmer/def.txt
Source code in various languages is at http://tartarus.org/martin/PorterStemmer/
Suffix Stripping Approaches
● Use a stem list
● Use a suffix list
● Use a set of rules that match wordforms & remove
suffixes under specified conditions
Pros & Cons of Suffix Stripping
● Suffix stripping is done not for linguistic reasons
but to improve retrieval performance & storage
efficiency
● It is reasonable when wordform conflation does
not lose information (e.g., CONNECTOR &
CONNECTION)
● It does not seem reasonable when conflation is
lossy (e.g., RELATIVE & RELATIVITY are
conflated)
Pros & Cons of Suffix Stripping
● Suffix stripping is never 100% correct
● The same rule set conflates SAND and SANDER,
which is OK, but it also conflates WAND and
WANDER, which may not be OK
● With any set of rules there comes a point when
adding more rules actually worsens performance
● Exceptions are important but may not be worth
the trouble
Consonants & Vowels
● A consonant is a letter different from A, E, I,
O, U and different from Y when it is
preceded by a consonant
● Y is a consonant when it is preceded by A, E,
I, O, U: in TOY, Y is a consonant; in BY, it is
a vowel
● A vowel is not a consonant
Consonants & Vowels
● A consonant is denoted as c and a vowel as v
● A sequence of at least one consonant (e.g.,
c, cc, ccc, cccc, etc) is denoted as C
● A sequence of at least one vowel (e.g., v, vv,
vvv, etc.) is denoted as V
Porter’s Insight: Wordform Representation
● Any wordform can be represented as one of the four forms:
– CVCV … C
– CVCV … V
– VCVC … C
– VCVC … V
● These forms are condensed into one form: [C]VCVC … [V] (square brackets denote sequences of zero or more consonants or vowels)
● This form can be rewritten as [C](VC)^m[V], m ≥ 0
Porter’s Insight: Wordform Representation
● In the formula [C](VC)^m[V], m ≥ 0, m is called the measure of a word
● Examples:
● m = 0: TR, EE, TREE, Y, BY
● m = 1: TROUBLE, OATS, TREES, IVY
– TROUBLE: [C] TR; (VC) OUBL; [V] E
– OATS: [C] NULL; (VC) OATS; [V] NULL
– TREES: [C] TR; (VC) EES; [V] NULL
● m = 2: TROUBLES, PRIVATE
– TROUBLES: [C] TR; (VC)^2 = (OUBL)(ES); [V] NULL
– PRIVATE: [C] PR; (VC)^2 = (IV)(AT); [V] E
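The measure m can be computed directly from these definitions. Below is a small sketch of my own that classifies letters by the consonant/vowel rules above and counts (VC) groups:

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """A, E, I, O, U are vowels; Y is a vowel when preceded by a
    consonant (BY) and a consonant otherwise (TOY, or word-initial Y)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """m in [C](VC)^m[V]: the number of vowel-to-consonant transitions."""
    word = word.lower()
    m, prev_is_vowel = 0, False
    for i in range(len(word)):
        if is_consonant(word, i):
            if prev_is_vowel:
                m += 1          # a V...C boundary closes one (VC) group
            prev_is_vowel = False
        else:
            prev_is_vowel = True
    return m

for w in ["tree", "by", "trouble", "oats", "troubles", "private"]:
    print(w, measure(w))  # 0, 0, 1, 1, 2, 2
```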
Morphological Rules
● Suffix removal rules have the form
– (condition) S1 → S2
● If a wordform ends with suffix S1 and the stem before S1 satisfies the (optional) condition, then S1 is replaced with S2
● Example:
– (m > 1) EMENT → NULL
– S1 is EMENT; S2 is NULL
– This rule maps REPLACEMENT to REPLAC
Morphological Rules: Condition Specification
● Conditions can be specified as follows:
– (m > n), where n is a number
– *X – stem ends with the letter X
– *v* - stem contains a vowel
– *d – stem ends with a double consonant (e.g., -TT)
– *o – stem ends in cvc where the second c is not W, X, or Y
(e.g., -WIL, -HOP)
● Logical AND, OR, and NOT operators are also allowed:
– ((m > 1) AND (*S OR *T))
Length-Based Rule Matching
● If several rules match, the one with the longest S1 wins
● Consider this rule set with null conditions:
– SSES → SS
– IES → I
– SS → SS
– S → NULL
● Given this rule set, CARESSES → CARESS, because SSES is the longest match, and CARES → CARE
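A small sketch (my own) of longest-match rule application over this rule set:

```python
# (suffix S1, replacement S2) pairs with null conditions, per the slide
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_rules(word, rules):
    """Apply the rule whose suffix S1 is the longest match, if any."""
    for s1, s2 in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(s1):
            return word[:len(word) - len(s1)] + s2
    return word  # no rule applies: the word passes through unchanged

print(apply_rules("caresses", RULES))  # caress (SSES -> SS)
print(apply_rules("caress", RULES))    # caress (SS -> SS keeps it intact)
print(apply_rules("cares", RULES))     # care   (S -> NULL)
```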
Five Rule Sets
● In the original paper by M.F. Porter, there are eight sets
of rules: 1A, 1B, 1C, 2, 3, 4, 5A, 5B
● A wordform passes through each rule set one by one, starting from 1A and ending at 5B, in that order
● If no rule in a rule set is applicable, the wordform comes out unmodified

W → 1A → 1B → 1C → 2 → 3 → 4 → 5A → 5B → W′
Example
1A: S → NULL: GENERALIZATIONS → GENERALIZATION
2: (m>0) IZATION → IZE: GENERALIZATION → GENERALIZE
3: (m>0) ALIZE → AL: GENERALIZE → GENERAL
4: (m>0) AL → NULL: GENERAL → GENER
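This trace can be checked against an off-the-shelf implementation; for example, assuming the nltk package is available, its PorterStemmer class reproduces the final stem:

```python
# Assumes nltk is installed (pip install nltk); PorterStemmer needs no
# extra data downloads.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("generalizations"))  # gener, matching the trace above
```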
Term Weighting
Term Weighting in Documents
● Term weighting has a large influence on the
performance of IR systems
● In general, there are two design factors that bear on
term weighting:
– How important is a term within a given document?
– How important is a term within a given collection?
● A common measure of term importance within a single
document is its frequency in that document (this is
commonly referred to as term frequency – tf)
Term Weighting in Collections
● Terms that occur in every document or many
documents in a given collection are not useful as
document discriminators
● Terms that occur in relatively few documents in a
given collection are useful as document
discriminators
● Generally, collection-wide term weighting
approaches value terms that occur in relatively
few documents
Inverse Document Frequency
● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Then the inverse document frequency of $t_i$ is:

$idf(t_i, C) = \log \frac{N}{n_i}$
Example: IDF
The collection $C = \{T_1, T_2, T_3, T_4\}$ consists of four documents:
– T1 = “W1 W1 W2 W3”
– T2 = “W3 W3 W3 W3”
– T3 = “W2 W2 W2 W1 W3”
– T4 = “W3 W3 W3 W1 W1”

$idf(W_1, C) = \log \frac{4}{3}$, because N = 4 and $n_1 = 3$

$idf(W_2, C) = \log \frac{4}{2}$, because N = 4 and $n_2 = 2$

$idf(W_3, C) = \log \frac{4}{4} = 0$, because N = 4 and $n_3 = 4$
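These values are easy to verify in code. A sketch of my own over the toy collection:

```python
import math

# The four toy documents from the example, as lists of wordforms
C = {
    "T1": "W1 W1 W2 W3".split(),
    "T2": "W3 W3 W3 W3".split(),
    "T3": "W2 W2 W2 W1 W3".split(),
    "T4": "W3 W3 W3 W1 W1".split(),
}

def idf(term, docs):
    """log(N / n_i): N documents in total, n_i containing the term."""
    n_i = sum(1 for doc in docs.values() if term in doc)
    return math.log(len(docs) / n_i)

for t in ["W1", "W2", "W3"]:
    print(t, round(idf(t, C), 3))  # log(4/3), log(4/2), log(4/4) = 0
```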
TF*IDF: Combining Local and Global Weights
● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let $n_i$ be the number of documents in C that contain at least one occurrence of the i-th term $t_i$
● Let $tf(t_i, T_j, C)$ be the frequency of the term $t_i$ in the document $T_j$ of collection C
● Let $idf(t_i, C)$ be the inverse document frequency of the term $t_i$ in collection C
● Then the tfidf measure of $t_i$ in $T_j$ of C is:

$tfidf(t_i, T_j, C) = tf(t_i, T_j, C) \cdot idf(t_i, C)$
Example: TF*IDF
Using the same collection $C = \{T_1, T_2, T_3, T_4\}$ as above, with T1 = “W1 W1 W2 W3”:

$tfidf(W_1, T_1, C) = tf(W_1, T_1, C) \cdot idf(W_1, C) = 2 \cdot \log \frac{4}{3}$

$tfidf(W_2, T_1, C) = tf(W_2, T_1, C) \cdot idf(W_2, C) = 1 \cdot \log \frac{4}{2}$

$tfidf(W_3, T_1, C) = tf(W_3, T_1, C) \cdot idf(W_3, C) = 1 \cdot \log \frac{4}{4} = 0$
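Extending the same sketch with a term-frequency count reproduces the three products above:

```python
import math

C = {
    "T1": "W1 W1 W2 W3".split(),
    "T2": "W3 W3 W3 W3".split(),
    "T3": "W2 W2 W2 W1 W3".split(),
    "T4": "W3 W3 W3 W1 W1".split(),
}

def idf(term, docs):
    return math.log(len(docs) / sum(1 for d in docs.values() if term in d))

def tfidf(term, doc_name, docs):
    """tf(t, T, C) * idf(t, C): local frequency times global rarity."""
    return docs[doc_name].count(term) * idf(term, docs)

for t in ["W1", "W2", "W3"]:
    print(t, round(tfidf(t, "T1", C), 3))  # 2*log(4/3), 1*log(4/2), 0.0
```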
User Query Expansion
Improving User Queries
● Typically, we cannot change the content of the indexed documents: once a collection is indexed, we can add documents to it and remove documents from it, but we cannot change the weights of the terms within the vector space model
● What we can do is improve the user query
● But how? We can dynamically change the weights of the terms in
the user query to move it closer to the more relevant documents
● The standard method of doing it in the vector space model is
called relevance feedback
How Relevance Feedback Works
● The user types in a query
● The system retrieves a set of documents
● The user specifies whether each document is relevant or not to the query:
this can be done on every document in the retrieved set or a small subset of
documents
● The system dynamically increases the weights of the terms in the relevant
documents and decreases the weights of the terms in the non-relevant
documents
● In several iterations, the user query vector ends up being pushed closer to
the relevant documents and further from the non-relevant documents
Rocchio Relevance Feedback Formula
● Suppose that we have some document collection C
● Let $\vec{q}_i$ be the user query vector at the i-th iteration (i.e., $\vec{q}_0$ is the original user query vector)
● Let us assume that $R = \{\vec{r}_1, \ldots, \vec{r}_{|R|}\}$ is the set of relevant document vectors from C and $NR = \{\vec{nr}_1, \ldots, \vec{nr}_{|NR|}\}$ is the set of non-relevant document vectors from C
● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$
Rocchio Relevance Feedback Formula
● The query vector on the next iteration is:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \sum_{j=1}^{|R|} \vec{r}_j - \frac{\gamma}{|NR|} \sum_{k=1}^{|NR|} \vec{nr}_k$, where $\beta + \gamma = 1$

● Expanding the sums term by term:

$\vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{|R|} \vec{r}_1 + \cdots + \frac{\beta}{|R|} \vec{r}_{|R|} - \frac{\gamma}{|NR|} \vec{nr}_1 - \cdots - \frac{\gamma}{|NR|} \vec{nr}_{|NR|}$
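A sketch of a single Rocchio update (my own illustration; the beta and gamma values below are arbitrary choices satisfying beta + gamma = 1):

```python
def rocchio(q, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """One update: pull q toward the relevant vectors and push it away
    from the non-relevant ones (beta + gamma = 1, per the slide)."""
    q_next = list(q)
    for k in range(len(q)):
        q_next[k] += beta * sum(r[k] for r in relevant) / len(relevant)
        q_next[k] -= gamma * sum(nr[k] for nr in nonrelevant) / len(nonrelevant)
    return q_next

q0 = [1.0, 0.0, 1.0]
print(rocchio(q0, relevant=[[2, 1, 0]], nonrelevant=[[0, 1, 1]]))
# [2.5, 0.5, 0.75]: pulled toward the relevant document
```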
Thesaurus-Based Query Expansion
● Another commonly used strategy is to have a
thesaurus
● The thesaurus is used to expand the user query by
adding terms to it (e.g., synonyms or correlated
terms)
● Thesauri are typically collection dependent and do not generalize across different collections
Performance Evaluation
● There are two commonly used measures of relevance in
IR
● Recall = (number of relevant documents
retrieved)/(total number of relevant documents in
collection C)
● Precision = (number of relevant documents
retrieved)/(number of documents retrieved)
● Typically, recall and precision are inversely related: as
precision increases, recall drops and vice versa
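Both measures reduce to simple set arithmetic. A sketch of my own with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical result set: 2 of 3 retrieved docs are relevant,
# and the collection holds 4 relevant docs in total
p, r = precision_recall(retrieved=["T1", "T2", "T3"],
                        relevant=["T2", "T3", "T5", "T7"])
print(p, r)  # 0.666..., 0.5
```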
References
1. M. F. Porter. “An Algorithm for Suffix Stripping.” Program, 14(3), pp. 130-137, July 1980.
2. D. Jurafsky & J. H. Martin. Speech and Language Processing, Ch. 17.