INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 6: Ranking
Paul Ginsparg
Cornell University, Ithaca, NY
13 Sep 2011
Administrativa
Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email, use [email protected]
Course text at http://informationretrieval.org/: Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze
see also Information Retrieval, S. Büttcher, C. Clarke, G. Cormack
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307
Administrativa
Reread assignment 1 instructions
The Midterm Examination is on Thu, Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. The topics to be examined are all the lectures and discussion class readings before the midterm break.
According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final examination is Wed 14 Dec 7:00-9:30 pm (location TBD)
Discussion 2, 20 Sep
For this class, read and be prepared to discuss the following:
K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
Letter by Stephen Robertson and reply by Karen Spärck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf
The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)
Overview
1 Recap
2 Zones
3 Why rank?
4 More on cosine
5 Implementation
Query Scores: S(q, d) = Σ_{t∈q} w^(idf)_t · w^(tf)_{t,d}   (ltn.lnn)

1. "A sentence is a document."
2. "A document is a sentence and a sentence is a document."
3. "This document is short."
4. "This document is a sentence."

tf_{t,d}    doc1  doc2  doc3  doc4
a            2     4     0     1
and          0     1     0     0
document     1     2     1     1
is           1     2     1     1
sentence     1     2     0     1
short        0     0     1     0
this         0     0     1     1

w^(tf)_{t,d}  doc1  doc2  doc3  doc4
a              1.3   1.6   0     1
and            0     1     0     0
document       1     1.3   1     1
is             1     1.3   1     1
sentence       1     1.3   0     1
short          0     0     1     0
this           0     0     1     1

            df   w^(idf)_t
a            3    .125
and          1    .6
document     4    0
is           4    0
sentence     2    .3
short        1    .6
this         2    .3

[ log(4/4) = 0, log(4/3) ≈ .125, log(4/2) ≈ .3, log(4/1) ≈ .6 ]

Query: "a sentence"
doc1: .125 · 1.3 + .3 · 1 = .46,   doc2: .125 · 1.6 + .3 · 1.3 = .59
doc3: .125 · 0 + .3 · 0 = 0,       doc4: .125 · 1 + .3 · 1 = .425

Query: "short sentence"
doc1: .6 · 0 + .3 · 1 = .3,   doc2: .6 · 0 + .3 · 1.3 = .39
doc3: .6 · 1 + .3 · 0 = .6,   doc4: .6 · 0 + .3 · 1 = .3
Query Scores: S(q, d) = Σ_{t∈q} w^(idf)_t · w^(tf)_{t,d}   (ltn.lnc)

1. "A sentence is a document."
2. "A document is a sentence and a sentence is a document."
3. "This document is short."
4. "This document is a sentence."

w^(tf)_{t,d}  doc1  doc2  doc3  doc4
a              1.3   1.6   0     1
and            0     1     0     0
document       1     1.3   1     1
is             1     1.3   1     1
sentence       1     1.3   0     1
short          0     0     1     0
this           0     0     1     1

after cosine normalization:

w^(tf)_{t,d}  doc1  doc2  doc3  doc4
a              .60   .54   0     .45
and            0     .34   0     0
document       .46   .44   .5    .45
is             .46   .44   .5    .45
sentence       .46   .44   0     .45
short          0     0     .5    0
this           0     0     .5    .45

            df   w^(idf)_t
a            3    .125
and          1    .6
document     4    0
is           4    0
sentence     2    .3
short        1    .6
this         2    .3

lengths(doc1, . . ., doc4) = (2.17, 2.94, 2, 2.24)

Query: "a sentence"
doc1: .125 · .60 + .3 · .46 = .21,   doc2: .125 · .54 + .3 · .44 = .20
doc3: .125 · 0 + .3 · 0 = 0,         doc4: .125 · .45 + .3 · .45 = .19

Query: "short sentence"
doc1: .6 · 0 + .3 · .46 = .14,   doc2: .6 · 0 + .3 · .44 = .133
doc3: .6 · .5 + .3 · 0 = .3,     doc4: .6 · 0 + .3 · .45 = .134
Cosine similarity between query and document

cos(q, d) = sim(q, d) = (q / |q|) · (d / |d|) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight (idf) of term i in the query.
d_i is the tf-idf weight (tf) of term i in the document.
|q| and |d| are the lengths of q and d.
q/|q| and d/|d| are length-1 vectors (= normalized).
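In code the formula is a direct translation; a minimal sketch over dense term-weight vectors (illustrative, not tied to any particular weighting scheme):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| |d|) for equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Vectors pointing in the same direction score 1, orthogonal ones 0:
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # ~1.0
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
```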
Cosine similarity illustrated
[Figure: the query vector v(q) and document vectors v(d1), v(d2), v(d3) drawn in the unit square of a two-term space with axes "rich" and "poor"; the angle θ between v(q) and a document vector determines the ranking.]
Variant tf-idf functions
We've considered sublinear tf scaling: wf_{t,d} = 1 + log tf_{t,d}

Or normalize instead by the maximum tf in the document, tf_max(d):

ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d)

where a ∈ [0, 1] (e.g., .4) is a smoothing term that avoids large swings in ntf due to small changes in tf.

This eliminates the repeated-content problem (d′ = d + d), but has other issues:
sensitive to changes in the stop word list
outlier terms with large tf
a skewed distribution in which many terms occur nearly as often as the most frequent term
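A sketch of the maximum-tf normalization above (parameter names are mine), showing why doubling a document leaves the weight unchanged:

```python
def ntf(tf, tf_max, a=0.4):
    """Maximum-tf normalization: a + (1 - a) * tf / tf_max.
    The smoothing term a damps swings in ntf caused by small tf changes."""
    return a + (1 - a) * tf / tf_max

# d' = d + d doubles every tf and also tf_max, so the ratio tf/tf_max,
# and hence ntf, is unchanged:
print(ntf(3, 10))
print(ntf(6, 20))  # same value as the line above
```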
Components of tf.idf weighting
Term frequency:
  n (natural)     tf_{t,d}
  l (logarithm)   1 + log(tf_{t,d})
  a (augmented)   0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)     1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)     (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)          1
  t (idf)         log(N / df_t)
  p (prob idf)    max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)        1
  c (cosine)      1 / √(w_1² + w_2² + . . . + w_M²)
  u (pivoted unique)  1/u
  b (byte size)   1/CharLength^α, α < 1
Best known combination of weighting options
Default: no weighting
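The table's components can be sketched in Python (base-10 logs and the function names are my choices; the 'p' variant is clipped at 0 as in the table, which also sidesteps the undefined log when df_t ≥ N/2):

```python
import math

def tf_natural(tf):                     # n
    return tf

def tf_log(tf):                         # l
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_augmented(tf, tf_max, a=0.5):    # a
    return a + (1 - a) * tf / tf_max

def tf_boolean(tf):                     # b
    return 1 if tf > 0 else 0

def idf(N, df):                         # t
    return math.log10(N / df)

def prob_idf(N, df):                    # p
    if 2 * df >= N:  # log argument <= 1, so max{0, ...} clips to 0
        return 0.0
    return math.log10((N - df) / df)
```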
tf.idf example
We often use different weightings for queries and documents.
Notation: qqq.ddd (term frequency / document frequency / normalization) for (query.document)
Example: ltn.lnc
query: logarithmic tf, idf, no normalization
document: logarithmic tf, no df weighting, cosine normalization
Isn’t it bad to not idf-weight the document?
Example query: “best car insurance”
Example document: “car insurance auto insurance”
tf.idf example: ltn.lnc
Query: "best car insurance". Document: "car insurance auto insurance".

word      |  query                               |  document                         |  product
          |  tf-raw  tf-wght  df     idf  weight |  tf-raw  tf-wght  weight  n'lized|
auto      |  0       0        5000   2.3  0      |  1       1        1       0.52   |  0
best      |  1       1        50000  1.3  1.3    |  0       0        0       0      |  0
car       |  1       1        10000  2.0  2.0    |  1       1        1       0.52   |  1.04
insurance |  1       1        1000   3.0  3.0    |  2       1.3      1.3     0.68   |  2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: √(1² + 0² + 1² + 1.3²) ≈ 1.92
Normalization: 1/1.92 ≈ 0.52, 1.3/1.92 ≈ 0.68

Final similarity score between query and document:
Σ_i w_qi · w_di = 0 + 0 + 1.04 + 2.04 = 3.08
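The slide's numbers can be reproduced directly (N = 1,000,000 is implied by the df and idf columns; the result lands near 3.08, with small differences from the slide's intermediate rounding):

```python
import math

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}  # "car insurance auto insurance"

def wtf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side (ltn): logarithmic tf times idf, no normalization.
q_weight = {t: wtf(f) * math.log10(N / df[t]) for t, f in query_tf.items()}

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
d_raw = {t: wtf(f) for t, f in doc_tf.items()}
length = math.sqrt(sum(w * w for w in d_raw.values()))
d_weight = {t: w / length for t, w in d_raw.items()}

score = sum(q_weight[t] * d_weight.get(t, 0.0) for t in q_weight)
print(round(score, 2))  # close to the slide's 3.08
```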
Parametric and Zone indices
Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, . . .)
One parametric index for each field.
Fields: take finite set of values (e.g., dates of authorship)
Zones: arbitrary free text (e.g., titles, abstracts)
Permits searching for documents by Shakespeare written in 1601 containing the phrase "alas poor Yorick", or finding documents with "merchant" in the title, "william" in the author list, and the phrase "gentle rain" in the body.
Use separate indexes for each field and zone, or encode the zone in the dictionary terms:
william.abstract, william.title, william.author
Permits weighted zone scoring
Weighted Zone Scoring
Given a Boolean query q and a document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores.

Let g1, . . . , gℓ ∈ [0, 1] such that Σ_{i=1}^{ℓ} gi = 1.
For 1 ≤ i ≤ ℓ, let si be the score between q and the i-th zone.
Then the weighted zone score is defined as Σ_{i=1}^{ℓ} gi si.

Example: three zones: author, title, body;
g1 = .2, g2 = .5, g3 = .3 (match in author zone least important)

Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a Boolean AND query, now compute a score for each document.
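A minimal sketch with the example weights above (the zone names and Boolean per-zone scores follow the slide):

```python
g = {"author": 0.2, "title": 0.5, "body": 0.3}  # zone weights, summing to 1

def weighted_zone_score(zone_matches):
    """zone_matches maps zone -> 1 if the query matches in that zone, else 0;
    the score is the weighted sum of the per-zone Boolean scores."""
    return sum(g[zone] * s for zone, s in zone_matches.items())

# Match in title and body but not in the author zone:
print(weighted_zone_score({"author": 0, "title": 1, "body": 1}))  # 0.8
```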
Learning Weights
How to determine the weights gi for weighted zone scoring?
A. specified by expert
B. "learned" using training examples that have been judged editorially (machine-learned relevance)
1. given a set of training examples [(q, d) plus a relevance judgment (e.g., yes/no)]
2. set the weights gi to best approximate the relevance judgments
Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web).
Or use “passive collaborative feedback”? (clickthrough data)
Machine Learned Relevance
Given a table of Boolean matches sT(d, q), sB(d, q) (title and body zones), and relevance judgments r(d, q) (also, e.g., binary) of document d relevant to query q (see fig. 6.5 in text), compute a score for each of the training examples:

score(d, q) = g · sT(d, q) + (1 − g) · sB(d, q)

and compare with r(d, q) using an error function for each training example Φj:

ε(g, Φj) = ( r(dj, qj) − score(dj, qj) )²

Choose g to minimize the total error Σj ε(g, Φj) (a quadratic function of g, so elementary algebra in this case; more generally a sophisticated optimization problem).
Why is ranking so important?
Two lectures ago: Problems with unranked retrieval
Users want to look at a few results, not thousands.
It's very hard to write queries that produce a few results, even for expert searchers.
→ Ranking is important because it effectively reduces a large set of results to a very small one.
Next: More data on “users only look at a few results”
Actually, in the vast majority of cases they only look at 1, 2,or 3 results.
Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Observe what searchers do when they are searching in a controlled setting:
Videotape them
Ask them to "think aloud"
Interview them
Eye-track them
Time them
Record and count their clicks
The following slides are from Dan Russell’s JCDL talk 2007
Dan Russell is the "Uber Tech Lead for Search Quality & User Happiness" at Google.
Interview video
So . . . Did you notice the FTD official site?
To be honest I didn’t even look at that.
At first I saw “from $20” and $20 is what I was looking for.
To be honest, 1800-flowers is what I'm familiar with and why I went there next even though I kind of assumed they wouldn't have $20 flowers.
And you knew they were expensive?
I knew they were expensive but I thought "hey, maybe they've got some flowers for under $20 here . . ."
But you didn’t notice the FTD?
No I didn’t, actually. . . that’s really funny.
Local work
Granka, L., Joachims, T., and Gay, G. (2004), "Eye-Tracking Analysis of User Behavior in WWW Search", Proceedings of the 28th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR '04)
http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf
Out of date?
Use of top and right margins
“instant” results
only for high bandwidth users?
mobile devices
Importance of ranking: Summary
Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
Clicking: Distribution is even more skewed for clicking
In 1 out of 2 cases, users click on the top-ranked page.
Even if the top-ranked page is not relevant, 30% of users will click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.
A problem for cosine normalization
Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents
d1: a short document on anti-doping rules at the 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics
What ranking do we expect in the vector space model?
d2 is likely to be ranked below d3 . . .
. . . but d2 is more relevant than d3.
What can we do about this?
Pivot normalization
Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by a linear adjustment: "turning" the average normalization on the pivot.
Effect: Similarities of short documents with query decrease;similarities of long documents with query increase.
This removes the unfair advantage that short documents have.
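The slide gives no formula; one common form of this linear adjustment (pivoted length normalization, with an assumed slope parameter) replaces the cosine length |d| by a blend that crosses the old normalization at the pivot:

```python
def pivoted_norm(doc_length, pivot, slope=0.75):
    """Linear blend of pivot and true length; slope < 1 'turns' the
    normalization line about the pivot (slope value is illustrative)."""
    return (1.0 - slope) * pivot + slope * doc_length

pivot = 10.0
print(pivoted_norm(5.0, pivot))   # > 5: short docs get a larger divisor
print(pivoted_norm(20.0, pivot))  # < 20: long docs get a smaller divisor
print(pivoted_norm(10.0, pivot))  # at the pivot, nothing changes
```

Dividing term weights by this blended length instead of |d| decreases short-document similarities and increases long-document ones, as described above.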
Predicted and true probability of relevance
source: Lillian Lee
Pivot normalization
source: Lillian Lee
Now we also need term frequencies in the index
Brutus    → (1,2) (7,3) (83,1) (87,2) . . .
Caesar    → (1,1) (5,1) (13,1) (17,1) . . .
Calpurnia → (7,1) (8,2) (40,1) (97,3)

(each posting stores docID together with the term frequency)
We also need positions. Not shown here.
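A minimal in-memory version of such an index (the document IDs and text are illustrative):

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a postings list of (docID, tf) pairs,
    with docIDs ascending (dicts preserve insertion order)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, freq in sorted(Counter(text.split()).items()):
            index[term].append((doc_id, freq))
    return index

docs = {1: "brutus and caesar", 7: "brutus brutus brutus calpurnia"}
index = build_index(docs)
print(index["brutus"])  # [(1, 1), (7, 3)]
```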
Term frequencies in the inverted index
In each posting, store tft,d in addition to docID d
As an integer frequency, not as a (log-)weighted real number. . .
. . . because real numbers are difficult to compress.
Unary code is effective for encoding term frequencies.
Why?
Overall, additional space requirements are small: much less than a byte per posting.
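Why unary is effective: term frequencies are heavily skewed toward small values (most are 1 or 2), and unary gives the smallest values the shortest codes. A sketch using one common convention, n ones followed by a terminating zero (the exact convention is an assumption here):

```python
def unary_encode(n):
    """Unary code of n: n one-bits followed by a terminating zero."""
    return "1" * n + "0"

def unary_decode(bits):
    """Split a concatenated unary bit string back into integers."""
    values, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:
            values.append(run)
            run = 0
    return values

code = "".join(unary_encode(tf) for tf in [1, 3, 1, 2])
print(code)  # 10111010110
```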
How do we compute the top k in ranking?
In many applications, we don’t need a complete ranking.
We just need the top k for a small k (e.g., k = 100).
If we don't need a complete ranking, is there an efficient way of computing just the top k?
Naive:
Compute scores for all N documents
Sort
Return the top k
What’s bad about this?
Alternative?
Use min heap for selecting top k out of N
Use a binary min heap
A binary min heap is a binary tree in which each node's value is less than the values of its children.
Takes O(N log k) operations to construct (where N is the number of documents) . . .
. . . then read off k winners in O(k log k) steps
Essentially linear in N for small k and large N.
Binary min heap
            0.6
          /     \
      0.85       0.7
      /  \       /  \
    0.9  0.97  0.8  0.95
Selecting top k scoring documents in O(N log k)
Goal: Keep the top k documents seen so far
Use a binary min heap
To process a new document d′ with score s′:
Get current minimum hm of heap (O(1))
If s′ ≤ hm, skip to next document
If s′ > hm, heap-delete-root (O(log k))
Heap-add d′/s′ (O(log k))
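With Python's heapq, the whole procedure is a few lines (a sketch; heapreplace performs the delete-root and add in one O(log k) step):

```python
import heapq

def top_k(scored_docs, k):
    """Return the k highest-scoring (score, doc_id) pairs, best first,
    in O(N log k) time using a size-k min heap."""
    heap = []
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:              # beats the current minimum
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

scores = [(0.3, "d1"), (0.9, "d2"), (0.1, "d3"), (0.7, "d4"), (0.5, "d5")]
print(top_k(scores, 3))  # [(0.9, 'd2'), (0.7, 'd4'), (0.5, 'd5')]
```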
Even more efficient computation of top k?
Ranking has time complexity O(N), where N is the number of documents.
Optimizations reduce the constant factor, but they are still O(N), and 10^10 < N < 10^11!
Are there sublinear algorithms?
Ideas?
What we're doing in effect: solving the k-nearest neighbor (kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are sublinear.
We will revisit when we do kNN classification
Cluster pruning
Cluster docs in a preprocessing step:
Pick √N "leaders"
For each non-leader, find its nearest leader (expect about √N followers per leader)
For query q, find the closest leader L (√N computations)
Rank L and its followers
Or generalize: attach each document to its b1 closest leaders, and at query time rank the followers of the b2 leaders closest to the query
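A sketch of the b1 = b2 = 1 case (leaders are picked at random for simplicity; the similarity function is passed in, and all names are mine):

```python
import math
import random

def cluster_prune_search(docs, query, sim):
    """Pick ~sqrt(N) random leaders, attach every doc to its nearest
    leader, then score only the best leader's followers for the query."""
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = random.sample(range(len(docs)), n_leaders)
    followers = {L: [] for L in leaders}
    for i, d in enumerate(docs):
        nearest = max(leaders, key=lambda L: sim(d, docs[L]))
        followers[nearest].append(i)
    best_leader = max(leaders, key=lambda L: sim(query, docs[L]))
    return max(followers[best_leader], key=lambda i: sim(query, docs[i]))

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
docs = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.5, 0.5)]
print(cluster_prune_search(docs, (1.0, 0.0), dot))  # doc 0 matches exactly
```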
Even more efficient computation of top k
Idea 1: Reorder postings lists
Instead of ordering according to docID . . .. . . order according to some measure of “expected relevance”.
Idea 2: Heuristics to prune the search space
Not guaranteed to be correct . . .
. . . but fails rarely.
In practice, close to constant time.
For this, we'll need the concepts of document-at-a-time processing and term-at-a-time processing.
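Term-at-a-time processing, mentioned in the last point, walks one query term's postings list at a time and accumulates partial scores per document. A minimal sketch (postings here store (docID, precomputed term weight), an illustrative choice):

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    """Accumulate per-document partial scores term by term, then sort."""
    accumulators = defaultdict(float)
    for t in query_terms:
        for doc_id, w in postings.get(t, []):
            accumulators[doc_id] += w
    return sorted(accumulators.items(), key=lambda kv: -kv[1])

postings = {"car": [(1, 2.0), (2, 1.0)], "insurance": [(2, 3.0)]}
print(term_at_a_time(["car", "insurance"], postings))
# doc 2 accumulates 1.0 + 3.0 = 4.0 and ranks first
```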