Basic IR: Modeling
Basic IR Task: Match a subset of documents to the user’s query.
Slightly more complex: also rank the resulting documents by predicted relevance.
How relevance is derived leads to different IR models.

Concepts: Term-Document Incidence
Imagine a terms × documents matrix with 1 where the term appears in the document and 0 otherwise.
How are queries satisfied? What problems arise?
(The excerpt shows documents as rows and terms as columns.)
       search  segment  select  semantic  …
MIR    1       0        1       1
AI     1       1        0       1
…
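
A minimal sketch of how a conjunctive query could be answered from the incidence excerpt above. The function name `docs_matching_all` is made up for illustration, and only the two rows visible on the slide are used.

```python
# Term-document incidence data copied from the slide excerpt:
# rows are documents, columns are terms.
terms = ["search", "segment", "select", "semantic"]
incidence = {
    "MIR": [1, 0, 1, 1],
    "AI":  [1, 1, 0, 1],
}

def docs_matching_all(query_terms):
    """Return documents whose incidence entry is 1 for every query term."""
    cols = [terms.index(t) for t in query_terms]
    return [doc for doc, row in incidence.items() if all(row[c] for c in cols)]

print(docs_matching_all(["search", "semantic"]))  # ['MIR', 'AI']
print(docs_matching_all(["segment", "select"]))   # []
```

Every matching document is returned with equal standing, which hints at one of the problems the slide asks about: incidence alone gives no basis for ranking.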

Concepts: Term Frequency
To support document ranking, we need more than just term incidence.
Term frequency records the number of times a given term appears in each document.
Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
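
As a rough illustration, term frequency can be collected with a counter. The tokenization here (lowercasing and splitting on whitespace) is an assumption; the slides do not say how terms are extracted.

```python
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

tf = term_frequencies("search engines rank search results")
print(tf["search"])    # 2
print(tf["rank"])      # 1
print(tf["semantic"])  # 0 (a Counter returns 0 for absent terms)
```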

Concept: Term Weight
Weights represent the importance of a given term for characterizing a document.
wij is the weight for term i in document j.

Mapping Task and Document Type to Model
                        Index Terms   Full Text         Full Text + Structure
Searching (Retrieval)   Classic       Classic           Structured
Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext

IR Models (from MIR text)
User Task
  Retrieval: Adhoc, Filtering
    Classic Models: boolean, vector, probabilistic
      Set Theoretic: Fuzzy, Extended Boolean
      Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
      Probabilistic: Inference Network, Belief Network
    Structured Models: Non-Overlapping Lists, Proximal Nodes
  Browsing: Flat, Structure Guided, Hypertext

Classic Models: Basic Concepts
ki is an index term
dj is a document
t is the total number of index terms
K = (k1, k2, …, kt) is the set of all index terms
wij >= 0 is a weight associated with (ki, dj)
wij = 0 indicates that the term does not belong to the doc
vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)

Classic: Boolean Model
Based on set theory: map queries with Boolean operations to set operations.
Select documents from the term-document incidence matrix.
Pros:
Cons:

Exact Matching Ignores…
term frequency in document
term scarcity in corpus
size of document
ranking
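
The mapping from Boolean operators to set operations can be sketched as follows. The toy corpus and the helper `postings` are hypothetical, chosen only to make the three operators visible.

```python
# Hypothetical toy corpus: document id -> set of index terms it contains.
docs = {
    "d1": {"search", "select", "semantic"},
    "d2": {"search", "segment", "semantic"},
    "d3": {"segment"},
}

def postings(term):
    """Set of documents containing the term (one line of the incidence matrix)."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

# AND -> intersection, OR -> union, NOT -> complement against the corpus.
q_and = postings("search") & postings("segment")
q_or = postings("select") | postings("segment")
q_not = postings("semantic") & (all_docs - postings("select"))

print(sorted(q_and))  # ['d2']
print(sorted(q_or))   # ['d1', 'd2', 'd3']
print(sorted(q_not))  # ['d2']
```

Each result is an unordered set: a document either matches or it does not, so frequency, rarity, and document size never enter the answer.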

Vector Model
Vector of term weights based on term frequency.
Compute similarity between query and document, where both are vectors:
  vec(dj) = (w1j, w2j, ..., wtj)
  vec(q) = (w1q, w2q, ..., wtq)
Similarity is the cosine of the angle between the vectors.
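
Cosine Measure (from MIR notes)
[Figure: vectors dj and q separated by the angle θ]
$$
sim(d_j, q) = \cos(\theta) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
$$
Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1.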
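
A small sketch of the cosine measure over plain Python lists. Returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

# Same direction, different length: similarity is 1 up to floating-point error.
print(cosine_similarity([1, 0, 1], [2, 0, 2]))
# No shared terms: similarity is 0.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))
```

Because the vector lengths are divided out, a long document is not favored merely for containing more words.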

How to Set Wij Weights? TF-IDF
Within document: Term Frequency
  tf measures term density within a document.
Across documents: Inverse Document Frequency
  idf measures the informativeness or rarity of a term across the corpus.
$$
idf_i = \log\left(\frac{n}{df_i}\right)
$$
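
To get a feel for the idf formula, the sketch below evaluates it for a term found in every document, in a tenth of them, and in a single one. The corpus size is hypothetical, and the natural logarithm is an assumption; another base only rescales all values by a constant.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency log(n / df_i), using the natural log."""
    return math.log(n_docs / doc_freq)

n = 1000  # hypothetical corpus size
for df in (1000, 100, 1):
    print(df, round(idf(n, df), 2))
```

A term that occurs in every document gets idf 0 and so contributes nothing to distinguishing documents.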

TF * IDF Computation
$$
w_{i,d} = tf_{i,d} \times \log(n / df_i)
$$
$tf_{i,d}$ = frequency of term i in document d
$n$ = total number of documents
$df_i$ = the number of documents that contain term i
What happens as the number of occurrences in a document increases?
What happens as a term becomes more rare?
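
The two questions above can be explored numerically. This is only a sketch with made-up counts (`n = 100` and the df values are hypothetical), again assuming the natural logarithm.

```python
import math

def tf_idf(tf, n_docs, doc_freq):
    """w_{i,d} = tf_{i,d} * log(n / df_i)."""
    return tf * math.log(n_docs / doc_freq)

n = 100  # hypothetical corpus size

# More occurrences in the document (df fixed at 10).
by_tf = [tf_idf(tf, n, 10) for tf in (1, 2, 4)]

# Rarer term across the corpus (tf fixed at 1).
by_rarity = [tf_idf(1, n, df) for df in (50, 10, 1)]

print([round(w, 2) for w in by_tf])
print([round(w, 2) for w in by_rarity])
```

Both lists increase: the weight grows linearly with the number of occurrences and logarithmically as the term becomes rarer.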

TF * IDF
TF may be normalized:
  tf(i,d) = freq(i,d) / max(freq(l,d)), where the maximum is taken over all terms l in document d
IDF is computed:
  normalized to the size of the corpus
  as a log, to make TF and IDF values comparable
IDF requires a static corpus.
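
A short sketch of the max-normalization of TF, assuming a non-empty dictionary of raw counts for one hypothetical document.

```python
def normalized_tf(freqs):
    """Divide each raw count by the largest count in the same document."""
    max_freq = max(freqs.values())  # assumes the document has at least one term
    return {term: count / max_freq for term, count in freqs.items()}

# Raw counts for one hypothetical document.
print(normalized_tf({"search": 4, "rank": 2, "index": 1}))
# {'search': 1.0, 'rank': 0.5, 'index': 0.25}
```

The most frequent term always gets tf = 1, so long and short documents end up on the same scale.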

How to Set Wi,q Weights?
1. Create the vector directly from the query
2. Use modified tf-idf:
$$
W_{i,q} = \left(0.5 + \frac{0.5 \times freq(i,q)}{\max_{l}(freq(l,q))}\right) \times \log\left(\frac{n}{df_i}\right)
$$
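
The modified tf-idf for query terms, written as a function. The argument names and the sample numbers are illustrative assumptions; the natural logarithm is assumed as well.

```python
import math

def query_weight(freq, max_freq, n_docs, doc_freq):
    """W_{i,q} = (0.5 + 0.5 * freq / max_freq) * log(n / df_i)."""
    return (0.5 + 0.5 * freq / max_freq) * math.log(n_docs / doc_freq)

# Hypothetical query term: occurs once, the most frequent query term occurs
# twice, and the term appears in 10 of 100 documents.
print(round(query_weight(1, 2, 100, 10), 2))  # 1.73
```

The 0.5 offset keeps the tf factor between 0.5 and 1 for terms that occur in the query, so in short queries the idf part dominates the weight.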

The Vector Model: Example (from MIR notes)
[Figure: documents d1 to d7 and index terms k1, k2, k3]
Term counts per document and for the query q:
      k1  k2  k3
d1    2   0   1
d2    1   0   0
d3    0   1   3
d4    2   0   0
d5    1   2   4
d6    1   2   0
d7    0   5   0
q     1   2   3

The Vector Model: Example (cont.) (from MIR notes)
1. Compute the TF-IDF vector for each document (log is the natural logarithm)
For the first document:
  K1: (2/2) * log(7/5) = .34
  K2: 0 * log(7/4) = 0
  K3: (1/2) * log(7/3) = .42
For the rest: [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]
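
Step 1 can be reproduced with a few lines of Python. This is a sketch that assumes max-normalized TF and the natural logarithm, which is what the figures on the slide imply; the variable names are made up here.

```python
import math

# Raw term counts (k1, k2, k3) from the example table.
counts = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
n = len(counts)

# df_i: number of documents in which term i occurs at least once.
df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(3)]
idf = [math.log(n / d) for d in df]  # natural log (assumption, see lead-in)

def doc_vector(row):
    """Max-normalized tf times idf for every term of one document."""
    max_freq = max(row)
    return [freq / max_freq * w for freq, w in zip(row, idf)]

for name, row in counts.items():
    print(name, [round(w, 2) for w in doc_vector(row)])
# first line printed: d1 [0.34, 0.0, 0.42]
```

The document frequencies come out as 5, 4, and 3, matching the log(7/5), log(7/4), and log(7/3) factors in the hand calculation.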

The Vector Model: Example (cont.)
2. Compute the TF-IDF for the query [1 2 3]:
  K1: (.5 + (.5 * 1)/3) * log(7/5)
  K2: (.5 + (.5 * 2)/3) * log(7/4)
  K3: (.5 + (.5 * 3)/3) * log(7/3)
which is: [.22 .47 .85]
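
The query vector from step 2 can be checked the same way. The document frequencies (5, 4, 3) and corpus size (7) are taken from the example; the natural logarithm is assumed.

```python
import math

query_counts = [1, 2, 3]  # occurrences of k1, k2, k3 in the query
n, df = 7, [5, 4, 3]      # corpus size and document frequencies from the example

max_freq = max(query_counts)
query_vector = [
    (0.5 + 0.5 * freq / max_freq) * math.log(n / d)
    for freq, d in zip(query_counts, df)
]
print([round(w, 2) for w in query_vector])  # [0.22, 0.47, 0.85]
```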

The Vector Model: Example (cont.)
3. Compute the Sim for each document:
D1:
  D1*q = (.34 * .22) + (0 * .47) + (.42 * .85) = .43
  |D1| = sqrt((.34^2) + (.42^2)) = .54
  |q| = sqrt((.22^2) + (.47^2) + (.85^2)) = 1.0
  sim = .81 when unrounded weights are carried through (.43 / (.54 * 1.0) gives .80 with the rounded figures)
D2: .23 D3: .93 D4: .23 D5: .97 D6: .51 D7: .47
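
Putting the three steps together gives the full ranking. This is a sketch under the same assumptions as the hand calculation (max-normalized TF for documents, the 0.5-offset formula for the query, natural logarithm); all function names are illustrative.

```python
import math

counts = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
query_counts = [1, 2, 3]

n = len(counts)
df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(3)]
idf = [math.log(n / d) for d in df]

def doc_vector(row):
    """Document weights: max-normalized tf times idf."""
    max_freq = max(row)
    return [f / max_freq * w for f, w in zip(row, idf)]

def query_vector(row):
    """Query weights: (0.5 + 0.5 * normalized tf) times idf."""
    max_freq = max(row)
    return [(0.5 + 0.5 * f / max_freq) * w for f, w in zip(row, idf)]

def cosine(a, b):
    """Cosine similarity; both vectors are non-zero in this example."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

q = query_vector(query_counts)
scores = {name: cosine(doc_vector(row), q) for name, row in counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
```

d5 comes out on top, followed by d3 and d1. d2 and d4 have identical weight vectors, so they necessarily receive the same score.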

Vector Model Implementation Issues
Sparse term × document matrix.
Store term count, term weight, or weight scaled by idfi?
What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
How to efficiently compute cosine for a large index?

Heuristics for Computing Cosine for Large Index
Select from only non-zero cosines
  Focus on non-zero cosines for rare (high idf) words
Pre-compute document adjacency
  for each term, pre-compute the k nearest docs
  for a t-term query, compute cosines from the query to the union of the t pre-computed lists, then choose the top k
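
One way to realize the first heuristic is to walk an inverted index so that only documents sharing a term with the query are ever scored. The sketch below is an assumption about how this might look, not the slides' implementation; the weights are illustrative and `top_k` is a made-up name.

```python
import heapq
import math
from collections import defaultdict

# Illustrative precomputed tf-idf weights: document -> {term: weight}.
doc_weights = {
    "d1": {"k1": 0.34, "k3": 0.42},
    "d2": {"k1": 0.34},
    "d3": {"k2": 0.19, "k3": 0.85},
    "d7": {"k2": 0.56},
}

# Inverted index: term -> list of (document, weight) postings.
index = defaultdict(list)
for doc, weights in doc_weights.items():
    for term, w in weights.items():
        index[term].append((doc, w))

# Document vector lengths, computed once at index time.
norms = {doc: math.sqrt(sum(w * w for w in ws.values()))
         for doc, ws in doc_weights.items()}

def top_k(query_weights, k):
    """Score only documents that share a term with the query; keep the best k."""
    dots = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in index.get(term, []):
            dots[doc] += wd * wq
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = {doc: dot / (norms[doc] * q_norm) for doc, dot in dots.items()}
    return heapq.nlargest(k, scores, key=scores.get)

print(top_k({"k3": 0.85}, 2))  # ['d3', 'd1']
```

Documents d2 and d7 are never touched for this query because they have no posting under k3, so their cosine would be zero anyway.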

The TFIDF Vector Model: Pros/Cons
Pros:
  term-weighting improves quality
  cosine ranking formula sorts documents according to degree of similarity to the query
Cons:
  assumes independence of index terms