C.Watters csci6403 1
Probabilistic Retrieval Model
Classification Problem
• For each query, assume:
  – R = set of relevant docs
  – NR = set of nonrelevant docs
• For each document, ask: what is the probability that it belongs to one set or the other?
• Retrieve dj if P(dj is rel) > P(dj is not rel)
Bayes Theorem
• Probability based on related occurrences
• So P(R|dj) is the probability that a doc is relevant given the document dj
• Ex. P(H|E) is the probability it is July (hypothesis H) given that it is hot (event E):

              P(E|H) * P(H)          (prob it's hot given it is July, times the prior)
  P(H|E) = ---------------------
           Σi P(E|Hi) * P(Hi)        (summed over all months: hot given Jan, hot given Feb, etc.)
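The month/weather example can be checked numerically. A minimal sketch, with made-up priors and likelihoods (the numbers are not from the slides, only the formula is):

```python
# Bayes: P(H|E) = P(E|H) * P(H) / sum_i P(E|H_i) * P(H_i)
months = ["Jan", "Apr", "Jul", "Oct"]        # simplified hypothesis space
p_month = {m: 0.25 for m in months}          # uniform prior P(H_i), assumed
p_hot_given = {"Jan": 0.01, "Apr": 0.20, "Jul": 0.80, "Oct": 0.15}  # P(E|H_i), assumed

# Denominator: total probability of the evidence "it is hot"
evidence = sum(p_hot_given[m] * p_month[m] for m in months)
p_july_given_hot = p_hot_given["Jul"] * p_month["Jul"] / evidence
print(round(p_july_given_hot, 3))  # 0.69: most hot days fall in July
```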
Assumption
• The distribution of the keywords of interest differs between the relevant docs and the nonrelevant docs
• Also known as the cluster hypothesis
• Getting visas for immigration to Australia and migration within the borders requires a two week entry permit….
• The long range migration pattern of geese, interestingly enough, does not include the southern Pacific ….
How to estimate these probabilities???
• Assume relevance depends only on query and document representation (keywords)
• Compute the odds of a given doc being relevant to a given query:

      P(dj rel to q)
  ----------------------
    P(dj not rel to q)
• Use this to rank documents
Similarity as Odds
• Sim(dj, q) = P(dj is rel) / P(dj is not rel)
• Using Bayes we get:
• Sim(dj, q) = [ P(dj|R) * P(R) ] / [ P(dj|NR) * P(NR) ]
Move from docs to terms
• Assume independence of terms
• P(ki|R) is the probability that a relevant doc contains the term ki
• Any term may also occur in NR docs; note that P(ki|R) + P(¬ki|R) = 1
• Sim(dj, q) ~ Σi wi,q * wi,j * ( log [ P(ki|R) / (1 - P(ki|R)) ] + log [ (1 - P(ki|NR)) / P(ki|NR) ] )
• This GIVES us a RANK
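A minimal sketch of scoring with this formula, using binary term weights; the vocabulary and probability estimates are hypothetical:

```python
import math

def bim_weight(p_rel, p_nonrel):
    """Term weight: log[p/(1-p)] + log[(1-q)/q], with p = P(ki|R), q = P(ki|NR)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def sim(doc_terms, query_terms, p_rel, p_nonrel):
    """Binary wi,j and wi,q: sum weights only for terms in both doc and query."""
    return sum(bim_weight(p_rel[t], p_nonrel[t]) for t in query_terms & doc_terms)

# Hypothetical estimates for two query terms
p_rel    = {"visa": 0.8, "migration": 0.6}
p_nonrel = {"visa": 0.1, "migration": 0.4}
print(sim({"visa", "migration", "geese"}, {"visa", "migration"}, p_rel, p_nonrel))
```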
OK now what?
• Work with keywords with weights 0 and 1
• Query is a set of keywords
• Doc is a set of keywords
• Need P(ki|R)
• Prob that a keyword occurs in one of the relevant docs
Getting Started
1. Assume P(ki|R) is constant over all ki:
   P(ki|R) = 0.5 (even odds) for any given doc
   Looking for terms that do not fit this!
2. Assume P(ki|NR) = ni / N
   i.e. based on the distribution of terms over the whole collection (ni = number of docs containing ki, N = collection size)
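With these starting estimates, the first log term of the similarity formula vanishes (log(0.5/0.5) = 0) and each term's weight reduces to log((N - ni)/ni), an IDF-like weight. A small sketch; the collection size is hypothetical:

```python
import math

N = 10000  # hypothetical collection size

def initial_weight(n_i, N=N):
    """Initial term weight under P(ki|R) = 0.5 and P(ki|NR) = ni/N:
    log(0.5/0.5) + log((1 - ni/N)/(ni/N)) = log((N - ni)/ni)."""
    p_nr = n_i / N
    return math.log(0.5 / 0.5) + math.log((1 - p_nr) / p_nr)

print(initial_weight(100))   # rare term: high weight
print(initial_weight(5000))  # term in half the docs: weight 0
```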
Finding P(ki)
1. First, retrieve an initial set of docs and take the retrieved set V as our guess at R
   Vi is the subset of V containing keyword ki
   Need to improve our guesses for P(ki|R) & P(ki|NR)
2. So use the distribution of ki in the docs in V:
   P(ki|R) = |Vi| / |V|
3. Assume that docs not retrieved are not relevant:
   P(ki|NR) = (ni - |Vi|) / (N - |V|)
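Steps 2 and 3 can be sketched directly. The counts below are hypothetical; in practice a 0.5 smoothing term is often added to avoid zero probabilities, omitted here to match the slide's formulas:

```python
def reestimate(Vi, V, ni, N):
    """One re-estimation step for a single term ki.
    Vi: retrieved docs containing ki; V: number of retrieved docs;
    ni: docs in the collection containing ki; N: collection size."""
    p_rel = Vi / V                 # P(ki|R)  ~ distribution of ki inside V
    p_nonrel = (ni - Vi) / (N - V) # P(ki|NR) ~ distribution of ki outside V
    return p_rel, p_nonrel

# Hypothetical: 20 docs retrieved out of N = 10000; the term occurs in
# 15 of the retrieved docs and in 100 docs overall.
print(reestimate(Vi=15, V=20, ni=100, N=10000))
```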
Now
• Use new probs to rerank docs
• And try again
• This can be done without human judgement BUT it helps to get real feedback at step 1
Good and Bad News
• Advantages
  – Ranking scheme
• Disadvantages
  – Making the initial guess to get Vi
  – Binary weights
  – Independence of terms
  – Computation
Relevance Feedback
• Problem
  – Queries average only 2.2 terms and have no (explicit) structure
• Example (relevance feedback)
• Manual reformulation
  – Add terms
  – Remove terms
  – Adjust the weights if possible
  – Add/remove operators
What can we do automatically?
• ????
• Change query based on documents retrieved
• Change query based on user preferences
• Change query based on user history
• Change query based on community of users
Hypothesis
• A better query can be discovered by analyzing the features in relevant and in nonrelevant items
Feedback and VSM
• Q0 = (q1, q2, …, qt), where qi is the weight of query term i
• Q0 generates a hit list H0
• Q’ = (q1’, q2’, …, qt’), where qi’ is the altered weight of query term i
• Add a term to the query by raising its weight to w > 0
• Drop a term by lowering its weight to w = 0
VSM View
• Move query vector in the t-dimensional term space from an area of lower density to an area of higher density of close documents
Optimal Query and VSM
• Given the dot-product similarity:
  Sim(Dj, Q) = Σi dij * qi
• The optimal query is then (Dj is the term vector of doc j):

  Qopt = (1/|R|) Σ_{Dj in R} Dj/|Dj|  -  (1/(N-|R|)) Σ_{Dj not in R} Dj/|Dj|

• |Dj| is the Euclidean vector length
Feedback from relevant Docs retrieved
• Keep the original query
• Replace the sums with sums over the known relevant and known nonrelevant docs:

  Q1 = Q0 + β Σ_{Dj known rel} Dj/|Dj|  -  γ Σ_{Dj known nonrel} Dj/|Dj|

  Qi+1 = α Qi + β Σ_{Dj known rel} Dj/|Dj|  -  γ Σ_{Dj known nonrel} Dj/|Dj|
Example
• Q’ = αQ + βR - γNR
• Q0 = (5, 0, 3, 0, 1)
• Relevant: D1 = (2, 1, 2, 0, 0)
• Nonrelevant: D2 = (1, 0, 0, 0, 2)
• α = 1, β = .5, γ = .25
• Q1 = (5,0,3,0,1) + .5(2,1,2,0,0) - .25(1,0,0,0,2)
     = (5.75, .5, 4, 0, .5)
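The arithmetic above can be checked directly:

```python
alpha, beta, gamma = 1.0, 0.5, 0.25
Q0 = (5, 0, 3, 0, 1)
D1 = (2, 1, 2, 0, 0)   # relevant doc
D2 = (1, 0, 0, 0, 2)   # nonrelevant doc

# Rocchio update, component by component: q' = alpha*q + beta*r - gamma*nr
Q1 = tuple(alpha * q + beta * r - gamma * nr for q, r, nr in zip(Q0, D1, D2))
print(Q1)  # (5.75, 0.5, 4.0, 0.0, 0.5)
```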
Variations
• Don’t normalize by the number of judged docs
• Use only the highest-ranked nonrelevant docs
  – Effective with few judged docs
• Rocchio: choose β and γ = 1 for many judged docs
• Expanding by all terms is effective; expanding by only the most highly weighted terms is not!
Relevance Feedback for Boolean
• Examine terms in relevant docs
• Discover conjuncts (t1 AND t2)
  – Phrase detection
  – Persistent co-occurrences ("box car")
• Discover disjuncts (t1 OR t3)
  – Thesaurus
  – Occasional co-occurrences ("auto", "car")
  – Terms that co-occur with the same neighbours ("auto & car", "car & sedan")
Relevance Feedback Summary
• Can be very effective
• Need a reasonable number of judged docs
  – Unpredictable results with < 5 judged docs
• Can be used with both VSM and Boolean
• Requires either direct input from users or monitoring (time, printing, saving, etc)