Information Retrieval
CSE 8337
Spring 2005
Query Operations
Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
Prof. Raymond J. Mooney in CS378 at University of Texas
Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, 1983, McGraw-Hill.
Automatic Text Processing, by Gerard Salton, Addison-Wesley, 1989.
CSE 8337 Spring 2005 2
Query Operations TOC
Introduction
Relevance Feedback
  Query Expansion
  Term Reweighting
Automatic Local Analysis
  Query Expansion using Clustering
Automatic Global Analysis
  Query Expansion using Thesaurus
  Similarity Thesaurus
  Statistical Thesaurus
  Complete Link Algorithm
CSE 8337 Spring 2005 3
Query Operations Introduction
IR queries as stated by the user may not be precise or effective.
There are many techniques to improve a stated query; the improved query is then processed in its place.
CSE 8337 Spring 2005 4
Relevance Feedback
Use assessments by users as to the relevance of previously returned documents to create new (or modify old) queries.
Technique:
1. Increase weights of terms from relevant documents.
2. Decrease weights of terms from nonrelevant documents.
See Figure 10.4 in Automatic Text Processing and Figure 6-10 in Introduction to Modern Information Retrieval.
CSE 8337 Spring 2005 5
Relevance Feedback
After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
Use this feedback information to reformulate the query.
Produce new results based on reformulated query.
Allows a more interactive, multi-pass process.
CSE 8337 Spring 2005 6
Relevance Feedback Architecture
[Diagram: the user's query string goes to the IR system, which ranks documents from the document corpus (1. Doc1, 2. Doc2, 3. Doc3, ...). The user's relevance feedback on these ranked documents drives query reformulation; the revised query is run again, producing re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
CSE 8337 Spring 2005 7
Query Reformulation
Revise query to account for feedback:
Query Expansion: Add new terms to the query from relevant documents.
Term Reweighting: Increase weight of terms in relevant documents and decrease weight of terms in irrelevant documents.
Several algorithms for query reformulation.
CSE 8337 Spring 2005 8
Query Reformulation for VSR
Change query vector using vector algebra.
Add the vectors for the relevant documents to the query vector.
Subtract the vectors for the irrelevant docs from the query vector.
This adds both positively and negatively weighted terms to the query, as well as reweighting the initial terms.
CSE 8337 Spring 2005 9
Optimal Query
Assume that the relevant set of documents Cr is known.
Then the best query, the one that ranks all and only the relevant documents at the top, is:

  q_opt = (1/|Cr|) · Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) · Σ_{dj ∉ Cr} dj

where N is the total number of documents.
CSE 8337 Spring 2005 10
Standard Rocchio Method
Since all relevant documents are unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of documents and include the initial query q:

  q_m = α·q + (β/|Dr|) · Σ_{dj ∈ Dr} dj − (γ/|Dn|) · Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
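To make the update concrete, here is a minimal Python sketch of the Rocchio formula on sparse term-weight vectors stored as dicts; the function name and the toy vectors below are invented for illustration, not taken from the slides:

```python
# Minimal sketch of the Rocchio update
#   q_m = alpha*q + beta/|Dr| * sum(Dr) - gamma/|Dn| * sum(Dn)
# on sparse term-weight vectors (dicts). Toy data, not from the slides.

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=1.0, gamma=1.0):
    terms = set(q)
    for d in rel_docs + nonrel_docs:
        terms |= set(d)
    qm = {}
    for t in terms:
        w = alpha * q.get(t, 0.0)
        if rel_docs:
            w += beta / len(rel_docs) * sum(d.get(t, 0.0) for d in rel_docs)
        if nonrel_docs:
            w -= gamma / len(nonrel_docs) * sum(d.get(t, 0.0) for d in nonrel_docs)
        qm[t] = max(w, 0.0)  # clip negative weights to zero (a common choice)
    return qm

q = {"apple": 1.0, "computer": 1.0}
rel = [{"apple": 0.5, "computer": 0.8, "laptop": 0.6}]
non = [{"apple": 0.9, "fruit": 0.7}]
print(rocchio(q, rel, non))
```

With α = β = γ = 1, terms from the relevant document such as "laptop" are pulled into the query while terms unique to the non-relevant document are pushed out.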
CSE 8337 Spring 2005 11
Ide Regular Method
Since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback:

  q_m = α·q + β · Σ_{dj ∈ Dr} dj − γ · Σ_{dj ∈ Dn} dj

α: Tunable weight for initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant documents.
CSE 8337 Spring 2005 12
Ide “Dec Hi” Method
Bias towards rejecting just the highest ranked of the irrelevant documents:

  q_m = α·q + β · Σ_{dj ∈ Dr} dj − γ · max_{non-relevant}(dj)

where max_{non-relevant}(dj) is the highest-ranked non-relevant document.
α: Tunable weight for initial query.
β: Tunable weight for relevant documents.
γ: Tunable weight for irrelevant document.
CSE 8337 Spring 2005 13
Comparison of Methods
Overall, experimental results indicate no clear preference for any one of the specific methods.
All methods generally improve retrieval performance (recall & precision) with feedback.
Generally just let tunable constants equal 1.
CSE 8337 Spring 2005 14
Fair Evaluation of Relevance Feedback
Remove from the corpus any documents for which feedback was provided.
Measure recall/precision performance on the remaining residual collection.
Compared to complete corpus, specific recall/precision numbers may decrease since relevant documents were removed.
However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.
Fig 10.5 in Automatic Text Processing
CSE 8337 Spring 2005 15
Evaluating Relevance Feedback
Test-and-control collection:
Divide the document collection in two parts.
Use the test portion to perform relevance feedback and to modify the query.
Perform the test on the control portion using both the original and modified query.
Compare results.
CSE 8337 Spring 2005 16
Why is Feedback Not Widely Used?
Users sometimes reluctant to provide explicit feedback.
Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one.
Makes it harder to understand why a particular document was retrieved.
CSE 8337 Spring 2005 17
Pseudo Feedback
Use relevance feedback methods without explicit user input.
Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
Allows for query expansion that includes terms that are correlated with the query terms.
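The loop just described can be sketched directly: rank, assume the top m documents are relevant, and fold their strongest terms back into the query. Everything below (dot-product scoring, the toy corpus, averaging the added weights) is an illustrative assumption:

```python
# Minimal pseudo-relevance-feedback sketch: rank documents by dot-product
# similarity, assume the top m are relevant, and expand the query with
# their strongest terms. All data here is illustrative.

def score(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def pseudo_feedback(q, docs, m=2, n_expand=2):
    ranked = sorted(docs, key=lambda d: score(q, d), reverse=True)
    top = ranked[:m]                      # top m documents, assumed relevant
    pool = {}                             # accumulate their term weights
    for d in top:
        for t, w in d.items():
            pool[t] = pool.get(t, 0.0) + w
    new_terms = [t for t, _ in sorted(pool.items(), key=lambda x: -x[1])
                 if t not in q][:n_expand]
    expanded = dict(q)
    for t in new_terms:
        expanded[t] = pool[t] / m         # averaged weight for added terms
    return expanded

docs = [{"apple": 1.0, "computer": 1.0, "laptop": 0.8},
        {"computer": 0.9, "powerbook": 0.7},
        {"apple": 0.6, "fruit": 0.9}]
q = {"apple": 1.0, "computer": 1.0}
print(pseudo_feedback(q, docs))
```

Here the "fruit" sense of apple never enters the expanded query because the document carrying it is not among the top-ranked results.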
CSE 8337 Spring 2005 18
Pseudo Feedback Results
Found to improve performance on TREC competition ad-hoc retrieval task.
Works even better if top documents must also satisfy additional boolean constraints in order to be used in feedback.
CSE 8337 Spring 2005 19
Term Reweighting for PM
Use statistics found in the retrieved documents:
Dr – set of relevant and retrieved documents.
Dr,i – set of relevant and retrieved documents that contain ki.
ni – number of documents in the collection that contain ki.

  P(ki | R) = |Dr,i| / |Dr|

  P(ki | ¬R) = (ni − |Dr,i|) / (N − |Dr|)
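A minimal sketch of these two probability estimates over toy document sets (documents represented as sets of terms; all data invented):

```python
# Hedged sketch of the probabilistic term-reweighting estimates:
#   P(k_i | R)     = |Dr,i| / |Dr|
#   P(k_i | not R) = (n_i - |Dr,i|) / (N - |Dr|)
# Dr is the set of retrieved relevant documents, n_i counts documents
# containing the term. Toy data, invented for illustration.

def term_probabilities(term, rel_docs, all_docs):
    N = len(all_docs)
    Dr = len(rel_docs)
    Dri = sum(1 for d in rel_docs if term in d)
    ni = sum(1 for d in all_docs if term in d)
    p_rel = Dri / Dr                    # P(k_i | R)
    p_nonrel = (ni - Dri) / (N - Dr)    # P(k_i | not R)
    return p_rel, p_nonrel

all_docs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c"}]
rel_docs = [{"a", "b"}, {"a", "c"}]
print(term_probabilities("a", rel_docs, all_docs))  # → (1.0, 0.0)
```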
CSE 8337 Spring 2005 20
Term Reweighting
No query expansion.
Document term weights not used.
Query term weights not used.
Therefore, not usually as effective as the previous vector approach.
CSE 8337 Spring 2005 21
Local vs. Global Automatic Analysis
Local – Documents retrieved are examined to automatically determine query expansion. No relevance feedback needed.
Global – Thesaurus used to help select terms for expansion.
CSE 8337 Spring 2005 22
Automatic Local Analysis
At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents.
Base the correlation analysis on only the “local” set of retrieved documents for a specific query.
Avoids ambiguity by determining similar (correlated) terms only within relevant documents:
  “Apple computer” → “Apple computer Powerbook laptop”
CSE 8337 Spring 2005 23
Automatic Local Analysis
Expand the query with terms found in local clusters.
Dl – set of documents retrieved for query q.
Vl – set of words used in Dl.
Sl – set of distinct stems in Vl.
f_{si,j} – frequency of stem si in document dj ∈ Dl.
Construct a stem-stem association matrix.
CSE 8337 Spring 2005 24
Association Matrix

        w1    w2    w3   ...   wn
  w1    c11   c12   c13  ...   c1n
  w2    c21
  w3    c31
  ...
  wn    cn1

cij: correlation factor between stem si and stem sj:

  c_ij = Σ_{dk ∈ Dl} f_{si,k} × f_{sj,k}

f_{si,k}: frequency of stem si in document dk.
CSE 8337 Spring 2005 25
Normalized Association Matrix
The frequency-based correlation factor favors more frequent terms, so normalize the association scores:

  s_ij = c_ij / (c_ii + c_jj − c_ij)

The normalized score is 1 if two stems have the same frequency in all documents.
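A small Python sketch of local association analysis: build c_ij over a toy local document set, normalize to s_ij, and pick the top correlated neighbor for a query term. The stems and documents are invented for illustration:

```python
# Sketch of local association analysis: the stem-stem association matrix
#   c_ij = sum_k f_ik * f_jk   over the local document set,
# normalized with
#   s_ij = c_ij / (c_ii + c_jj - c_ij).
# Toy stems and documents, invented for illustration.

from collections import Counter
from itertools import product

local_docs = [["apple", "computer", "apple"],
              ["computer", "laptop", "apple"],
              ["laptop", "powerbook"]]

freqs = [Counter(d) for d in local_docs]          # stem frequencies per doc
stems = sorted(set().union(*freqs))

c = {(i, j): sum(f[i] * f[j] for f in freqs) for i, j in product(stems, stems)}
s = {(i, j): c[i, j] / (c[i, i] + c[j, j] - c[i, j]) for i, j in c}

def top_neighbors(term, n=1):
    """Expand a query term with its n most correlated stems."""
    cands = [(s[term, j], j) for j in stems if j != term]
    return [j for _, j in sorted(cands, reverse=True)[:n]]

print(top_neighbors("apple"))
```

The same `top_neighbors` lookup is all the query-expansion step later needs: for each query term, add the n stems with the highest s_ij.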
CSE 8337 Spring 2005 26
Metric Correlation Matrix
Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents.
Metric correlations account for term proximity:

  c_ij = Σ_{ku ∈ Vi} Σ_{kv ∈ Vj} 1 / r(ku, kv)

Vi: set of all occurrences of term i in any document.
r(ku, kv): distance in words between word occurrences ku and kv (∞ if ku and kv are occurrences in different documents).
CSE 8337 Spring 2005 27
Normalized Metric Correlation Matrix
Normalize scores to account for term frequencies:

  s_ij = c_ij / (|Vi| × |Vj|)
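The metric correlation can be sketched over toy positional data; occurrence pairs in different documents are simply skipped, which matches treating their distance r as infinite. All data below is invented:

```python
# Sketch of metric correlation: c_ij sums 1/r(k_u, k_v) over all
# occurrence pairs of terms i and j, where r is the word distance and
# pairs in different documents contribute nothing (r = infinity).
# Toy documents, invented for illustration.

def metric_correlation(term_i, term_j, docs):
    c = 0.0
    for doc in docs:                          # doc is a list of words
        pos_i = [p for p, w in enumerate(doc) if w == term_i]
        pos_j = [p for p, w in enumerate(doc) if w == term_j]
        for u in pos_i:
            for v in pos_j:
                if u != v:
                    c += 1.0 / abs(u - v)
    return c

docs = [["apple", "computer", "laptop"],
        ["apple", "fruit", "pie", "computer"]]

c_ij = metric_correlation("apple", "computer", docs)
n_i = sum(d.count("apple") for d in docs)      # |Vi|
n_j = sum(d.count("computer") for d in docs)   # |Vj|
print(c_ij, c_ij / (n_i * n_j))                # raw and normalized scores
```

Adjacent occurrences (distance 1) contribute a full point; the pair four words apart contributes only 1/3, so proximity is rewarded.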
CSE 8337 Spring 2005 28
Query Expansion with Correlation Matrix
For each term i in query, expand query with the n terms, j, with the highest value of cij (sij).
This adds semantically related terms in the “neighborhood” of the query terms.
CSE 8337 Spring 2005 29
Problems with Local Analysis
Term ambiguity may introduce irrelevant statistically correlated terms:
  “Apple computer” → “Apple red fruit computer”
Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
CSE 8337 Spring 2005 30
Automatic Global Analysis
Determine term similarity through a pre-computed statistical analysis of the complete corpus.
Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
Expand queries with statistically most similar terms.
CSE 8337 Spring 2005 31
Automatic Global Analysis
There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
Query Expansion based on a Similarity Thesaurus
Query Expansion based on a Statistical Thesaurus
CSE 8337 Spring 2005 32
Thesaurus
A thesaurus provides information on synonyms and semantically related words and phrases.
Example: physician
  syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||sawbones
  rel: medic, general practitioner, surgeon,
CSE 8337 Spring 2005 33
Thesaurus-based Query Expansion
For each term t in a query, expand the query with synonyms and related words of t from the thesaurus.
May weight added terms less than original query terms.
Generally increases recall.
May significantly decrease precision, particularly with ambiguous terms:
  “interest rate” → “interest rate fascinate evaluate”
CSE 8337 Spring 2005 34
Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
These relationships are not derived directly from co-occurrence of terms inside documents.
They are obtained by considering that the terms are concepts in a concept space.
In this concept space, each term is indexed by the documents in which it appears.
Terms assume the original role of documents, while documents are interpreted as indexing elements.
CSE 8337 Spring 2005 35
Similarity Thesaurus
The following definitions establish the proper framework:
t: number of terms in the collection
N: number of documents in the collection
f_{i,j}: frequency of occurrence of term ki in document dj
t_j: vocabulary of document dj (its number of distinct terms)
itf_j: inverse term frequency for document dj
CSE 8337 Spring 2005 36
Similarity Thesaurus
The inverse term frequency for document dj is:

  itf_j = log(t / t_j)

To each term ki is associated a vector k_i given by:

  k_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})
CSE 8337 Spring 2005 37
Similarity Thesaurus
where w_{i,j} is a weight associated with the index-document pair [ki, dj]. These weights are computed as follows:

  w_{i,j} = ( (0.5 + 0.5 · f_{i,j} / max_j(f_{i,j})) · itf_j ) / sqrt( Σ_{l=1}^{N} (0.5 + 0.5 · f_{i,l} / max_l(f_{i,l}))² · itf_l² )

where max_j(f_{i,j}) is the maximum frequency of term ki over all documents.
CSE 8337 Spring 2005 38
Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor c_{u,v} given by:

  c_{u,v} = k_u · k_v = Σ_{dj} w_{u,j} × w_{v,j}

The global similarity thesaurus is built through the computation of the correlation factor c_{u,v} for each pair of index terms [ku, kv] in the collection.
CSE 8337 Spring 2005 39
Similarity Thesaurus
This computation is expensive, but the global similarity thesaurus has to be computed only once and can be updated incrementally.
CSE 8337 Spring 2005 40
Query Expansion based on a Similarity Thesaurus
Query expansion is done in three steps as follows:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q,kv).
CSE 8337 Spring 2005 41
Query Expansion - step one
To the query q is associated a vector q in the term-concept space, given by:

  q = Σ_{ki ∈ q} w_{i,q} · k_i

where w_{i,q} is a weight associated with the index-query pair [ki, q].
CSE 8337 Spring 2005 42
Query Expansion - step two
Compute a similarity sim(q,kv) between each term kv and the user query q:

  sim(q, kv) = q · k_v = Σ_{ku ∈ q} w_{u,q} × c_{u,v}

where c_{u,v} is the correlation factor.
CSE 8337 Spring 2005 43
Query Expansion - step three
Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q'.
To each expansion term kv in the query q' is assigned a weight w_{v,q'} given by:

  w_{v,q'} = sim(q, kv) / Σ_{ku ∈ q} w_{u,q}

The expanded query q' is then used to retrieve new documents for the user.
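Putting the three steps together: assuming precomputed term-concept weights w_{i,j} (the toy values below are invented), the whole expansion reduces to a few dot products:

```python
# Hedged sketch of similarity-thesaurus query expansion, starting from
# precomputed term-concept weight vectors w[term][doc] (the w_{i,j}).
# All vectors and query weights here are invented for illustration.

w = {"apple":    {"d1": 0.8, "d2": 0.1},
     "computer": {"d1": 0.7, "d2": 0.2},
     "fruit":    {"d1": 0.1, "d2": 0.9}}

def corr(u, v):                       # c_{u,v} = k_u . k_v
    docs = set(w[u]) | set(w[v])
    return sum(w[u].get(d, 0.0) * w[v].get(d, 0.0) for d in docs)

def sim(q_weights, v):                # sim(q, k_v) = sum_u w_{u,q} * c_{u,v}
    return sum(wq * corr(u, v) for u, wq in q_weights.items())

def expand(q_weights, r=1):
    cands = [t for t in w if t not in q_weights]
    ranked = sorted(cands, key=lambda v: sim(q_weights, v), reverse=True)
    denom = sum(q_weights.values())
    q2 = dict(q_weights)
    for v in ranked[:r]:              # w_{v,q'} = sim(q, k_v) / sum_u w_{u,q}
        q2[v] = sim(q_weights, v) / denom
    return q2

q = {"apple": 1.0}
print(expand(q))
```

With these toy weights, "computer" (which shares the d1 concept with "apple") outranks "fruit" and is the term added to the query.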
CSE 8337 Spring 2005 44
Query Expansion Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

c(A,A) = 10.991   c(A,C) = 10.781   c(A,D) = 10.781   ...
c(D,E) = 10.398   c(B,E) = 10.396   c(E,E) = 10.224
CSE 8337 Spring 2005 45
Query Expansion Sample
Query: q = A E E
sim(q,A) = 24.298   sim(q,C) = 23.833   sim(q,D) = 23.833   sim(q,B) = 23.830   sim(q,E) = 23.435

New query: q' = A C D E E
w(A,q') = 6.88   w(C,q') = 6.75   w(D,q') = 6.75   w(E,q') = 6.64
CSE 8337 Spring 2005 46
WordNet
A more detailed database of semantic relationships between English words.
Developed by famous cognitive psychologist George Miller and a team at Princeton University.
About 144,000 English words.
Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
CSE 8337 Spring 2005 47
WordNet Synset Relationships
Antonym: front → back
Attribute: benevolence → good (noun to adjective)
Pertainym: alphabetical → alphabet (adjective to noun)
Similar: unquestioning → absolute
Cause: kill → die
Entailment: breathe → inhale
Holonym: chapter → text (part-of)
Meronym: computer → cpu (whole-of)
Hyponym: tree → plant (specialization)
Hypernym: fruit → apple (generalization)
CSE 8337 Spring 2005 48
WordNet Query Expansion
Add synonyms in the same synset.
Add hyponyms to add specialized terms.
Add hypernyms to generalize a query.
Add other related terms to expand the query.
CSE 8337 Spring 2005 49
Statistical Thesaurus
Existing human-developed thesauri are not easily available in all languages.
Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
Semantically related terms can be discovered from statistical analysis of corpora.
CSE 8337 Spring 2005 50
Query Expansion Based on a Statistical Thesaurus
The global thesaurus is composed of classes which group correlated terms in the context of the whole collection.
Such correlated terms can then be used to expand the original user query.
These terms must be low-frequency terms.
However, it is difficult to cluster low-frequency terms.
To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
This algorithm must produce small and tight clusters.
CSE 8337 Spring 2005 51
Query Expansion based on a Statistical Thesaurus
Use the thesaurus classes for query expansion.
Compute an average term weight wtc for each thesaurus class C:

  wtc = ( Σ_{i=1}^{|C|} w_{i,C} ) / |C|
CSE 8337 Spring 2005 52
Query Expansion based on a Statistical Thesaurus
wtc can be used to compute a thesaurus class weight wc as:

  wc = wtc / |C|^0.5
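As plain arithmetic, for a hypothetical class C of three terms (the per-term weights below are invented, and the square-root-of-class-size damping is an assumption about the class-weight formula):

```python
# Tiny numeric sketch of the class-weight computation for a hypothetical
# thesaurus class C with three member terms. The per-term weights w_{i,C}
# are invented; the |C|**0.5 damping is an assumption about the formula.

class_weights = [0.6, 0.8, 0.4]                  # w_{i,C} for the terms in C
wtc = sum(class_weights) / len(class_weights)    # average term weight wtc
wc = wtc / len(class_weights) ** 0.5             # class weight wc
print(wtc, wc)
```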
CSE 8337 Spring 2005 53
Query Expansion Sample
TC = 0.90   NDC = 2.00   MIDF = 0.2

Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

sim(1,3) = 0.99   sim(1,2) = 0.40   sim(2,3) = 0.29
sim(4,1) = 0.00   sim(4,2) = 0.00   sim(4,3) = 0.00

[Cluster hierarchy: documents D1..D4 start as clusters C1..C4; C1 and C3 merge at similarity 0.99, C2 joins at 0.29, and C4 joins at 0.00.]

idf A = 0.0   idf B = 0.3   idf C = 0.12   idf D = 0.12   idf E = 0.60

q = A E E  →  q' = A B E E
CSE 8337 Spring 2005 54
Query Expansion based on a Statistical Thesaurus
Problems with this approach:
Initialization of the parameters TC, NDC and MIDF.
TC depends on the collection.
Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC.
A high value of TC might yield classes with too few terms.
CSE 8337 Spring 2005 55
Complete link algorithm
This is a document clustering algorithm which produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, then go back to step 2.
6. Return a hierarchy of clusters.
Similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
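The steps above can be sketched on a toy pairwise-similarity table (all values invented); the minimum-over-pairs merge criterion in `cluster_sim` is what makes this complete link:

```python
# Sketch of complete-link clustering on a toy similarity table.
# Cluster-to-cluster similarity is the MINIMUM pairwise document
# similarity, which is what keeps the clusters small and tight.
# All similarity values are invented for illustration.

sim = {("d1", "d2"): 0.9, ("d1", "d3"): 0.3,
       ("d2", "d3"): 0.4, ("d1", "d4"): 0.1,
       ("d2", "d4"): 0.2, ("d3", "d4"): 0.8}

def pair_sim(a, b):
    return sim.get((a, b), sim.get((b, a), 1.0))

def cluster_sim(cu, cv):                        # complete link: min over pairs
    return min(pair_sim(a, b) for a in cu for b in cv)

def complete_link(docs, threshold):
    clusters = [frozenset([d]) for d in docs]   # 1. one document per cluster
    while len(clusters) > 1:
        # 2-3. find the most similar pair of clusters
        best = max(((cluster_sim(cu, cv), cu, cv)
                    for i, cu in enumerate(clusters)
                    for cv in clusters[i + 1:]), key=lambda x: x[0])
        s, cu, cv = best
        if s < threshold:                       # 5. stop criterion
            break
        clusters.remove(cu)
        clusters.remove(cv)
        clusters.append(cu | cv)                # 4. merge Cu and Cv
    return clusters

print(complete_link(["d1", "d2", "d3", "d4"], threshold=0.5))
```

With a threshold of 0.5, {d1, d2} and {d3, d4} form but never merge with each other, because the weakest cross-pair (d1, d4) = 0.1 drags the complete-link similarity below the threshold.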
CSE 8337 Spring 2005 56
Selecting the terms that compose each class
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows.
Obtain from the user three parameters:
TC: threshold class
NDC: number of documents in a class
MIDF: minimum inverse document frequency
CSE 8337 Spring 2005 57
Selecting the terms that compose each class
Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes.
This threshold has to be surpassed by sim(Cu, Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class.
CSE 8337 Spring 2005 58
Selecting the terms that compose each class
Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered.
A low value of NDC might restrict the selection to the smaller cluster Cu+v
CSE 8337 Spring 2005 59
Selecting the terms that compose each class
Consider the set of documents in each document cluster pre-selected above.
Only the lower-frequency terms are used as sources of terms for the thesaurus classes.
The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class
CSE 8337 Spring 2005 60
Global vs. Local Analysis
Global analysis requires intensive term correlation computation only once at system development time.
Local analysis requires intensive term correlation computation for every query at run time (although number of terms and documents is less than in global analysis).
But local analysis gives better results.
CSE 8337 Spring 2005 61
Query Expansion Conclusions
Expansion of queries with related terms can improve performance, particularly recall.
However, must select similar terms very carefully to avoid problems, such as loss of precision.
CSE 8337 Spring 2005 62
Conclusion
A thesaurus is an efficient method to expand queries.
The computation is expensive but it is executed only once.
Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query.
Query expansion based on a statistical thesaurus needs well-defined parameters.