Information Retrieval: Models and Methods
October 15, 2003
CMSC 35900
Gina-Anne Levow
Roadmap
• Problem:
  – Matching Topics and Documents
• Methods:
  – Classic: Vector Space Model
  – N-grams
  – HMMs
• Challenge: Beyond literal matching
  – Expansion Strategies
  – Aspect Models
Matching Topics and Documents
• Two main perspectives:
  – Pre-defined, fixed, finite topics:
    • “Text Classification”
  – Arbitrary topics, typically defined by a statement of information need (aka query):
    • “Information Retrieval”
Matching Topics and Documents
• Documents are “about” some topic(s)
• Question: Evidence of “aboutness”?
  – Words!
• Possibly also meta-data in documents
  – Tags, etc.
• Model encodes how words capture topic
  – E.g., “bag of words” model, Boolean matching
  – What information is captured?
  – How is similarity computed?
Models for Retrieval and Classification
• A plethora of models is used
• Here:
  – Vector Space Model
  – N-grams
  – HMMs
Vector Space Information Retrieval
• Task:
  – Document collection
  – Query specifies information need: free text
  – Relevance judgments: 0/1 for all docs
• Word evidence: Bag of words
  – No ordering information
Vector Space Model
• Represent documents and queries as
  – Vectors of term-based features
    • Features: tied to occurrence of terms in collection
  – E.g.:

$$d_j = (t_{1,j}, t_{2,j}, \ldots, t_{N,j}); \qquad q_k = (t_{1,k}, t_{2,k}, \ldots, t_{N,k})$$

• Solution 1: Binary features: $t_i = 1$ if the term is present, 0 otherwise
  – Similarity: number of terms in common
    • Dot product:

$$sim(q_k, d_j) = \sum_{i=1}^{N} t_{i,k} \, t_{i,j}$$
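To make this concrete, here is a minimal Python sketch (not from the slides; the vocabulary and texts are invented): documents and queries become 0/1 term vectors, and similarity is the count of shared terms.

```python
def binary_vector(text, vocabulary):
    """t_i = 1 if term i occurs in the text, 0 otherwise."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

def dot_product_sim(query_vec, doc_vec):
    """Number of terms the query and document share."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Invented vocabulary and texts.
vocabulary = ["cat", "dog", "house", "white"]
doc = binary_vector("the White House press office", vocabulary)
query = binary_vector("white house", vocabulary)
print(dot_product_sim(query, doc))  # 2: shares "white" and "house"
```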
Vector Space Model II
• Problem: Not all terms equally interesting
  – E.g., the vs. dog vs. Levow
• Solution: Replace binary term features with weights
  – Document collection: term-by-document matrix
  – View as vectors in a multidimensional space
    • Nearby vectors are related
  – Normalize for vector length

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}); \qquad q_k = (w_{1,k}, w_{2,k}, \ldots, w_{N,k})$$
Vector Similarity Computation
• Similarity = Dot product:

$$sim(q_k, d_j) = q_k \cdot d_j = \sum_{i=1}^{N} w_{i,k} \, w_{i,j}$$

• Normalization:
  – Normalize weights in advance
  – Normalize post hoc:

$$sim(q_k, d_j) = \frac{\sum_{i=1}^{N} w_{i,k} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^2}\,\sqrt{\sum_{i=1}^{N} w_{i,j}^2}}$$
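As a sketch, the post-hoc normalized (cosine) similarity in plain Python, with arbitrary illustrative weights:

```python
import math

def cosine_sim(q, d):
    """Dot product normalized post hoc by both vector lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Arbitrary illustrative weight vectors.
print(cosine_sim([0.5, 0.0, 1.2], [0.4, 0.3, 0.9]))  # ~0.96
```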
Term Weighting
• “Aboutness”
  – To what degree is this term what the document is about?
  – Within-document measure
  – Term frequency (tf): # occurrences of term i in doc j
• “Specificity”
  – How surprised are you to see this term?
  – Collection frequency
  – Inverse document frequency (idf), where N is the # of docs in the collection and n_i the # of docs containing term i:

$$idf_i = \log\!\left(\frac{N}{n_i}\right)$$

$$w_{i,j} = tf_{i,j} \times idf_i$$
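A toy tf·idf computation, assuming a hypothetical three-document collection (everything here is invented for illustration):

```python
import math

# Invented toy collection.
docs = [
    "the dog chased the cat",
    "the dog barked",
    "levow teaches the retrieval course",
]
N = len(docs)  # number of documents

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)                      # occurrences in this doc
    n_i = sum(term in d.split() for d in all_docs)   # docs containing the term
    return tf * math.log(N / n_i) if n_i else 0.0

tokens = docs[0].split()
for term in ["the", "dog", "cat"]:
    print(term, round(tf_idf(term, tokens, docs), 3))
# the 0.0 (appears in every doc), dog 0.405, cat 1.099: rarer terms weigh more
```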
Term Selection & Formation
• Selection:
  – Some terms are truly useless
    • Too frequent, no content
    • E.g., the, a, and, …
  – Stop words: ignore such terms altogether
• Creation:
  – Too many surface forms for the same concept
    • E.g., inflections of words: verb conjugations, plurals
  – Stem terms: treat all forms as the same underlying term
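A toy sketch of both steps; the stop list and suffix rules here are invented for illustration (real systems use curated stop lists and a proper stemmer such as Porter's):

```python
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}  # toy stop list (illustrative)

def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer (e.g., Porter's)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def index_terms(text):
    """Lowercase, drop stop words, stem what remains."""
    return [naive_stem(t) for t in text.lower().split() if t not in STOP_WORDS]

print(index_terms("The dogs chased the cats"))  # ['dog', 'chas', 'cat']
```

Note that the crude rules over-strip (“chased” becomes “chas”), which is exactly why real stemmers carry many more rules.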
N-grams
• Simple model
• Evidence: More than bag of words
  – Captures context, order information
    • E.g., White House
• Applicable to many text tasks
  – Language identification, authorship attribution, genre classification, topic/text classification
  – Language modeling for ASR, etc.
Text Classification with N-grams
• Task: Classes identified by document sets
  – Assign new documents to correct class
• N-gram categorization:
  – Text: D; category: $c \in C = \{c_1, \ldots, c_{|C|}\}$
  – Select c maximizing posterior probability (dropping the constant $\Pr(D)$ and assuming uniform class priors $\Pr(c)$):

$$
\begin{aligned}
c^* &= \arg\max_{c \in C} \Pr(c \mid D)\\
    &= \arg\max_{c \in C} \Pr(D \mid c)\,\Pr(c)\\
    &= \arg\max_{c \in C} \Pr(D \mid c)\\
    &= \arg\max_{c \in C} \prod_{i=1}^{N} \Pr(w_i \mid w_{i-n+1}, \ldots, w_{i-1}, c)
\end{aligned}
$$
Text Classification with N-grams
• Representation:
  – For each class, train an N-gram model
• “Similarity”: For each document D to classify, select the c with highest likelihood
  – Can also use entropy/perplexity
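A minimal character-bigram classifier along these lines, using add-one smoothing for simplicity (the experiments on the next slide use more sophisticated smoothing; all training data here is invented):

```python
import math
from collections import Counter

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(texts):
    """Character-bigram counts for one class's training documents."""
    counts = Counter()
    for t in texts:
        counts.update(bigrams(t))
    return counts

def log_likelihood(text, counts, vocab_size):
    """log Pr(D | c) under the class model, with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[bg] + 1) / (total + vocab_size))
               for bg in bigrams(text))

# Invented training data for two classes.
classes = {
    "english": train(["the cat sat on the mat", "a dog in the house"]),
    "spanish": train(["el gato en la casa", "un perro en la calle"]),
}
vocab = {bg for model in classes.values() for bg in model}

doc = "the dog sat"
best = max(classes, key=lambda c: log_likelihood(doc, classes[c], len(vocab)))
print(best)  # english
```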
Assessment & Smoothing
• Comparable to “state of the art”
  – 0.89 accuracy
• Reliable
  – Across smoothing techniques
  – Across languages: generalizes to Chinese characters

Accuracy by n-gram order n and smoothing method (Absolute discounting, Good-Turing, Linear discounting, Witten-Bell):

| n | Abs  | G-T  | Lin  | W-B  |
|---|------|------|------|------|
| 4 | 0.88 | 0.88 | 0.87 | 0.89 |
| 5 | 0.89 | 0.87 | 0.88 | 0.89 |
| 6 | 0.89 | 0.88 | 0.88 | 0.89 |
HMMs
• Provides a generative model of topicality
  – Solid probabilistic framework rather than ad hoc weighting
• Noisy channel model:
  – View query Q as the output of an underlying relevant document D, passed through the mind of the user
HMM Information Retrieval
• Task: Given a user-generated query Q, return a ranked list of relevant documents
• Model:
  – Maximize Pr(D is Relevant) for some query Q
  – Output symbols: terms in the document collection
  – States: Process to generate output symbols
    • From document D
    • From General English

$$\Pr(D \text{ is } R \mid Q) = \frac{\Pr(Q \mid D \text{ is } R)\,\Pr(D \text{ is } R)}{\Pr(Q)}$$

[HMM diagram: from the Query start state, transition with probability a to a General English state emitting Pr(q|GE), or with probability b to a Document state emitting Pr(q|D); both lead on to the Query end state.]
HMM Information Retrieval
• Generally use EM to train transition and output probabilities
  – E.g., on query-relevant document pairs
  – Data often insufficient
• Simplified strategy:
  – EM for transition probabilities, assumed the same across docs
  – Output distributions:

$$\Pr(q_k \mid D) = \frac{tf_{q_k, D}}{length(D)}$$

$$\Pr(q_k \mid GE) = \frac{\sum_{D} tf_{q_k, D}}{\sum_{D} length(D)}$$

$$\Pr(Q \mid D \text{ is } R) = \prod_{q_k \in Q} \bigl(a \Pr(q_k \mid GE) + b \Pr(q_k \mid D)\bigr)$$
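A sketch of this simplified scoring rule over a toy two-document collection; the mixture weights a and b are invented here (EM-trained in the actual approach):

```python
import math

# Toy collection (invented).
docs = {
    "d1": "the white house issued a statement".split(),
    "d2": "the dog chased the cat in the house".split(),
}

# General English model: term frequencies pooled over the whole collection.
total_len = sum(len(d) for d in docs.values())

def p_ge(q):
    return sum(d.count(q) for d in docs.values()) / total_len

def p_doc(q, d):
    return d.count(q) / len(d)

def score(query, d, a=0.7, b=0.3):
    # log Pr(Q | D is R) = sum of log(a*Pr(q|GE) + b*Pr(q|D));
    # a, b are illustrative stand-ins for the EM-trained weights.
    return sum(math.log(a * p_ge(q) + b * p_doc(q, d)) for q in query)

query = "white house".split()
ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
print(ranked)  # ['d1', 'd2']: d1 mentions both query terms
```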
EM Parameter Update
[Diagram: EM re-estimates the transition probabilities a and b (to a′ and b′) for the General English and Document states.]
Evaluation
• Comparison to VSM
  – HMM can outperform VSM
    • Some variation related to implementation
  – Can integrate other features, e.g., bigram or trigram models
Key Issue
• All approaches operate on term matching
  – If a synonym, rather than the original term, is used, the approach fails
• Develop more robust techniques
  – Match “concept” rather than term
• Expansion approaches
  – Add in related terms to enhance matching
• Mapping techniques
  – Associate terms to concepts
    » Aspect models, stemming
Expansion Techniques
• Can apply to query or document
• Thesaurus expansion
  – Use a linguistic resource (thesaurus, WordNet) to add synonyms/related terms
• Feedback expansion
  – Add terms that “should have appeared”
• User interaction
  – Direct or relevance feedback
• Automatic pseudo-relevance feedback
Query Refinement
• Typical queries very short, ambiguous
  – Cat: animal/Unix command
  – Add more terms to disambiguate, improve matching
• Relevance feedback
  – Retrieve with original queries
  – Present results
    • Ask user to tag relevant/non-relevant
  – “Push” toward relevant vectors, away from non-relevant
  – “Rocchio” expansion formula, with β + γ = 1 (e.g., 0.75, 0.25); r_j: relevant docs, s_k: non-relevant docs:

$$q_{i+1} = q_i + \frac{\beta}{R}\sum_{j=1}^{R} r_j - \frac{\gamma}{S}\sum_{k=1}^{S} s_k$$
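A sketch of the Rocchio update with the β = 0.75, γ = 0.25 setting mentioned above; the vectors and relevance judgments are invented:

```python
def rocchio(query, rel_docs, nonrel_docs, beta=0.75, gamma=0.25):
    """q_{i+1} = q_i + (beta/R) * sum(rel) - (gamma/S) * sum(nonrel)."""
    R, S = len(rel_docs), len(nonrel_docs)
    new_q = list(query)
    for i in range(len(query)):
        if R:
            new_q[i] += beta * sum(d[i] for d in rel_docs) / R
        if S:
            new_q[i] -= gamma * sum(d[i] for d in nonrel_docs) / S
        new_q[i] = max(new_q[i], 0.0)  # clip negative weights to zero
    return new_q

q = [1.0, 0.0, 0.0]                       # original query vector
rel = [[0.8, 0.6, 0.0], [0.9, 0.4, 0.0]]  # judged relevant
nonrel = [[0.0, 0.0, 1.0]]                # judged non-relevant
print(rocchio(q, rel, nonrel))  # [1.6375, 0.375, 0.0]: pulled toward relevant docs
```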
Compression Techniques
• Reduce surface term variation to concepts
• Stemming
  – Map inflectional variants to root
    • E.g., see, sees, seen, saw -> see
    • Crucial for highly inflected languages: Czech, Arabic
• Aspect models
  – Matrix representations typically very sparse
  – Reduce dimensionality to small # of key aspects
    • Mapping contextually similar terms together
    • Latent semantic analysis
Latent Semantic Analysis
LSI
Classic LSI Example (Deerwester)
SVD: Dimensionality Reduction
LSI, SVD, & Eigenvectors
• SVD decomposes:
  – Term x Document matrix X as
    • X = TSD′
  – Where T, D are the left and right singular vector matrices
  – S is a diagonal matrix of singular values
• Corresponds to the eigenvector-eigenvalue decomposition Y = VLV′
  – Where V is orthonormal and L is diagonal
    • T: matrix of eigenvectors of Y = XX′
    • D: matrix of eigenvectors of Y = X′X
    • S: diagonal matrix of singular values, the square roots of the eigenvalues in L
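A minimal LSI sketch using numpy's SVD on an invented term-by-document matrix, keeping k = 2 aspects:

```python
import numpy as np

# Invented term-by-document matrix X (rows: terms, cols: documents).
X = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 1, 0],   # "auto"
    [0, 1, 0, 1],   # "engine"
    [0, 0, 1, 1],   # "flower"
], dtype=float)

# X = T S D'; numpy returns D' directly as the third factor.
T, S, Dt = np.linalg.svd(X, full_matrices=False)

k = 2                                          # keep the k largest singular values
X_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]    # rank-k approximation of X
print(np.round(X_k, 2))

# Document vectors in the reduced k-dimensional aspect space.
doc_vecs = np.diag(S[:k]) @ Dt[:k, :]
print(np.round(doc_vecs, 2))
```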
Computing Similarity in LSI
SVD Details
SVD Details (cont’d)