Information Retrieval and Vector Space Model Presented by Jun Miao York University
What is Information Retrieval?
• (image) = IR?
• IR: retrieve information that is relevant to your need
  ◦ Search Engine
  ◦ Question Answering
  ◦ Information Extraction
  ◦ Information Filtering
  ◦ Information Recommendation

In the old days…
• The term "information retrieval" may have been coined by Calvin Mooers
• Early IR applications were used in libraries
• Set-based retrieval: the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not.

Nowadays
• Ranked retrieval: the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
  ◦ a free-form query expresses the user's information need
  ◦ documents are ranked by decreasing likelihood of relevance
  ◦ many studies have found it more effective than set-based retrieval

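The difference between the two retrieval styles can be sketched in a few lines of Python. This is only an illustration: the document texts, the function names, and the toy relevance estimate (a count of query-term occurrences) are assumptions made for the example, not part of the slides.

```python
def set_based_retrieval(docs, query_terms):
    """Return the ids of documents that contain every query term (Boolean AND)."""
    wanted = set(query_terms)
    return {doc_id for doc_id, text in docs.items()
            if wanted <= set(text.lower().split())}


def ranked_retrieval(docs, query_terms):
    """Return all document ids, sorted by a toy relevance estimate.

    The estimate is simply the number of query-term occurrences; a real
    system would use a weighting scheme such as tf*idf.
    """
    wanted = set(query_terms)
    scores = {doc_id: sum(1 for tok in text.lower().split() if tok in wanted)
              for doc_id, text in docs.items()}
    # Sort by decreasing score; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document corpus.
docs = {
    "d1": "information retrieval in libraries",
    "d2": "retrieval of information and ranking of information",
    "d3": "vector space model",
}
query = ["information", "retrieval"]
print(set_based_retrieval(docs, query))  # a subset of the corpus
print(ranked_retrieval(docs, query))     # every document, best first
```

The set-based function returns only the matching subset, while the ranked function orders the whole corpus and leaves the cut-off to the user.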
An Information Retrieval Process (borrowed from Prof. Nie's slides)
[Figure: Info. need → Query → IR system (Retrieval over the Document collection) → Answer list]

Inside an IR system

Indexing Documents

Lexical Analysis
• What counts as a word or token in the indexing scheme?
• A big topic

Stop List
• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• Stop list: contains stop words, which are not used as index terms
  ◦ Prepositions
  ◦ Articles
  ◦ Pronouns
  ◦ Some adverbs and adjectives
  ◦ Some frequent words (e.g. document)
• The removal of stop words usually improves IR effectiveness
• A few "standard" stop lists are commonly used.

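As a minimal sketch of stop-word removal, the filter below drops every token found in a stop list. The list combines the function words quoted on the slide with two articles ("a", "the") added here as an assumption; it is not one of the "standard" lists.

```python
# Illustrative stop list: the slide's examples plus the articles "a" and "the".
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "Shipment of gold damaged in a fire".split()
print(remove_stop_words(tokens))  # ['Shipment', 'gold', 'damaged', 'fire']
```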
Stemming
• Reason:
  ◦ Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
• Stemming:
  ◦ Removing some endings of words
  ◦ dancer, dancers, dance, danced, dancing → dance

Stemming (Cont'd)
Two main methods:
• Linguistic/dictionary-based stemming
  ◦ high stemming accuracy
  ◦ high implementation and processing costs and higher coverage
• Porter-style stemming
  ◦ lower stemming accuracy
  ◦ lower implementation and processing costs and lower coverage
  ◦ usually sufficient for IR

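To make "removing some endings of words" concrete, here is a toy suffix stripper. It is not the Porter algorithm: the suffix list and the minimum stem length are hypothetical choices for this example, and the resulting stem ("danc") is a truncated form rather than a dictionary word.

```python
# Hypothetical rule list, longest suffixes first so "ers" wins over "s".
SUFFIXES = ("ers", "ing", "er", "ed", "e", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["dancer", "dancers", "dance", "danced", "dancing", "search", "searching"]:
    print(w, "->", toy_stem(w))
```

All five "dance" forms collapse to one stem, and "search" and "searching" collapse to another, which is the only property an index needs.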
Flat file indexing
• Each document is represented by a set of weighted keywords (terms):
  D1 → {(t1, w1), (t2, w2), …}
• e.g.
  D1 → {(comput, 0.2), (architect, 0.3), …}
  D2 → {(comput, 0.1), (network, 0.5), …}

Inverted Index

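The slide's figure is not available, so the sketch below only shows the general idea of an inverted index: instead of mapping each document to its weighted terms (the flat file), it maps each term to the documents that contain it. The two documents reuse the weights from the flat-file example; the function name and the posting layout are assumptions.

```python
from collections import defaultdict

# Flat-file representation: document -> {term: weight}.
flat_file = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}


def build_inverted_index(flat_file):
    """Turn document -> terms into term -> list of (document, weight) postings."""
    index = defaultdict(list)
    for doc_id in sorted(flat_file):
        for term, weight in flat_file[doc_id].items():
            index[term].append((doc_id, weight))
    return dict(index)


inverted = build_inverted_index(flat_file)
print(inverted["comput"])   # postings for a term shared by both documents
print(inverted["network"])  # postings for a term found in D2 only
```

With this layout a query only touches the postings of its own terms, rather than scanning every document.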
Query Analysis
• Parse query
• Remove stop words
• Stemming
• Get terms
• Adjacent operations: connect related terms together

Models
• Matching score model
  ◦ Document D = a set of weighted keywords
  ◦ Query Q = a set of non-weighted keywords
  ◦ $R(D, Q) = \sum_i w(t_i, D)$, where $t_i$ is in Q.

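A minimal sketch of the matching score: sum the document's weights for the terms that occur in the query. The document weights reuse the flat-file example; treating a term missing from the document as weight 0 is an assumption.

```python
def matching_score(doc_weights, query_terms):
    """R(D, Q): sum of w(t, D) over the distinct query terms t."""
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms))


d1 = {"comput": 0.2, "architect": 0.3}
d2 = {"comput": 0.1, "network": 0.5}
query = ["comput", "network"]
print(matching_score(d1, query))  # only "comput" matches in D1
print(matching_score(d2, query))  # both query terms match in D2
```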
Models (Cont'd)
• Boolean Model
• Vector Space Model
• Probability Model
• Language Model
• Neural Network Model
• Fuzzy Set Model
• …

tf*idf weighting schema
• tf = term frequency
  ◦ frequency of a term/keyword in a document
  ◦ the higher the tf, the higher the importance (weight) for the doc
• df = document frequency
  ◦ no. of documents containing the term
  ◦ distribution of the term
• idf = inverse document frequency
  ◦ the unevenness of term distribution in the corpus
  ◦ the specificity of a term to a document
  ◦ idf = log(d/df), where d = total number of documents
  ◦ the more evenly a term is distributed, the less specific it is to a document
• weight(t,D) = tf(t,D) * idf(t)

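The weighting formula can be sketched directly from the definitions above. The slide does not name a logarithm base, so base 10 is assumed here; the three tiny token lists are made up for the example.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """docs: dict doc_id -> list of tokens. Returns doc_id -> {term: tf * idf}."""
    d = len(docs)                      # total number of documents
    df = Counter()                     # number of documents containing each term
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {t: math.log10(d / n) for t, n in df.items()}  # base 10 is an assumption
    return {doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for doc_id, tokens in docs.items()}


docs = {
    "D1": ["gold", "fire"],
    "D2": ["silver", "silver", "truck"],
    "D3": ["gold", "truck"],
}
weights = tf_idf_weights(docs)
print(round(weights["D2"]["silver"], 3))  # tf = 2, df = 1
print(round(weights["D1"]["gold"], 3))    # tf = 1, df = 2
```

A term that occurs in every document gets idf = log(1) = 0 and therefore contributes nothing to any weight.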
Evaluation
• A result list according to a query
• What is its performance?
[Figure: Venn diagram of the Relevant and Retrieved document sets; their overlap is the retrieved relevant documents]

Metrics often used (together):
• Precision = retrieved relevant docs / retrieved docs
• Recall = retrieved relevant docs / relevant docs

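Both metrics follow from set sizes, as this small sketch shows. The document ids are invented, and returning 0.0 for an empty set is a convention chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```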
Precision-Recall Trade-off
[Figure: precision-recall curve; both axes run up to 1.0]
• Usually, more precision means less recall, and more recall means less precision
• Return all documents: recall = 1, but precision is very low

For Ranked List
• Consider two result lists of two IR systems S1 and S2 according to one query:
  1. [Figure: ranked list of S1, relevant documents marked]
  2. [Figure: ranked list of S2, relevant documents marked]
• Which one is better?

Average Precision
AP = sum(R(xi) / P(xi)) / n
• xi ∈ set of retrieved relevant documents
• P(xi): rank of xi in the retrieved list
• R(xi): rank of xi in the retrieved relevant document list
• n: number of retrieved relevant documents
List 1 (relevant documents at ranks 1, 3, 6, 9 and 10):
AP1 = ((1/1)+(2/3)+(3/6)+(4/9)+(5/10))/5 = 0.622

Average Precision (Cont'd)
List 2 (relevant documents at ranks 1, 2, 3, 5 and 6):
AP2 = ((1/1)+(2/2)+(3/3)+(4/5)+(5/6))/5 = 0.927
• S2 is better than S1

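The two AP values can be checked with a short sketch of the formula as defined on the slides, where the sum is divided by the number of retrieved relevant documents. The relevant ranks are read off the two computations (1, 3, 6, 9, 10 and 1, 2, 3, 5, 6); the list length of 10 and the helper `flags` are assumptions.

```python
def average_precision(relevance_flags):
    """relevance_flags[i] is True when the document at rank i + 1 is relevant."""
    ratios = []
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            # R(x) is the position among relevant hits so far, P(x) is the rank.
            ratios.append((len(ratios) + 1) / rank)
    return sum(ratios) / len(ratios) if ratios else 0.0


def flags(relevant_ranks, length=10):
    """Hypothetical helper: build a relevance list from a set of relevant ranks."""
    return [rank in relevant_ranks for rank in range(1, length + 1)]


ap1 = average_precision(flags({1, 3, 6, 9, 10}))
ap2 = average_precision(flags({1, 2, 3, 5, 6}))
print(round(ap1, 3), round(ap2, 3))  # 0.622 0.927
```

List 2 scores higher because its relevant documents sit nearer the top, even though both lists contain five relevant documents.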
Evaluating over multiple queries
• Mean Average Precision (MAP): arithmetic mean of average precisions over all queries
• 5 queries (topics) and 2 IR systems; APk is the average precision on query k

      AP1   AP2   AP3   AP4   AP5   MAP
  S1  0.7   0.8   0.9   0.3   0.5   0.64
  S2  0.9   0.9   0.2   0.3   0.4   0.54

• S1 is better than S2

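The MAP column of the table is a plain arithmetic mean, as this sketch confirms using the per-query AP values from the table. The function name is an assumption.

```python
def mean_average_precision(ap_values):
    """Arithmetic mean of per-query average precision values."""
    return sum(ap_values) / len(ap_values)


s1 = [0.7, 0.8, 0.9, 0.3, 0.5]
s2 = [0.9, 0.9, 0.2, 0.3, 0.4]
print(round(mean_average_precision(s1), 2))  # 0.64
print(round(mean_average_precision(s2), 2))  # 0.54
```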
Other Measurements
• Precision@N
• R-Precision
• F-measure
• E-measure
• …

Problem
• Sometimes the document collection is very large. It is then hard to calculate recall, because the full set of relevant documents is unknown.

Pooling
• Step 1. Take the top N documents from the results of each IR system to form a document pool.
• Step 2. Experts check the pool and tag each document as relevant or non-relevant for each topic.

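Step 1 amounts to a union of top-N prefixes, sketched below. The run contents, system names and N = 2 are invented for the example; Step 2 is manual judging and is not modelled.

```python
def build_pool(runs, n):
    """runs: dict system -> ranked list of doc ids. Returns the union of each top n."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:n])
    return pool


# Hypothetical ranked results of two systems for one topic.
runs = {
    "S1": ["d3", "d1", "d7", "d2"],
    "S2": ["d1", "d9", "d3", "d4"],
}
pool = build_pool(runs, 2)
print(sorted(pool))  # only these documents are sent to the experts
```

Documents outside the pool are never judged, so recall measured this way is relative to the relevant documents found in the pool.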
Difficulties in text IR
• Vocabulary mismatch
  ◦ Synonymy: e.g. car vs. automobile
  ◦ Polysemy: e.g. table
• Queries are ambiguous; they are partial specifications of the user's need
• Content representation may be inadequate and incomplete
• The user is the ultimate judge, but we don't know how the judge judges…
  ◦ The notion of relevance is imprecise, context- and user-dependent

Difficulties in web IR
• No stable document collection (spider, crawler)
• Invalid documents, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation in document quality
• Multilingual problem
• …

NLP in IR
• Simple methods: stop words, stemming
• Higher-level processing: chunking, parsing, word sense disambiguation
• Research on using NLP in IR needs more attention

Popular systems
• SMART: http://ftp.cs.cornell.edu/pub/smart/
• Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Okapi: http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html
• Lemur: http://www-2.cs.cmu.edu/~lemur/
• etc.

Conferences and Journals
• Conferences: SIGIR, TREC, CLEF, WWW, ECIR, …
• Journals:
  ◦ ACM Transactions on Information Systems (TOIS)
  ◦ ACM Transactions on Asian Language Information Processing (TALIP)
  ◦ Information Processing & Management (IP&M)
  ◦ Information Retrieval

Idea
• Convert documents and queries into vectors, and use a Similarity Coefficient (SC) to measure the similarity
• Presented by Gerard Salton et al. in 1975, implemented in the SMART IR system
• Premise: all terms are independent

Construct Vector
• Each dimension corresponds to a separate term.
• Wi,j = weight of term j in document or query i

Doc-Term Matrix
• N documents and M terms

       D1     D2     D3     …   Dn
  T1   W1,1   W2,1   W3,1   …   Wn,1
  T2   W1,2   W2,2   W3,2   …   Wn,2
  T3   W1,3   W2,3   W3,3   …   Wn,3
  …    …      …      …      …   …
  Tm   W1,m   W2,m   W3,m   …   Wn,m

Three Key Problems
1. Term selection
2. Term weighting
3. Similarity Coefficient calculation

Term Selection
• Terms represent the content of documents
• Term purification
  ◦ Stemming
  ◦ Stop list
  ◦ Only choose nouns

Term Weight
• Boolean weight: 1 = the term appears, 0 = it does not appear
• Term frequency:
  ◦ tf
  ◦ 1+log(tf)
  ◦ 1+log(1+log(tf))
• Inverse Document Frequency
  ◦ tf*idf

Term Weight (Cont'd)
• Document length
• Two opinions:
  ◦ Longer documents contain more terms
  ◦ Longer documents have more information
• Punish long documents and compensate short documents
• Pivoted normalization: 1 - b + b * doclen / avgdoclen, with b in (0, 1)

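A minimal sketch of the pivoted normalization factor follows. The slide gives only the factor itself; dividing a term weight by it, and the default b = 0.75, are assumptions made for this example.

```python
def pivoted_length_norm(doc_len, avg_doc_len, b=0.75):
    """Normalization factor 1 - b + b * doclen / avgdoclen, with b in (0, 1)."""
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    return 1.0 - b + b * doc_len / avg_doc_len


def normalized_weight(weight, doc_len, avg_doc_len, b=0.75):
    """Assumed usage: divide the raw term weight by the normalization factor."""
    return weight / pivoted_length_norm(doc_len, avg_doc_len, b)


print(pivoted_length_norm(100, 100))         # average-length document: factor 1.0
print(pivoted_length_norm(200, 100, b=0.5))  # long document: factor above 1
print(pivoted_length_norm(50, 100, b=0.5))   # short document: factor below 1
```

A factor above 1 shrinks the weights of long documents and a factor below 1 boosts those of short ones, which is the punish-and-compensate behaviour described on the slide.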
Similarity Coefficient Calculation
Here $a_i$ and $b_i$ are the weights of term $i$ in the two vectors being compared (D and Q).
• Dot product: $SC(D, Q) = \sum_i (a_i * b_i)$
• Cosine: $SC(D, Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$
• Dice: $SC(D, Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$
• Jaccard: $SC(D, Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$
[Figure: vectors D and Q plotted on term axes t1 and t2]

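The four coefficients named on this slide can be sketched over plain lists of weights, using their standard vector forms. The two-dimensional vectors mirror the t1/t2 axes of the figure, but their values are invented; returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def dot(a, b):
    """Dot product: sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def dice(a, b):
    """Twice the dot product over the sum of squared lengths."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    """Dot product over (sum of squared lengths minus the dot product)."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0


d = [1.0, 2.0]  # hypothetical weights of t1 and t2 in D
q = [2.0, 1.0]  # hypothetical weights of t1 and t2 in Q
for sc in (dot, cosine, dice, jaccard):
    print(sc.__name__, round(sc(d, q), 3))
```

Unlike the dot product, the other three coefficients are normalized by vector length, so a vector compared with itself always scores 1.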
Example
• Q: "gold silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
• Document frequency of the jth term: dfj
• Inverse document frequency: idfj = log10(n / dfj), with n = 3 documents
• tf*idf is used as the term weight here

Example (Cont'd)

  Id   Term       df   idf
  1    a          3    0
  2    arrived    2    0.176
  3    damaged    1    0.477
  4    delivery   1    0.477
  5    fire       1    0.477
  6    gold       2    0.176
  7    in         3    0
  8    of         3    0
  9    silver     1    0.477
  10   shipment   2    0.176
  11   truck      2    0.176

Example (Cont'd)
• tf*idf is used here
• SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0.477)(0) + (0)(0.176) + (0.176)(0) = 0.031
• SC(Q, D2) = 0.486
• SC(Q, D3) = 0.062
• The ranking would be D2, D3, D1.
• This SC uses the dot product.

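The whole example can be reproduced with a short script, which is a sketch under two assumptions: texts are lower-cased and split on whitespace, and D1 is taken as "Shipment of gold damaged in a fire", the wording implied by the term table. Variable and function names are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Document frequency and idf = log10(n / df) over the three documents.
n = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(text.split()))
idf = {term: math.log10(n / count) for term, count in df.items()}


def tf_idf_vector(text, idf):
    """Sparse tf*idf vector: term -> tf * idf, skipping terms unseen in the corpus."""
    counts = Counter(text.split())
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


q_vec = tf_idf_vector(query, idf)
scores = {}
for doc_id, text in docs.items():
    d_vec = tf_idf_vector(text, idf)
    # Dot product over the query terms only; other terms contribute zero.
    scores[doc_id] = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())

ranking = sorted(scores, key=scores.get, reverse=True)
print({d: round(s, 3) for d, s in scores.items()})
print(ranking)  # ['D2', 'D3', 'D1']
```

D2 wins mainly because "silver" is both rare (idf 0.477) and repeated (tf = 2), so its single term contributes far more than the two shared terms of D3.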
Advantages of VSM
• Fairly cheap to compute
• Yields decent effectiveness
• Very popular: SMART is one of the most commonly used academic prototypes

Disadvantages of VSM
• No theoretical foundation
• Weights in the vectors are very arbitrary
• Assumes term independence
• Sparse matrix