Information Retrieval
CSE 454
Administrivia
• Project & Group Status
  – How many need a group (or have a group, but need more teammates)?
  – Which groups are planning non-default topics?
• End a few minutes early
  – Group completion
  – Project questions & feedback
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Categorization
• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
Example: County vs. Country?
• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
Learning for Categorization
• A training example is an instance x ∈ X paired with its correct category c(x): ⟨x, c(x)⟩, for an unknown categorization function c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
  ∀⟨x, c(x)⟩ ∈ D : h(x) = c(x)   (Consistency)
• {< , county>, < , country>, …
© Daniel S. Weld
Why is Learning Possible?
Experience alone never justifies any conclusion about any unseen instance.
Learning occurs when PREJUDICE meets DATA!
(“Bias”)
What is the bias of the Naïve Bayes classifier?
Naïve Bayes Classifier
• Bayes Theorem (Bayes, 1702–1761):
  P(ci | E) = P(ci) P(E | ci) / P(E)
• Need to know:
  – Priors: P(ci)
  – Conditionals: P(E | ci)
• P(ci) are easily estimated from data.
• Bag-of-words representation & assumption of conditional independence:
  P(E | ci) = P(e1, e2, …, em | ci) = ∏ j=1..m P(ej | ci)
• Only need to know P(ej | ci) for each feature & category.
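The decision rule above can be sketched in a few lines of Python. The toy priors and conditional probabilities below are invented purely for illustration (they are not from the lecture data):

```python
# Naive Bayes decision rule: pick argmax_c P(c) * prod_j P(e_j | c)
def classify(doc_words, priors, conditionals):
    """priors: {class: P(c)}; conditionals: {class: {word: P(w|c)}}."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for w in doc_words:
            # tiny default probability for words unseen in class c
            score *= conditionals[c].get(w, 1e-9)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical estimates for the county-vs-country example
priors = {"county": 0.5, "country": 0.5}
conditionals = {
    "county":  {"sheriff": 0.3, "fair": 0.2, "border": 0.05},
    "country": {"border": 0.3, "anthem": 0.2, "fair": 0.05},
}
print(classify(["sheriff", "fair"], priors, conditionals))  # -> county
```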
Details
• Smoothing
  – To account for estimation from small samples, probability estimates are adjusted or smoothed.
  – m-estimate: P(ej | ci) = (nij + m·p) / (ni + m), where nij is the count of ej in class ci, ni is the total count for ci, p is a prior estimate, and m is the equivalent sample size.
  – For binary features, p is simply assumed to be 0.5; with m = 2 this gives P(ej | ci) = (nij + 1) / (ni + 2).
• Preventing Underflow
  – Since log(xy) = log(x) + log(y), sum log-probabilities instead of multiplying probabilities.
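A minimal sketch of both tricks, assuming the counts nij and ni have already been tallied from training data:

```python
import math

def smoothed_cond(n_ij, n_i):
    # Smoothing for binary features (p = 0.5, m = 2):
    # P(e_j | c_i) = (n_ij + 1) / (n_i + 2)
    return (n_ij + 1) / (n_i + 2)

def log_score(prior, cond_probs):
    # Sum logs instead of multiplying probabilities, since
    # log(xy) = log(x) + log(y); this avoids underflow when
    # a document has many features.
    return math.log(prior) + sum(math.log(p) for p in cond_probs)
```

Even a never-seen feature (n_ij = 0) gets a small nonzero probability, and the log-space score can be compared across classes without ever exponentiating.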
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Evaluating Categorization
• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
• Classification accuracy: c/n, where
  – n is the total number of test instances,
  – c is the number of correctly classified test instances.
• Results can vary based on sampling error due to different training and test sets.
  – Bummer… what should we do?
• Average results over multiple training and test sets (splits of the overall data) for the best results.
  – Bummer… that means we need lots of labeled data…
N-Fold Cross-Validation
• Ideally: test and training sets are independent on each trial.
  – But this would require too much labeled data.
• Cool idea:
  – Partition the data into N equal-sized disjoint segments.
  – Run N trials, each time holding back a different segment for testing.
  – Train on the remaining N−1 segments.
• This way, at least the test sets are independent.
• Report average classification accuracy over the N trials.
• Typically, N = 10.
• Also nice to report the standard deviation of the averages.
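The partition-and-hold-out loop above can be sketched as follows (a hand-rolled version for clarity; libraries such as scikit-learn provide equivalent helpers):

```python
def n_fold_cv(examples, n=10):
    """Yield (train, test) splits: partition examples into n segments,
    hold a different segment out for testing on each trial, and train
    on the remaining n-1 segments."""
    folds = [examples[i::n] for i in range(n)]  # n roughly equal segments
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each example appears in exactly one test set across the N trials, so averaging accuracy over the trials uses every label both for training and (once) for testing.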
Cross Validation
• Partition examples into k disjoint equivalence classes
• Now create k training sets
  – Each set is the union of all equivalence classes except one
  – So each set has (k−1)/k of the original training data
[Figure: the k segments, one held out as Test, the rest as Train.]
Learning Curves
• In practice, labeled data is usually rare and expensive.
• Would like to know how performance varies with the number of training instances.
• Learning curves plot classification accuracy on independent test data (Y axis) versus number of training examples (X axis).
N-Fold Learning Curves
• Want learning curves averaged over multiple trials.
• Use N-fold cross validation to generate N full training and test sets.
• For each trial, train on increasing fractions of the training set and measure accuracy on the test data, taking one measurement for each point on the desired learning curve.
Sample Learning Curve (Yahoo Science Data)
Syllabus
• Text Processing Tasks
  – Classification
  – Retrieval (Similarity Measures)
  – Extraction (NL text → database records)
• Techniques
  – Machine Learning
  – Vector-Space Model
  – Syntactic Analysis
  – Hyperlinks & Web Structure
• Scaling
  – Parallelism
  – Indexing
• Special Topics
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Based on slides by P. Raghavan, H. Schütze, R. Larson
Query
• Which plays of Shakespeare contain Brutus AND Caesar but NOT Calpurnia?
• Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  – Slow (for large corpora)
  – Other operations (e.g., find the Romans NEAR countrymen) not feasible
Term-document incidence
1 if play contains word, 0 otherwise
| Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|-----------|----------------------|---------------|-------------|--------|---------|---------|
| Antony    | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus    | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar    | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy     | 1 | 0 | 1 | 1 | 1 | 1 |
| worser    | 1 | 0 | 1 | 1 | 1 | 0 |
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
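The same AND can be computed directly on Python integers used as bit vectors (a sketch; leftmost bit = Antony and Cleopatra, rightmost = Macbeth):

```python
# Incidence rows from the term-document matrix, as 6-bit integers
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111          # six plays: keeps the complement to 6 bits

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))  # -> 100100
```

The two set bits correspond to Antony and Cleopatra and Hamlet, the two plays listed on the next slide.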
Answers to query
• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii …
Bigger corpora
• Consider n = 1M documents,
  – each with about 1K terms.
• Avg 6 bytes/term incl spaces/punctuation
  – ⇒ 6GB of data.
• Say there are m = 500K distinct terms
Can’t build the matrix
• 500K x 1M matrix has half-a-trillion 0’s and 1’s.
• But it has no more than one billion 1’s.– matrix is extremely sparse.
• What’s a better representation?
Why?
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Inverted index
• Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

(Term, Doc #) pairs, in order of appearance:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
• After all documents have been parsed, the inverted file is sorted by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2
• Multiple term entries in a single document are merged, and frequency information is added (Term, Doc #, Freq):
ambitious 2 1, be 2 1, brutus 1 1, brutus 2 1, capitol 1 1, caesar 1 1, caesar 2 2, did 1 1, enact 1 1, hath 2 1, I 1 2, i' 1 1, it 2 1, julius 1 1, killed 1 2, let 2 1, me 1 1, noble 2 1, so 2 1, the 1 1, the 2 1, told 2 1, you 2 1, was 1 1, was 2 1, with 2 1
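The parse → sort → merge pipeline of the last three slides can be sketched as a single dictionary build. This is a simplification (real indexers tokenize properly and sort runs on disk; the documents below are stripped of punctuation for the toy tokenizer):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: freq}}, which merges
    repeated (term, doc) pairs and records their frequency."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] += 1
    return index

docs = {1: "I did enact Julius Caesar I was killed",
        2: "So let it be with Caesar the noble Brutus"}
idx = build_index(docs)
print(dict(idx["caesar"]))  # postings with frequencies for "caesar"
```

Because Python dicts are hash-based, the explicit sort step disappears here; a disk-based indexer would instead sort the (term, doc) pairs and merge adjacent duplicates, exactly as on the slides.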
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Issues with index we just built
• How do we process a query?
• What terms in a doc do we index?
  – All words or only “important” ones?
• Stopword list: terms that are so common that they’re ignored for indexing.
  – e.g., the, a, an, of, to …
  – language-specific.
Issues in what to index
• Cooper’s vs. Cooper vs. Coopers.
• Full-text vs. full text vs. {full, text} vs. fulltext.
• Accents: résumé vs. resume.
Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.
Punctuation
• Ne’er: use language-specific, handcrafted “locale” to normalize.
• State-of-the-art: break up hyphenated sequence.
• U.S.A. vs. USA - use locale.
• a.out
Numbers
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
  – Generally, don’t index as text
  – Creation dates for docs
Case folding
• Reduce all letters to lower case
  – exception: upper case in mid-sentence
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
Thesauri and soundex
• Handle synonyms and homonyms– Hand-constructed equivalence classes
• e.g., car = automobile
• your & you’re
Spell correction
• Look for all words within (say) edit distance 3 (insert/delete/replace) at query time
  – e.g., Alanis Morisette
• Spell correction is expensive and slows the query (up to a factor of 100)
  – Invoke only when the index returns zero matches?
  – What if docs contain misspellings?
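The insert/delete/replace distance mentioned above is the Levenshtein edit distance; a short memoized sketch:

```python
from functools import lru_cache

def edit_distance(a, b):
    """Levenshtein distance with insert/delete/replace, as used to find
    candidate corrections within a small distance of the query term."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j  # insert the first j chars of b
        if j == 0:
            return i  # delete the first i chars of a
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,        # delete
                   d(i, j - 1) + 1,        # insert
                   d(i - 1, j - 1) + cost) # replace (or match)
    return d(len(a), len(b))

print(edit_distance("morisette", "morissette"))  # -> 1
```

Scanning the whole vocabulary this way is what makes correction expensive, which is why one might invoke it only on zero-match queries.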
Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.,
  – am, are, is → be
  – car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color”
Stemming
• Reduce terms to their “roots” before indexing
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat.

Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compres and compres are both accept as equival to compres.
Porter’s algorithm
• Commonest algorithm for stemming English
• Conventions + 5 phases of reductions
  – phases applied sequentially
  – each phase consists of a set of commands
  – sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
• Porter’s stemmer available:
  http://www.sims.berkeley.edu/~hearst/irbook/porter.html
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
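A toy sketch of one Porter-style rewrite pass using just these four rules (NOT the full five-phase algorithm), illustrating the longest-suffix convention from the previous slide:

```python
# (suffix, replacement) rules from the slide above
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    """Among the rules whose suffix matches, apply the one with the
    longest suffix (the sample convention from Porter's algorithm)."""
    matches = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda sr: len(sr[0]))
    return word[: -len(suffix)] + repl

print(apply_rules("relational"))  # -> relate
```

Note that "relational" matches both "ational" and "tional"; the longest-suffix rule picks "ational", giving "relate" rather than "relation".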
Beyond term search
• What about phrases?
• Proximity: Find Gates NEAR Microsoft.
  – Need index to capture position information in docs.
• Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
Evidence accumulation
• 1 vs. 0 occurrence of a search term– 2 vs. 1 occurrence– 3 vs. 2 occurrences, etc.
• Need term frequency information in docs
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
Ranking search results
• Boolean queries give inclusion or exclusion of docs.
• Want proximity from query to each doc.
Precision and Recall
• Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
• Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)
|               | Relevant | Not Relevant |
|---------------|----------|--------------|
| Retrieved     | tp       | fp           |
| Not Retrieved | fn       | tn           |
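The two definitions follow directly from the contingency table; a small sketch over sets of document IDs (the example IDs are made up):

```python
def precision_recall(retrieved, relevant):
    """P = tp/(tp+fp), R = tp/(tp+fn), from sets of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)           # retrieved AND relevant
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # 0.5 and about 0.667
```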
Precision & Recall
• Precision
  – Proportion of selected items that are correct: tp / (tp + fp)
• Recall
  – Proportion of target items that were selected: tp / (tp + fn)
• Precision-Recall curve
  – Shows tradeoff
[Figure: Venn diagram of “System returned these” vs. “Actual relevant docs”; the overlap is tp, with fp, fn, and tn in the remaining regions.]
Precision/Recall
• Easy to get high precision (low recall)
• Easy to get high recall (but low precision)
• Recall is a non-decreasing function of # docs retrieved
  – Precision usually decreases (in a good system)
• Difficulties in using precision/recall
  – Need human relevance judgements
  – Binary relevance
  – Skewed by corpus/authorship
  – Must average over large corpus / query set
A combined measure: F
• Combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):
  F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α
• People usually use the balanced F1 measure
  – i.e., with β = 1 (or α = ½)
• Harmonic mean is a conservative average
  – See C. J. van Rijsbergen, Information Retrieval
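The β-weighted form of the F measure translates directly into code:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall:
    F = (beta^2 + 1) * P * R / (beta^2 * P + R).
    beta = 1 gives the balanced F1 measure."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.5))  # -> 0.5 (balanced F1)
```

Being a harmonic mean, F is pulled toward the smaller of P and R, which is why it is called a conservative average.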
Precision-Recall Curves
• Evaluation of ranked results:
  – You can return any number of results, ordered by similarity
  – By taking various numbers of documents (levels of recall), you can produce a precision-recall curve
Precision-recall curves
Evaluation
• There are various other measures
  – Precision at fixed recall
    • This is perhaps the most appropriate thing for web search: all people want to know is how many good matches there are in the first one or two pages of results
  – 11-point interpolated average precision
    • The standard measure in the TREC competitions: take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them
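One way to sketch the 11-point measure, under the common interpolation rule that precision at recall r is taken as the maximum precision observed at any recall ≥ r (a simplification of the full TREC procedure):

```python
def eleven_point_avg_precision(pr_points):
    """pr_points: list of (recall, precision) pairs from a ranked run.
    Interpolated precision at recall r = max precision at any recall >= r;
    average over r = 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for k in range(11):
        r = k / 10
        candidates = [p for rec, p in pr_points if rec >= r]
        total += max(candidates, default=0.0)
    return total / 11
```

With this rule the value at recall 0 is always interpolated from later points, matching the parenthetical note above.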
Ranking models in IR
• Key idea:
  – We wish to return, in order, the documents most likely to be useful to the searcher
• To do this, we want to know which documents best satisfy a query
  – An obvious idea: if a document talks about a topic more, then it is a better match
• Must a document have all of the terms?
Binary term presence matrices
• Record whether a document contains a word: each document is a binary vector in {0,1}^v
• Idea: query satisfaction = overlap measure: |X ∩ Y|

| Term      | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|-----------|----------------------|---------------|-------------|--------|---------|---------|
| Antony    | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus    | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar    | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy     | 1 | 0 | 1 | 1 | 1 | 1 |
| worser    | 1 | 0 | 1 | 1 | 1 | 0 |
Overlap matching
• What are the problems with the overlap measure?
• It doesn’t consider:
  – Term frequency in document
  – Term scarcity in collection (how many documents mention the term?)
  – Length of documents
Many Overlap Measures
• Simple matching (coordination level match): |Q ∩ D|
• Dice’s Coefficient: 2|Q ∩ D| / (|Q| + |D|)
• Jaccard’s Coefficient: |Q ∩ D| / |Q ∪ D|
• Cosine Coefficient: |Q ∩ D| / (|Q|^½ · |D|^½)
• Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
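Treating the query Q and document D as sets of terms, the five measures can be computed side by side (the example term sets are made up):

```python
def overlap_measures(q, d):
    """All five overlap measures for term sets q (query) and d (document)."""
    q, d = set(q), set(d)
    inter = len(q & d)
    return {
        "simple":  inter,                                   # |Q ∩ D|
        "dice":    2 * inter / (len(q) + len(d)),           # 2|Q∩D|/(|Q|+|D|)
        "jaccard": inter / len(q | d),                      # |Q∩D|/|Q∪D|
        "cosine":  inter / (len(q) ** 0.5 * len(d) ** 0.5), # |Q∩D|/(|Q|^½|D|^½)
        "overlap": inter / min(len(q), len(d)),             # |Q∩D|/min(|Q|,|D|)
    }

m = overlap_measures({"brutus", "caesar"}, {"caesar", "calpurnia", "antony"})
```

The cosine form here is the binary-vector special case of the cosine similarity used in the vector-space model later in the lecture.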
Topics for Today
• Review
• Cross Validation
• IR & Incidence Vectors
• Inverted Indices
• Special Cases
• Evaluation
• Vector-Space Model
58
Documents as vectors
• Each doc j can be viewed as a vector of tf values, one component for each term
• So we have a vector space
  – terms are axes
  – docs live in this space
  – even with stemming, may have 20,000+ dimensions
• (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live – transposable data)
Based on slides by P. Raghavan, H. Schütze, R. Larson
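The doc-as-tf-vector idea above can be sketched in a few lines (the toy documents are invented for illustration):

```python
from collections import Counter

# Represent each document as a vector of raw term frequencies over a
# shared vocabulary: one axis per term, as described above.
docs = ["to be or not to be", "to sleep perchance to dream"]
vocab = sorted({w for d in docs for w in d.split()})

def tf_vector(text):
    """Term-frequency vector of `text` over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

vectors = [tf_vector(d) for d in docs]
print(vocab)
print(vectors)
```

Stacking these row vectors gives the term-document matrix mentioned above; transposing it gives vectors for words instead of docs.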
Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.
Based on slides by P. Raghavan, H. Schütze, R. Larson
The vector space model
Query as vector:
• Regard query as short document
• Return the docs, ranked by distance to the query
• Easy to compute, since both query & docs are vectors.
• Developed in the SMART system (Salton, c. 1970) and standardly used by TREC participants and web IR systems
Based on slides by P. Raghavan, H. Schütze, R. Larson
Vector Representation
• Documents & Queries represented as vectors.
• Position 1 corresponds to term 1, …position t to term t
• The weight of the term is stored in each position
• Vector distance measure used to rank retrieved documents
D_i = (d_i1, d_i2, ..., d_it)
Q   = (w_q1, w_q2, ..., w_qt)

w = 0 if a term is absent
Based on slides by P. Raghavan, H. Schütze, R. Larson
Documents in 3D Space
Documents that are close to the query (measured using a vector-space metric) => returned first.
Query
Based on slides by P. Raghavan, H. Schütze, R. Larson
Document Space has High Dimensionality
• What happens beyond 2 or 3 dimensions?
  – Similarity still has to do with the number of shared tokens.
  – More terms -> harder to understand which subsets of words are shared among similar documents.
Based on slides by P. Raghavan, H. Schütze, R. Larson
Word Frequency
• Which word is more indicative of document similarity: ‘book,’ or ‘Rumplestiltskin’?
  – Need to consider “document frequency”---how frequently the word appears in the doc collection.
• Which doc is a better match for the query “Kangaroo”?
  – One w/ a single mention of Kangaroos… or 10 times?
  – Need to consider “term frequency”---how many times the word appears in the current document.
Based on slides by P. Raghavan, H. Schütze, R. Larson
TF x IDF
w_ik = tf_ik · log(N / n_k)

where:
  T_k    = term k in document D_i
  tf_ik  = frequency of term T_k in document D_i
  idf_k  = inverse document frequency of term T_k in collection C
         = log(N / n_k)
  N      = total number of documents in the collection C
  n_k    = number of documents in C that contain T_k
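The weighting formula above is a one-liner in code. This sketch uses base-10 log (to match the IDF examples on the next slide); the tf, N, and n_k values are invented for illustration:

```python
import math

def tfidf(tf_ik, N, n_k):
    """TF x IDF weight: w_ik = tf_ik * log10(N / n_k)."""
    return tf_ik * math.log10(N / n_k)

# Term occurs 3 times in the doc, and in 20 of 10,000 collection docs.
w = tfidf(3, 10000, 20)
print(round(w, 3))  # 3 * log10(500) ≈ 8.097
```

A term appearing in every document gets weight 0 regardless of tf, since log10(N/N) = 0.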
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
log(10000 / 10000) = 0
log(10000 / 5000)  = 0.301
log(10000 / 20)    = 2.699
log(10000 / 1)     = 4
Based on slides by P. Raghavan, H. Schütze, R. Larson
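The IDF values above can be reproduced (up to rounding) for a 10,000-document collection; rare terms get high IDF, ubiquitous terms get ~0:

```python
import math

# IDF = log10(N / n_k) for a collection of N = 10,000 documents.
N = 10000
for n_k in (10000, 5000, 20, 1):
    print(f"log({N}/{n_k}) = {math.log10(N / n_k):.3f}")
```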
TF-IDF normalization
• Normalize the term weights
  – so longer docs are not given more weight (fairness)
  – force all values to fall within a certain range: [0, 1]
w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
Based on slides by P. Raghavan, H. Schütze, R. Larson
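The normalization above amounts to dividing each weight by the Euclidean norm of the document’s weight vector, which makes every document a unit-length vector. A minimal sketch (the weight values are hypothetical):

```python
import math

def normalize(weights):
    """Divide each weight by the vector's Euclidean norm (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

doc = [8.097, 2.699, 0.301]            # raw tf-idf weights for one doc
unit = normalize(doc)
print(round(sum(w * w for w in unit), 6))  # unit length -> 1.0
```

After this step, the inner product of two document vectors is exactly their cosine similarity, as the next slide uses.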
Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

sim(D_i, D_j) = Σ_{k=1..t} w_ik · w_jk

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
Based on slides by P. Raghavan, H. Schütze, R. Larson
What’s Cosine anyway?
One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.
From http://mathworld.wolfram.com/Cosine.html
Based on slides by P. Raghavan, H. Schütze, R. Larson
Cosine Detail (degrees)
Based on slides by P. Raghavan, H. Schütze, R. Larson
Computing Cosine Similarity Scores
[Figure: two documents and a query plotted in 2-D term space]
  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)
  Q  = (0.4, 0.8)
  cos α1 = 0.74  (angle between Q and D1)
  cos α2 = 0.98  (angle between Q and D2)
Based on slides by P. Raghavan, H. Schütze, R. Larson
Computing a similarity score
Say we have query vector Q = (0.4, 0.8)
Also, document D2 = (0.2, 0.7)
What does their similarity comparison yield?

sim(Q, D2) = [(0.4 · 0.2) + (0.8 · 0.7)] / sqrt{ [(0.4)² + (0.8)²] · [(0.2)² + (0.7)²] }
           = 0.64 / sqrt(0.42)
           ≈ 0.98
Based on slides by P. Raghavan, H. Schütze, R. Larson
Summary: Why use vector spaces?
• User’s query treated as a (very) short document.
• Query a vector in the same space as the docs.
• Easily measure each doc’s proximity to query.
• Natural measure of scores/ranking – No longer Boolean.
Based on slides by P. Raghavan, H. Schütze, R. Larson