Information Retrieval
(C) 2003, The University of Michigan 1
Information Retrieval
Handout #2
February 3, 2003
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M&F 11-12
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
Queries and documents
Queries
• Single-word queries
• Context queries
  – Phrases
  – Proximity
• Boolean queries
• Natural Language queries
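A conjunctive Boolean query can be answered by intersecting posting lists. A minimal sketch (the toy inverted index below is hypothetical, not from the slides):

```python
# Toy inverted index: term -> sorted list of document IDs (hypothetical data)
index = {
    "safety":   [1, 2, 5],
    "minivans": [1, 5, 7],
    "tests":    [2, 5],
}

def boolean_and(index, terms):
    """Answer a Boolean AND query by intersecting the terms' posting lists."""
    postings = [set(index.get(t, [])) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(boolean_and(index, ["safety", "minivans"]))  # [1, 5]
```

Real systems intersect sorted lists by merging rather than building sets, but the result is the same.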
Pattern matching
• Words, prefixes, suffixes, substrings, ranges, regular expressions
• Structured queries (e.g., XML)
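Prefix and substring queries can be expressed as regular expressions. A small illustration (the text and pattern are made up for the example):

```python
import re

# The prefix query "comput*" expressed as a regular expression
pattern = re.compile(r"\bcomput\w*", re.IGNORECASE)
text = "Computers compute; computation aids computing."
print(pattern.findall(text))  # ['Computers', 'compute', 'computation', 'computing']
```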
Relevance feedback
• Query expansion
• Term reweighting
• Pseudo-relevance feedback
• Latent semantic indexing
• Distributional clustering
Document processing
• Lexical analysis
• Stopword elimination
• Stemming
• Index term identification
• Thesauri
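The first three steps above can be sketched as a small pipeline. This is a minimal illustration: the stopword list is a toy one, and the plural-stripping line stands in for a real stemmer such as Porter's:

```python
import re

# Toy stopword list; real systems use much larger ones (assumption)
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]     # stopword elimination
    # crude plural stripping as a stand-in for a real stemmer
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The tests of the cars"))  # ['test', 'car']
```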
Porter’s algorithm
1. The measure, m, of a stem is a function of its sequences of vowels followed by consonants. If V is a sequence of vowels and C is a sequence of consonants, then a stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
   m=0: free, why
   m=1: frees, whose
   m=2: prologue, compute
2. *<X> - stem ends with letter X
3. *v* - stem ends in a vowel
4. *d - stem ends in a double consonant
5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
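The measure m can be computed by collapsing the stem into runs of consonants and vowels and counting the VC pairs. A minimal sketch (the simplified y-handling follows Porter's convention that y after a consonant counts as a vowel):

```python
def measure(stem):
    """Porter's measure m: the number of VC sequences in [C](VC)^m[V]."""
    vowels = "aeiou"
    runs = []
    for i, ch in enumerate(stem):
        # y counts as a vowel when it follows a consonant (Porter's convention)
        is_v = ch in vowels or (ch == "y" and i > 0 and stem[i - 1] not in vowels)
        tag = "V" if is_v else "C"
        if not runs or runs[-1] != tag:     # collapse repeats into one run
            runs.append(tag)
    return "".join(runs).count("VC")

for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(w, measure(w))  # matches the m=0/1/2 examples on the slide
```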
Porter’s algorithm
• Suffix conditions take the form: current_suffix == pattern
• Actions are of the form: old_suffix -> new_suffix
• Rules are divided into steps that define the order in which the rules are applied. Some examples of the rules:

| STEP | CONDITION | SUFFIX | REPLACEMENT | EXAMPLE |
|------|-----------|--------|-------------|---------|
| 1a | NULL | sses | ss | stresses -> stress |
| 1b | *v* | ing | NULL | making -> mak |
| 1b1 | NULL | at | ate | inflat(ed) -> inflate |
| 1c | *v* | y | i | happy -> happi |
| 2 | m>0 | aliti | al | formaliti -> formal |
| 3 | m>0 | icate | ic | duplicate -> duplic |
| 4 | m>1 | able | NULL | adjustable -> adjust |
| 5a | m>1 | e | NULL | inflate -> inflat |
| 5b | m>1 and *d | NULL | single letter | controll -> control |
Porter’s algorithm
Example: the word “duplicatable”
duplicatable -> duplicat   (rule 4)
duplicat -> duplicate      (rule 1b1)
duplicate -> duplic        (rule 3)

Another rule in step 4, removing “ic,” cannot be applied, since only one rule from each step is allowed to be applied.
Porter’s algorithm
Computable -> Comput
Intervention -> Intervent
Retrieval -> Retriev
Document -> Docum
Representing -> Repres
Representative -> Repres
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
Q’ = a1Q + a2R - a3N
Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
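The formula above (Rocchio-style feedback) can be sketched over sparse term vectors. A minimal sketch, assuming documents are represented as term-weight dictionaries; clipping negative weights to zero is a common convention, not something the slide states:

```python
from collections import Counter

def rocchio(Q, R, N, a1=1.0, a2=None, a3=None):
    """Q' = a1*Q + a2*sum(R) - a3*sum(N), with a2 = 1/|R| and a3 = 1/|N| by default."""
    a2 = a2 if a2 is not None else (1.0 / len(R) if R else 0.0)
    a3 = a3 if a3 is not None else (1.0 / len(N) if N else 0.0)
    Qp = Counter()
    for t, w in Q.items():
        Qp[t] += a1 * w
    for d in R:                      # add the (scaled) relevant centroid
        for t, w in d.items():
            Qp[t] += a2 * w
    for d in N:                      # subtract the (scaled) non-relevant centroid
        for t, w in d.items():
            Qp[t] -= a3 * w
    # negative weights are commonly clipped to zero
    return {t: w for t, w in Qp.items() if w > 0}
```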
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• S = ?
• Q’ = ?
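One possible worked answer, under the assumption that R is the relevant set {D1, D2}, S is the non-relevant set {D3}, terms get binary weights, and a1 = 1, a2 = 1/|R|, a3 = 1/|S|:

```python
from collections import Counter

# Binary term vectors for the example (assumption: R = relevant, S = non-relevant)
D1 = Counter("car safety minivans tests injury statistics".split())  # relevant
D2 = Counter("liability tests safety".split())                       # relevant
D3 = Counter("car passengers injury reviews".split())                # non-relevant
Q = Counter("safety minivans".split())

R, S = [D1, D2], [D3]

Qp = Counter(Q)
for d in R:
    for t, w in d.items():
        Qp[t] += w / len(R)          # a2 = 1/|R|
for d in S:
    for t, w in d.items():
        Qp[t] -= w / len(S)          # a3 = 1/|S|

# Keep only positive weights (a common convention)
Qp_pos = {t: w for t, w in Qp.items() if w > 0}
print(sorted(Qp_pos.items(), key=lambda x: -x[1]))
```

With these assumptions, "safety" rises to 2.0 and "minivans" to 1.5, while terms seen only in the non-relevant document ("passengers", "reviews") drop out.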
Automatic query expansion
• Thesaurus-based expansion
• Distributional similarity-based expansion
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/clair3/tools/relatedwords/relate reason
Related (substitutable) words
WordNet:
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint

Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Indexing and searching
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
IDF(w) = log(N / DF(w))
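The IDF formula, where N is the number of documents in the collection and DF(w) the number of documents containing w, can be sketched directly (the toy collection below is made up for the example):

```python
import math

def idf(term, docs):
    """IDF(w) = log(N / DF(w)) over a collection of tokenized documents."""
    df = sum(1 for d in docs if term in d)          # document frequency DF(w)
    return math.log(len(docs) / df) if df else 0.0

docs = [
    {"car", "safety", "minivans"},
    {"liability", "tests", "safety"},
    {"car", "passengers", "reviews"},
]
print(idf("safety", docs))    # log(3/2): appears in 2 of 3 documents
print(idf("minivans", docs))  # log(3/1): appears in 1 of 3
```

Rare terms get high IDF; a term that appears in every document gets IDF of 0.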
Scripts to compute tf and idf
cd /clair4/class/ir-w03/hw2
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-idf.pl
./idf.pl | sort -n +2 | more
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
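Cosine similarity over sparse TFIDF vectors can be sketched as follows (the example vectors are illustrative, not from the slides):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = {"safety": 1.0, "minivans": 1.0}    # hypothetical query vector
d = {"safety": 0.5, "tests": 0.7}       # hypothetical document vector
print(cosine(q, d))
```

The same function serves all three applications: ranking documents against a query, comparing documents during indexing, and measuring closeness during clustering.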