Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Modern Information Retrieval: A Brief Overview
By Amit Singhal
Ranjan Dash
Layout
- History
- Models & Implementations
- Evaluation
- Key Techniques
  - Term Weighting
  - Query Modification
- Other Techniques and Applications
- Conclusion
History
- Starts around 3000 BC with the Sumerians
- Major IR developments started in the 1950s and 1960s
  - 1950s: Vannevar Bush, Luhn
  - 1960s:
    - SMART system: Gerard Salton
    - Cranfield evaluation: Cyril Cleverdon
- 1970s & 1980s: various models for document retrieval on small text collections
- 1992: TREC (Text Retrieval Conference)
- Other fields, such as retrieval of spoken information, non-English language retrieval, and information filtering
- Modern textual IR: WWW search, 1996 - 1998
Models & Implementations
- IR systems
  - Boolean systems
  - Ranked retrieval systems
- Models
  - Vector space model
  - Probabilistic model
  - Inference network model
- Implementation
Models & Implementations..
- Vector space model
  - Every word in the vocabulary is an independent dimension
  - Documents and queries are vectors in this high-dimensional space
  - All vectors lie in the positive quadrant of the vector space
  - Numeric similarity between query vector and document vector: the cosine of the angle between them
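The cosine measure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the raw term counts used as vector components, and the function names `to_vector`, `cosine` and `rank` are all hypothetical choices, not something the slides specify.

```python
import math
from collections import Counter


def to_vector(text):
    """Map text to a sparse term-count vector, one dimension per word.

    Assumption: lowercase whitespace tokenization and raw counts as weights.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Because all components are non-negative (the positive quadrant), the score always falls between 0 and 1, with 1 meaning the two vectors point in the same direction.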
Models & Implementations..
- Probabilistic model
  - Probabilistic Ranking Principle (PRP): documents are ranked by decreasing probability of their relevance to a query
  - Maron and Kuhns, 1960
  - Probability of relevance for document D:
P(R|D) = P(D|R) P(R) / P(D)   (Bayes' rule)
Models & Implementations..
Assumptions:
Models & Implementations..
- Inference network model
  - Retrieval is modeled as an inference process in an inference network
  - A document instantiates a term with a certain strength, and credit from multiple terms is accumulated
  - The strength of instantiation of a term acts as its weight
  - Document ranking in this model is similar to ranking in the vector space or probabilistic models
Models & Implementations..
- Implementation
  - Inverted list
  - Stop words
  - Stemming: of limited benefit for English, more effective for languages with many word inflections, such as German
  - Multiword phrases
    - Techniques to generate a list of phrases: linguistic, statistical
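As a rough illustration of the inverted list and stop-word items above, the following Python sketch builds a small in-memory index. The stop-word set, the whitespace tokenizer and the names `build_inverted_index` and `lookup` are assumptions made for this example; stemming and phrase handling are left out.

```python
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def build_inverted_index(docs):
    """Map each non-stop-word term to {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term in STOP_WORDS:
                continue  # stop words are not indexed
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def lookup(index, term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), {}))
```

A query term is then answered by reading one posting list instead of scanning every document, which is the point of the inverted organization.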
Evaluation
- Objective evaluation
  - Cranfield tests
- Characteristics for search effectiveness
  - Recall: proportion of the relevant documents that are retrieved by the system
  - Precision: proportion of the retrieved documents that are relevant
  - Average precision: the average of the precision values at different recall points
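The three measures can be computed as in the sketch below. It assumes a ranked list of unique document ids and a set of ids judged relevant. For average precision it uses the common non-interpolated form (precision at the rank of each relevant document retrieved, divided by the total number of relevant documents), which is one way to read "different recall points" and may differ from the variant the slides had in mind.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for d in set(retrieved) if d in relevant)
    return hits / len(relevant)


def average_precision(ranking, relevant):
    """Non-interpolated average precision over a ranked list.

    Each relevant document found marks a new recall point; relevant
    documents that are never retrieved contribute a precision of zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / len(relevant)
```

For the ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3", "d5"}`, precision is 2/4, recall is 2/3, and average precision is (1/1 + 2/3) / 3 = 5/9.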
Key Techniques
- Term weight
  - Term frequency
    - Raw tf: non-optimal
    - Dampened tf (logarithmic tf): better
  - Okapi weighting
  - Pivoted normalization weighting
  - Document frequency
  - Document length
- Query modification/expansion via relevance feedback
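To make the term frequency and document frequency factors concrete, here is a minimal sketch of one common weighting, `(1 + ln tf) * ln(N / df)`. This particular formula is an assumption chosen for illustration; it is not the Okapi or pivoted normalization scheme named on the slide, and it deliberately omits the document length factor. The function name `tf_idf_weights` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    Weight = dampened tf * inverse document frequency:
        (1 + ln(tf)) * ln(N / df)
    where N is the number of documents and df is the number of
    documents containing the term. No length normalization is applied.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights
```

The logarithm keeps a term that occurs twenty times from counting twenty times as much as a term that occurs once, and a term that appears in every document gets weight zero because it does not discriminate between documents.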
Key Techniques
- Query modification/expansion
  - Adding synonyms: limited by the lack of query context
  - Relevance feedback: Rocchio, 1965
    - Uses user judgments to modify the query
    - Quite effective
  - Pseudo-feedback for short user queries
    - The top few documents retrieved by the initial user query are assumed to be relevant, and relevance feedback is applied to them to generate a new query
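A Rocchio-style update can be sketched as follows. Queries and documents are assumed to be sparse `{term: weight}` dicts. The parameter names `alpha`, `beta` and `gamma`, their default values, and the choice to drop terms whose weight ends up non-positive are all assumptions for this illustration, not values given in the slides.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones.

    new_query = alpha * query
                + beta  * mean(relevant vectors)
                - gamma * mean(non-relevant vectors)
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Assumption: terms with non-positive weight are dropped from the new query.
    return {t: w for t, w in new_query.items() if w > 0}


def pseudo_feedback(query, ranked_docs, k=2):
    """Pseudo-feedback: treat the top k retrieved documents as relevant."""
    return rocchio(query, ranked_docs[:k])
```

With `pseudo_feedback`, no user judgment is needed: the expanded query picks up terms from the top-ranked documents, which helps most when the original query is short.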
Other Techniques and Applications
- Cluster hypothesis: documents that cluster together have similar relevance profiles for a query
- Natural language processing (NLP): not very effective for IR
- Other IR fields besides document ranking
  - Information filtering (IF), topic detection and tracking (TDT), speech retrieval, cross-language retrieval
Conclusion
- 40 years of experience in IR
- Statistical techniques have proven the most effective