Information Retrieval (for beginners)
Information Retrieval
James Melzer
June 15, 2006
How Does Search Work?
The basics of search
• A search engine mediates between the user’s query and metadata surrogates for documents
• Documents are reduced to metadata
• User’s need is translated into a query
• Query terms are used to find matching metadata terms
• Lots and lots of room for error...
The search process
1. Crawl content for metadata
2. Index document terms into an inverted file; an inverted file is very fast to search
3. Search the index to identify the result set; search the index, not the documents
4. Rank the results for display; ranking is the hardest part
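The indexing and searching steps can be sketched in a few lines of Python. This is a toy illustration, not how any particular engine works: the function names, the whitespace tokenizer, and the AND-only matching are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenizing.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document collection.
docs = {
    1: "search engines index documents",
    2: "ranking search results is hard",
    3: "documents are reduced to metadata",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search documents")))
```

Note that `search` only ever touches the index. The documents themselves are read once, at indexing time, which is why the lookup is fast.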
Search algorithm 1
Term-based Ranking (tf/idf)
• tf = term frequency: documents that use the query terms most are presumed to be most relevant
• idf = inverse document frequency: terms that are more rare are better indicators of relevance
• Assumptions: 1) relevance can be measured with document terms
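A minimal sketch of tf/idf ranking follows. Several choices here are assumptions rather than anything the slides specify: tf is the term count divided by document length, idf is the natural log of (number of documents / number of documents containing the term), and tokenizing is a plain whitespace split. Real systems use many variants of these weights.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Return (doc_id, score) pairs sorted by summed tf * idf, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    query_terms = query.lower().split()
    scored = []
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term] == 0 or not tokens:
                continue  # term appears nowhere, or the document is empty
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical collection.
docs = {
    "a": "the cat sat on the mat",
    "b": "the dog chased the cat and the other cat",
    "c": "the bird sang",
}
for doc_id, score in tf_idf_rank(docs, "cat"):
    print(doc_id, round(score, 4))
```

With this data, "b" ranks above "a" for the query "cat" because it uses the term more densely, and "c" scores zero. A query for "the" scores zero everywhere: a term that appears in every document has an idf of log(1) = 0, so it says nothing about relevance.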
Search algorithm 2
PageRank (Google)
• Relevant set is still identified by term matching
• A revolution in ranking: based on linking between documents
• Assumptions: 1) important sites link to other important sites 2) if many people link to a site, it is important
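The link-based idea can be sketched as a simple power iteration. This is a textbook-style simplification, not Google's production ranking: the damping factor of 0.85, the fixed iteration count, the handling of pages with no outbound links, and the tiny link graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical site graph.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph "home" comes out on top because every other page links to it, and "blog" comes out last because nothing links to it. Both assumptions from the slide are visible: links from a highly ranked page pass on more rank, and more inbound links mean more rank.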
Citation Analysis
• Authors carefully select articles to cite
• The more citations an article gets, the better it must be
• Citations by authors who have a lot of citations confer their power on those they cite
• Aggregate and leverage all these small individual decisions...
How Complex is Google?
Google has about 36 ranking algorithms
Examples:
Citation Analysis
Statistical Clustering
Parsing Document Structure
Parsing Data in the Document
Microcontent Parsing
How to Make Search Better?
Evaluating Search
Recall
the percentage of all relevant documents retrieved
100% recall means every relevant document is retrieved
Precision
the percentage of documents retrieved that are relevant
100% precision means only relevant documents are retrieved
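Both measures reduce to simple set arithmetic once the relevant documents are known. The sketch below assumes exactly that: the document ids and relevance judgments are made up, and in practice the judgments are the hard part.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 5 documents actually relevant.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4],
                                     relevant=[2, 4, 5, 6, 7])
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Here two of the four retrieved documents are relevant (50% precision), and two of the five relevant documents were found (40% recall).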
Thoughts & Reservations about Evaluating Search
• Precision and Recall usually trade off against each other, so improving one often reduces the other.
• Given a corpus of content like the web (tens of billions of items)... Recall is unmeasurable, and thus essentially meaningless
• What is relevance?
• Measuring Precision depends on an agreed definition of relevance, which is tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard to quantify)
Best Bets
• Manually selected results, tied to specific query terms or phrases
• User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply
• Business-driven phrases: select phrases important to the business, such as product names or office locations, or politically sensitive phrases, so you can control the message people see
[Figure: Zipf distribution]
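One simple way to implement best bets is a hand-maintained lookup table that is consulted before the regular results are shown. Everything in this sketch is hypothetical: the phrases, the URLs, the `search_with_best_bets` name, and the stand-in `fake_organic_search` function.

```python
# Hand-curated table: normalized query phrase -> manually selected results.
BEST_BETS = {
    "vacation policy": ["/hr/vacation-policy"],
    "office locations": ["/about/offices"],
}

def search_with_best_bets(query, organic_search):
    """Put manually selected results ahead of the engine's own ranking."""
    # Normalize case and whitespace so small variations still match.
    key = " ".join(query.lower().split())
    pinned = BEST_BETS.get(key, [])
    organic = [r for r in organic_search(query) if r not in pinned]
    return pinned + organic

def fake_organic_search(query):
    """Stand-in for a real engine; always returns the same two results."""
    return ["/news/2006/holiday-schedule", "/hr/vacation-policy"]

print(search_with_best_bets("Vacation  Policy", fake_organic_search))
print(search_with_best_bets("parking", fake_organic_search))
```

Queries without a best bet fall through to the regular results unchanged, so the table only needs entries for the handful of phrases worth curating.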
Relevance Feedback
• The user provides direct or indirect feedback on the search results
• Click tracking
• “More like this” or “Find similar”
• Clustering
Structured Search
• Designers use patterns in search behavior to guess the user’s intent; this requires a substantial understanding of user behavior; it may require structured content (although not necessarily)
Examples
• Zip Code -> Zip Code Lookup Tool
• Person’s name -> Directory Listing
• Product Name -> Shop or Support?
• Address -> Map this?
• Topic -> Introduction, Forms, Policies or Reports?
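Pattern-based routing like the first and fourth examples can be sketched with regular expressions. The patterns, intent labels, and `guess_intent` function below are assumptions for illustration: the ZIP pattern covers US formats only, and the address pattern is deliberately crude. Names and product names cannot be recognized by shape alone and would need a lookup against a directory or catalog.

```python
import re

# Ordered list of (pattern, intent); the first match wins.
ROUTES = [
    # US ZIP code, with optional +4 extension.
    (re.compile(r"^\d{5}(-\d{4})?$"), "zip_code_lookup"),
    # Very rough street address: number, one or more words, street suffix.
    (re.compile(r"^\d+\s+\w+(\s+\w+)*\s+(st|street|ave|avenue|rd|road|blvd)\.?$",
                re.IGNORECASE), "map"),
]

def guess_intent(query):
    """Guess what the user wants from the shape of the query string."""
    q = query.strip()
    for pattern, intent in ROUTES:
        if pattern.match(q):
            return intent
    return "general_search"

for q in ["20500", "1600 Pennsylvania Ave", "vacation policy"]:
    print(q, "->", guess_intent(q))
```

Anything that matches no pattern falls back to ordinary search, so a wrong guess costs little as long as the patterns stay conservative.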
Controlled Vocabularies
• Classification with a controlled vocabulary is the best way to ensure 100% Recall
• Lead-in synonyms: enter “fridge”; get “refrigerator” instead; best if the collection is well-cataloged; increases precision (e.g. in a library)
• Term-expansion synonyms: enter “refrigerator”; get “fridge” too; best if the collection is not well-cataloged; increases recall at the cost of precision (e.g. on eBay)
• Spell check on query phrases
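The difference between the two synonym strategies is easiest to see side by side. In this sketch the table names and function names are hypothetical, and the vocabulary contains only the slide's own fridge/refrigerator pair.

```python
# Lead-in: map a non-preferred term to the single preferred catalog term.
LEAD_IN = {"fridge": "refrigerator"}

# Term expansion: map a term to additional terms to search for as well.
EXPANSIONS = {"refrigerator": {"fridge"}}

def apply_lead_in(terms, lead_in):
    """Replace each non-preferred term with its preferred term."""
    return [lead_in.get(t, t) for t in terms]

def expand_terms(terms, expansions):
    """For each query term, return the set of terms to OR together."""
    return [{t} | expansions.get(t, set()) for t in terms]

print(apply_lead_in(["fridge", "magnets"], LEAD_IN))
print(expand_terms(["refrigerator", "magnets"], EXPANSIONS))
```

Lead-in keeps the query the same size but steers it to the term the catalogers used. Expansion makes the query broader, which finds more items but also lets in more noise.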
Why is search important?
IF: About half of all users prefer to search first*
THEN: What percentage of a content site’s development effort should be devoted to search?
* This statistic is highly context-dependent. People’s behavior depends on the context of their actions. The stat is from Jared Spool.