IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems
![Page 1: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/1.jpg)
IS530 Lesson 12
Boolean vs. Statistical Retrieval Systems
![Page 2: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/2.jpg)
Boolean or Statistical?
Most web search engines default to statistical retrieval and offer Boolean as an advanced option
Most proprietary online systems default to Boolean and offer statistical retrieval as an alternative
Statistical search engine vs. relevance ranking of Boolean results
![Page 3: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/3.jpg)
Web Search Engines
Databases generated by robotic (non-human) programs:
spiders, wanderers, web walkers, agents
Full-text indexing of website contents
Supports advanced, complex search strategies
![Page 4: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/4.jpg)
3 Parts of a Web Search Engine
1. Spider or web-crawler reads webpage, follows links
2. Index catalogs webpages read by spider
3. Search engine software matches queries
lists most relevant site first
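The three parts above can be sketched in a few lines of Python. This is only a toy illustration: the in-memory `WEB` dictionary, the `.example` URLs, and the count-of-matching-words ranking are all assumptions made for the example, not how any real engine works.

```python
from collections import defaultdict, deque

# A tiny in-memory "web": URL -> (page text, outgoing links). Hypothetical data.
WEB = {
    "a.example": ("boolean search basics", ["b.example", "c.example"]),
    "b.example": ("statistical search ranking", ["a.example"]),
    "c.example": ("ranking by relevance", []),
    "d.example": ("unlinked page about search", []),
}

def crawl(seed):
    """Part 1: the spider reads a page, then follows its links."""
    seen, queue = [], deque([seed])
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue
        seen.append(url)
        queue.extend(WEB[url][1])
    return seen

def build_index(urls):
    """Part 2: the index catalogs every word on the pages the spider read."""
    index = defaultdict(set)
    for url in urls:
        for word in WEB[url][0].split():
            index[word].add(url)
    return index

def search(index, query):
    """Part 3: match the query and list the best-matching page first."""
    hits = defaultdict(int)
    for word in query.split():
        for url in index.get(word, ()):
            hits[url] += 1
    # More matching query words first; ties broken alphabetically by URL.
    return sorted(hits, key=lambda u: (-hits[u], u))

pages = crawl("a.example")
index = build_index(pages)
print(pages)                            # ['a.example', 'b.example', 'c.example']
print(search(index, "search ranking"))  # ['b.example', 'a.example', 'c.example']
```

Note that `d.example` never appears in the results: no crawled page links to it, so the spider never reads it and the index never catalogs it.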
![Page 5: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/5.jpg)
3 Parts of an Online System
1) Database building software (dataware)
(follows rules with known fields)
2) Index/dictionary file
(list of all words and sometimes phrases in the indexed fields)
3) Search engine software
(matches queries; Boolean or statistical; LIFO or relevance order)
![Page 6: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/6.jpg)
Boolean Operators
AND: limits search, decreases hits, increases precision
OR: expands search, increases hits, decreases precision
NOT: limits search; seldom used because it is too strong
Proximity operators: Adj, (N)ear, (W)ith
limit a search, increase precision
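The effect of each Boolean operator on the hit count can be seen with Python sets over a toy inverted index. The terms and document IDs below are made up for illustration.

```python
# Toy inverted index: term -> set of document IDs (hypothetical data).
index = {
    "ice":  {1, 2, 3, 5},
    "snow": {2, 3, 4},
    "risk": {3, 5, 6},
}

and_hits = index["ice"] & index["snow"]   # AND: both terms required
or_hits = index["ice"] | index["snow"]    # OR: either term is enough
not_hits = index["ice"] - index["snow"]   # NOT: drop anything mentioning snow

print(sorted(and_hits))  # [2, 3]
print(sorted(or_hits))   # [1, 2, 3, 4, 5]
print(sorted(not_hits))  # [1, 5]
```

AND and NOT shrink the result set and OR grows it. NOT also discarded documents 2 and 3, which mention ice as well as snow, and that is why it is often considered too strong.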
![Page 7: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/7.jpg)
Command Interface Boolean Searching (Westlaw)
Find information about the assumption of risk involving people who fall after slipping in wintry conditions.
assum! /5 risk /p (ic* or snow****) /p (slip! or fell or fall***)
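As a rough sketch of what the first clause of that query asks for, the function below checks whether a word beginning with a given root (the `!` expander) occurs within n words of another term (the `/5` connector). The function name `within_n`, the tokenizer, and the sample sentence are assumptions for illustration; the actual system may count terms differently.

```python
import re

def within_n(text, root, term, n):
    """True if a word starting with `root` occurs within n words of `term`."""
    words = re.findall(r"[a-z]+", text.lower())
    root_pos = [i for i, w in enumerate(words) if w.startswith(root)]
    term_pos = [i for i, w in enumerate(words) if w == term]
    return any(abs(i - j) <= n for i in root_pos for j in term_pos)

doc = "The plaintiff assumed the risk of walking on an icy sidewalk."
print(within_n(doc, "assum", "risk", 5))  # True: two words apart
print(within_n(doc, "assum", "icy", 5))   # False: seven words apart
```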
![Page 8: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/8.jpg)
Natural Language and Relevance Ranking (WIN)
I need information on assumption of risk involving a person who has fallen on ice or snow.
![Page 9: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/9.jpg)
Non-Boolean Retrieval Systems
Statistical (associative, probabilistic, or relevance systems)
Linguistic (semantic)
![Page 10: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/10.jpg)
Statistical Retrieval Systems
Incorporate relevance ranking
May incorporate relevance feedback
May have natural language interface
Used by almost all web search engines
![Page 11: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/11.jpg)
Algorithm
Latin algorismus, after al-Khwarizmi,
Arabian mathematician (AD 825)
Step-by-step procedure for solving mathematical problems
(Merriam-Webster, http://www.m-w.com/)
Statistical search engines use weighting
algorithms to compute relevance
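One well-known family of weighting algorithms is TF-IDF: a term counts for more the more often it appears in a document and the fewer documents contain it. The sketch below is a minimal textbook version, not the algorithm of any particular engine; the three sample documents and the function name `rank_by_tfidf` are invented for the example.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "ice ice snow",
    "d2": "snow risk",
    "d3": "risk risk risk law",
}

def rank_by_tfidf(query, docs):
    """Return document IDs ordered from highest to lowest TF-IDF score."""
    n = len(docs)
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for w in tokenized.values() if term in w)
            if df == 0:
                continue
            idf = math.log(n / df)          # rarer terms get a larger weight
            score += counts[term] * idf     # term frequency times that weight
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_tfidf("ice snow", DOCS))  # ['d1', 'd2', 'd3']
```

Here `d1` ranks first because it contains the rare term "ice" twice, `d2` matches only the more common "snow", and `d3` matches neither.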
![Page 12: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/12.jpg)
Statistical Search Engines
Weighting algorithms are proprietary
Search engines differ in how they assign
weights and compute relevance ranking
Search results differ
studies found only about 40% overlap
![Page 13: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/13.jpg)
Statistical Web Retrieval Factors
Popularity: number of other sites that link to a site; authoritative sites given heavier weight
Meta-tags may boost ranking (Inktomi/Overture)
Direct Hit may boost ranking (HotBot)
![Page 14: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/14.jpg)
Linguistic Retrieval System
Natural Language & Relevance Ranking
WIN - (Westlaw Is Natural) has some elements
I need information on assumption of risk involving a person who has fallen on ice or snow.
![Page 15: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/15.jpg)
WIN Steps
1. Enter query in plain English
2. System removes stop phrases
3. Matches legal phrases from thesaurus, adjusts weighting
4. Removes stop words
![Page 16: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/16.jpg)
WIN Steps (cont.)
5. Stemming
6. Searches database indexes in OR relationship
7. Statistical comparison applied
8. Results placed in ranked order
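The eight steps can be walked through with a small Python sketch. Everything specific here is an assumption made for illustration: the stop-phrase list, the one-entry "thesaurus", the stop-word set, the crude suffix-stripping stemmer, the phrase weight of 2.0, and the sample documents. It shows the shape of the pipeline, not Westlaw's actual implementation.

```python
import re

STOP_PHRASES = ["i need information on"]   # hypothetical stop-phrase list
LEGAL_PHRASES = ["assumption of risk"]     # hypothetical thesaurus entries
STOP_WORDS = {"a", "an", "the", "of", "on", "or", "who", "has", "involving"}

def stem(word):
    # Crude suffix stripping, for illustration only.
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_query(query):
    """Steps 1-5: turn a plain-English query into weighted concepts."""
    text = query.lower().strip(" .")
    for phrase in STOP_PHRASES:                 # step 2: remove stop phrases
        text = text.replace(phrase, " ")
    concepts = {}
    for phrase in LEGAL_PHRASES:                # step 3: match legal phrases
        if phrase in text:
            concepts[phrase] = 2.0              # heavier weight (assumed value)
            text = text.replace(phrase, " ")
    for word in re.findall(r"[a-z]+", text):
        if word not in STOP_WORDS:              # step 4: remove stop words
            concepts[stem(word)] = 1.0          # step 5: stemming
    return concepts

def rank(concepts, docs):
    """Steps 6-8: OR search, score each hit, return hits best first."""
    scores = {}
    for doc_id, text in docs.items():
        lowered = text.lower()
        words = {stem(w) for w in re.findall(r"[a-z]+", lowered)}
        score = sum(
            weight
            for concept, weight in concepts.items()
            if concept in words or (" " in concept and concept in lowered)
        )
        if score > 0:                           # OR: any one concept is enough
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

QUERY = ("I need information on assumption of risk involving "
         "a person who has fallen on ice or snow.")
DOCS = {
    "a": "Assumption of risk when a skier fell on snow.",
    "b": "Ice storms closed the roads.",
    "c": "Contract law basics.",
}
concepts = parse_query(QUERY)
print(sorted(concepts))      # ['assumption of risk', 'fall', 'ice', 'person', 'snow']
print(rank(concepts, DOCS))  # ['a', 'b']
```

Document `b` is retrieved even though it matches only "ice", because the concepts are combined with OR; ranking, not exclusion, is what pushes weak matches down the list.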
![Page 17: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/17.jpg)
Factors in Determining Relevance
Proximity of query words to each other
Position of query words: keywords in the title rank higher, as do keywords in a headline or near the top
Relative length of document (“normalization”)
Stemming
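Length normalization can be illustrated with a small Python sketch. Dividing a term's count by the document's length is only one simple normalization scheme, assumed here for illustration; the function name and sample texts are made up.

```python
def normalized_tf(term, text):
    """Share of the document's words that are `term` (0.0 for empty text)."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

short_doc = "ice on the sidewalk"
long_doc = "ice " + "and other unrelated words " * 10 + "ice"

# Raw counts favor the long document (2 occurrences versus 1) ...
print(short_doc.split().count("ice"), long_doc.split().count("ice"))  # 1 2

# ... but after normalization the short, focused document scores higher.
print(normalized_tf("ice", short_doc) > normalized_tf("ice", long_doc))  # True
```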
![Page 18: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/18.jpg)
Factors in Determining Relevance (cont.)
Ignore very frequent terms
Inverse document frequency (rare terms weigh more)
Relevance feedback
Stop words
Query expansion/thesaurus
![Page 19: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/19.jpg)
Features Users Can Control
Designating “bound phrases”
Flagging terms that must be present*
Specifying truncat?
Indicating (synonym groups)
Synonym dictionaries
![Page 20: IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems](https://reader036.fdocuments.us/reader036/viewer/2022083006/56813df3550346895da7cf79/html5/thumbnails/20.jpg)
Web Sites that list search engines and features:
www.pandia.com
www.searchenginewatch.com
http://notess.com