Search Engines Session 5 INST 301 Introduction to Information Science.
-
Upload
aubrie-mcbride -
Category
Documents
-
view
214 -
download
0
description
Transcript of Search Engines Session 5 INST 301 Introduction to Information Science.
![Page 1: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/1.jpg)
Search Engines
Session 5INST 301
Introduction to Information Science
![Page 2: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/2.jpg)
Washington Post (2007)
![Page 3: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/3.jpg)
so what is a
Search Engine?
![Page 4: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/4.jpg)
![Page 5: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/5.jpg)
![Page 6: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/6.jpg)
the cat food
cats eat canned food. the cat food is not good for dogs.
Query
D1Natural
organic cat food available at petco.com
D2
![Page 7: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/7.jpg)
Find all the brown boxesNo Structure and No Index
![Page 8: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/8.jpg)
How about here • This is what indexing does
• Makes data accessible in a structured format, easily accessible through search.
![Page 9: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/9.jpg)
Term – Document Index Matrix
1: cats eat canned food. the cat food is not good for dogs. 2: natural organic cat food available at petco.com
Documents:
TERM D1 D2
Building Index
available 0 1canned 1 0
cat 2 1dog 1 0eat ? ?
food ? ?… … …
![Page 10: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/10.jpg)
the cat food
cats eat canned food. the cat food is not good for dogs.
Query
D1the the
the the the the
D3
Some terms are more informative than others
Natural organic cat
food available at petco.com
D2
![Page 11: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/11.jpg)
How Specific is a Term?
TERM (t) Document Frequency of term t (dft )
Inverse Document Frequency of term t
(idft) = (N/dft )
Log of Inverse Document Frequency of term t [log(idft)]
cat 1 1,000,000
petco.com 100 10,000
food 1000 1000
canned 10,000 100
good 100,000 10
the 1,000,000 1
![Page 12: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/12.jpg)
How Specific is a Term?
TERM (t) Document Frequency of term t (dft )
Inverse Document Frequency of term t
(idft) = (N/dft )
Log of Inverse Document Frequency of term t [log(idft)]
cat 1 1,000,000
petco.com 100 10,000
food 1000 1000
canned 10,000 100
good 100,000 10
the 1,000,000 1
Magnitude of increase
![Page 13: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/13.jpg)
TERM (t) Document Frequency of term t (dft )
Inverse Document Frequency of term t
(idft) = (N/dft )
Log of Inverse Document Frequency of term t [log(idft)]
cat 1 1,000,000 6
petco.com 100 10,000 4
food 1000 1000 3
canned 10,000 100 2
good 100,000 10 1
the 1,000,000 1 0
How Specific is a Term?
![Page 14: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/14.jpg)
Putting it all together
• To rank, we obtain the weight for each term using tf-idf
• The tf-idf weight of a term is the product of its tf weight and its idf weight
Weight (t) = tft × log(N /dft)
• Using the term weights, we obtain the document weight
![Page 15: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/15.jpg)
![Page 16: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/16.jpg)
• A type of “document expansion”– Terms near links describe content of the target
• Works even when you can’t index content– Image retrieval, uncrawled links, …
Finding based on MetaData or Description
![Page 17: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/17.jpg)
Ways of Finding Information• Searching content
– Characterize documents by the words the contain
• Searching behavior– Find similar search patterns– Find items that cause similar reactions
• Searching description– Anchor text
![Page 18: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/18.jpg)
Crawling the Web
![Page 19: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/19.jpg)
Web Crawl Challenges• Adversary behavior
– “Crawler traps”
• Duplicate and near-duplicate content– 30-40% of total content– Check if the content is already index– Skip document that do not provide new information
• Network instability– Temporary server interruptions– Server and network loads
• Dynamic content generation
![Page 20: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/20.jpg)
How does Google PageRank work? Objective - estimate the importance of a webpage
• Inlinks are “good” (like recommendations)
P1
Px
Py
P2
Pa
Pk
PiPj
• Inlinks from a “good” site are better than inlinks from a “bad” site
![Page 21: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/21.jpg)
Link Structure of the WebNature 405, 113 (11 May 2000) | doi:10.1038/35012155
![Page 22: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/22.jpg)
So, A Web search engine is an application composed of ;
SEARCH component
INDEXING component - of importance to developers AND content-centric
- of importance to the users AND user-centric
CRAWLING component - important to define a search space
![Page 23: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/23.jpg)
Today: The “Search Engine”
SourceSelection
Search
Query
Selection
Ranked List
Examination
Document
Delivery
Document
QueryFormulation
IR System
Indexing Index
Acquisition Collection
![Page 24: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/24.jpg)
Next Session: “The Search”
SourceSelection
Search
Query
Selection
Ranked List
Examination
Document
Delivery
Document
QueryFormulation
IR System
Indexing Index
Acquisition Collection
![Page 25: Search Engines Session 5 INST 301 Introduction to Information Science.](https://reader035.fdocuments.us/reader035/viewer/2022062911/5a4d1bd47f8b9ab0599d98fa/html5/thumbnails/25.jpg)
Before You Go
• Assignment H2
On a sheet of paper, answer the following (ungraded) question (no names, please):
What was the muddiest point in today’s class?