Post on 23-Sep-2020
Search Engines and Google
Francisco Velázquez
3. Nov. 2010
Motivation
• Human-maintained lists are subjective, expensive to build and maintain, slow to improve and cannot cover all esoteric topics.
• Automated search engines that rely on keyword matching return low-quality matches.
• Advertisers mislead automated search engines.
• Search engines must scale to keep up with the growth of the WWW.
Content
• 3 Tier Framework
• Components of a search engine
• Crawler
• PageRank
• Indices
• Map-Reduce Parallelism Framework
• Finding Similar Pages
• Jaccard Measure of Similarity
• Minhashing
• Locality-Sensitive Hashing
The components of a search engine
Crawler
• A process that downloads web pages to a Page Repository.
• Examines pages for links to other pages and inserts the ones that are not yet in the Page Repository into the set of pages to be crawled.
http://goo.gl/gG3s
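The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the function name `crawl`, the `fetch_links` callback, the `max_pages` limit and the toy in-memory "web" are all assumptions made for the example, and no real HTTP requests are issued.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs stored in the repository, in order."""
    repository = []                 # pages already downloaded
    seen = set(seed_urls)           # URLs in the repository or already queued
    frontier = deque(seed_urls)     # set of pages still to be crawled
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)      # "download" the page into the repository
        for link in fetch_links(url):
            if link not in seen:    # skip pages we already know about
                seen.add(link)
                frontier.append(link)
    return repository

# Toy in-memory "web" standing in for real HTTP fetches (hypothetical data).
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": ["d.example"],
    "d.example": [],
}
pages = crawl(["a.example"], lambda url: toy_web.get(url, []))
print(pages)
```

The `max_pages` cap is one simple way to guarantee termination; a per-site depth limit would be tracked alongside each queued URL.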
Crawler
Challenge | Description | Solution
Terminating search | Dynamically generated pages could create an endless loop | Limit the number of pages to crawl with a “depth” limit per site
Managing the repository | 1. Duplicate URLs to be crawled; 2. Duplicate pages due to mirror sites, different routes, plagiarism, etc. | 1. An efficient index for checking stored pages; 2. Minhash and locality-sensitive hashing signatures
Selecting the next page | How to prioritise the next page to be crawled? | Give priority to “important” pages
Speeding up the crawl | 1. How many processes should run simultaneously? 2. How to synchronise them so that they do not crawl the same site? 3. Avoid a DoS attack on crawled sites | 1. Scale to several machines; run several processes on a single machine because of idle states; 2. Assign processes to entire hosts or sites; 3. Do not issue frequent requests to a single site
Query Processing in Search Engines
• Search engine queries are not like SQL queries
• They require inverted indices
• Disk access is too expensive to offer the user an acceptable response time
• Matched records are ranked before being shown to the user
PageRank
• Algorithm for identifying “important” pages
• A Web page is important if many important pages link to it
http://goo.gl/gKsQ
http://goo.gl/CsuN
Recursive Formulation of Page Rank
The Web in 1839 [figure: link graph over Yahoo!, Amazon and Microsoft]
Transition Matrix (rows and columns ordered Yahoo!, Amazon, Microsoft)
\[
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\]
The matrix M, the transition matrix of the Web, has element m_ij in row i and column j, where
1. m_ij = 1/r if page j has a link to page i, and there are a total of r ≥ 1 pages that j links to
2. m_ij = 0 otherwise
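As a sketch of how the definition above turns into code, the following builds M from an outgoing-link list. The helper name `transition_matrix` and the `links` dictionary are assumptions for the example; the link structure is read off the columns of the matrix on this slide.

```python
from fractions import Fraction

def transition_matrix(links, pages):
    """Column-stochastic matrix: M[i][j] = 1/r if page j links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        r = len(targets)            # number of pages that `source` links to
        for target in targets:
            M[index[target]][index[source]] = Fraction(1, r)
    return M

pages = ["Yahoo!", "Amazon", "Microsoft"]
# Outgoing links, inferred from the columns of the matrix above (assumption).
links = {
    "Yahoo!": ["Yahoo!", "Amazon"],
    "Amazon": ["Yahoo!", "Microsoft"],
    "Microsoft": ["Amazon"],
}
M = transition_matrix(links, pages)
for row in M:
    print([str(x) for x in row])
```

A page with no outgoing links simply leaves its column at zero, which is the dead-end case.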
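Suppose y, a, and m represent the PageRanks of Yahoo!, Amazon, and Microsoft, that is, the fractions of the time the random walker spends at each page:
\[
\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}
\]
Starting from 1/3 for each page, the first two iterations give
\[
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},
\qquad
\begin{pmatrix} 5/12 \\ 4/12 \\ 3/12 \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
\]
After repeating the process several times (components ordered Yahoo!, Amazon, Microsoft):
\[
\begin{pmatrix} 9/24 \\ 11/24 \\ 4/24 \end{pmatrix},\;
\begin{pmatrix} 20/48 \\ 17/48 \\ 11/48 \end{pmatrix},\; \dots,\;
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
\]
Because y, a, and m are probabilities, y + a + m = 1, which suggests the uniform starting vector (1/3, 1/3, 1/3).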
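The iteration on this slide can be reproduced with exact fractions. The sketch below assumes the three-page matrix M from the slides and a uniform start of 1/3 per page; the helper name `step` and the choice of 100 iterations are illustrative.

```python
from fractions import Fraction

# Transition matrix from the slides (rows/columns: Yahoo!, Amazon, Microsoft).
M = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
     [Fraction(1, 2), Fraction(0),    Fraction(1)],
     [Fraction(0),    Fraction(1, 2), Fraction(0)]]

def step(M, p):
    """One step of the random walk: returns the matrix-vector product M * p."""
    return [sum(m_ij * p_j for m_ij, p_j in zip(row, p)) for row in M]

p = [Fraction(1, 3)] * 3            # start with the walker equally likely anywhere
history = []
for _ in range(100):
    p = step(M, p)
    history.append(p)

print([str(x) for x in history[0]])  # first iterate
print([str(x) for x in history[1]])  # second iterate
print([round(float(x), 6) for x in p])
```

The first two iterates match the slide (2/6, 3/6, 1/6 and 5/12, 4/12, 3/12 in lowest terms), and the final vector approaches (2/5, 2/5, 1/5).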
Spider Traps and Dead Ends
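• Microsoft becomes a spider trap (it links only to itself). Resulting PageRanks: Yahoo! = 0, Amazon = 0, Microsoft = 1
• Microsoft becomes a dead end (it has no outgoing links). Resulting PageRanks: Yahoo! = 0, Amazon = 0, Microsoft = 0
[Figures: the two modified link graphs over Yahoo!, Amazon and Microsoft]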
Spider traps and dead ends solution
• Limit the time that the random walker is allowed to wander at random
• Pick a constant β<1, typically in the range 0.8 to 0.9.
• Taxation rate: 1-β
• If the walker gets stuck in a spider trap, it will disappear and be replaced by a new walker after a few time steps
• If the walker reaches a dead end and disappears, a new walker will take over shortly
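Microsoft becomes a spider trap [figure: link graph over Yahoo!, Amazon and Microsoft]
\[
P_{\text{new}} = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} P_{\text{old}} + 0.2 \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
\]
After several iterations: Yahoo! = 7/33, Amazon = 5/33, Microsoft = 21/33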
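A short numerical check of the taxed iteration, assuming β = 0.8 and the spider-trap matrix shown on this slide. The function name `taxed_pagerank` and the iteration count are choices made for this sketch.

```python
def taxed_pagerank(M, beta=0.8, iterations=100):
    """Iterate p <- beta * M * p + (1 - beta) * (1/n, ..., 1/n)."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [beta * sum(M[i][j] * p[j] for j in range(n)) + (1 - beta) / n
             for i in range(n)]
    return p

# Microsoft links only to itself (rows/columns: Yahoo!, Amazon, Microsoft).
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

ranks = taxed_pagerank(M_trap)
print([round(x * 33, 3) for x in ranks])   # in units of 1/33
```

Without taxation the trap would absorb all of the PageRank; with the 20% tax it still gets the largest share, but Yahoo! and Amazon keep non-zero values.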
Teleport Sets
• A selected set of nodes that the random walker teleports to
• Eliminates spam and pages that are not relevant to the search topic
• Nodes are selected from trusted open directories, keywords in pages on a topic, users’ bookmarks, recently searched keywords, etc.
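The Web in 1839 [figure: link graph over Yahoo!, Amazon and Microsoft]
\[
\begin{pmatrix} y \\ a \\ m \end{pmatrix} = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix} + 0.2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
\]
After several iterations: Yahoo! = 10/31, Amazon = 15/31, Microsoft = 6/31
\[
P_{\text{new}} = \beta M P_{\text{old}} + (1-\beta)\, t
\]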
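As a worked check of the values on this slide, the fixed point can be solved by hand, assuming β = 0.8 and the teleport vector t = (0, 1, 0), i.e. a teleport set containing only Amazon. Writing the equation row by row:

\[
\begin{aligned}
y &= 0.8\left(\tfrac{1}{2}y + \tfrac{1}{2}a\right) &&\Rightarrow\; 0.6\,y = 0.4\,a \;\Rightarrow\; y = \tfrac{2}{3}a \\
m &= 0.8\cdot\tfrac{1}{2}a &&\Rightarrow\; m = \tfrac{2}{5}a \\
a &= 0.8\left(\tfrac{1}{2}y + m\right) + 0.2 &&\Rightarrow\; a = 0.8\left(\tfrac{1}{3}a + \tfrac{2}{5}a\right) + \tfrac{1}{5} = \tfrac{44}{75}a + \tfrac{1}{5}
\end{aligned}
\]

so \(\tfrac{31}{75}a = \tfrac{1}{5}\), giving \(a = \tfrac{15}{31}\), \(y = \tfrac{10}{31}\) and \(m = \tfrac{6}{31}\), which sum to 1 and agree with the iterated result.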
Link Spam
• Spam farming in order to accumulate and concentrate PageRank on a few pages
• Links to the spam farm from publicly accessible blogs, with messages like “I agree with you. See x1234.mySpam.Farm.com”
[Figure: spam farm structure around page S, with links from outside]
Link Spam Solution
• Compute the TrustRank of pages
• TrustRank: Topic-specific PageRank computed with a Teleport set consisting of only “trusted” pages
• Collect trusted pages manually
• Use teleport sets of serious pages, such as universities
• Compute the difference between the PageRank and TrustRank for each page. This difference is the negative TrustRank
Indices
Documents with ids 0, 1, 2
0: the cat is fat
1: was raining cats and dogs
2: Fido the dog
Inverted Index
and 1
cat 0, 1
dog 1, 2
fat 0
fido 2
is 0
raining 1
the 0, 2
was 1
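The index above can be rebuilt from the three documents with a few lines of code. This is a sketch under stated assumptions: the `normalize` helper lowercases words and strips a trailing "s" as a crude stand-in for stemming (so that "cats" and "dogs" map to "cat" and "dog" as on the slide); a real engine would use a proper stemmer.

```python
import re
from collections import defaultdict

def normalize(word):
    """Lowercase and strip a trailing 's' (a crude stand-in for stemming)."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        terms = {normalize(w) for w in re.findall(r"[A-Za-z]+", text)}
        for term in terms:
            index[term].append(doc_id)   # doc ids arrive in increasing order
    return dict(sorted(index.items()))

documents = ["the cat is fat", "was raining cats and dogs", "Fido the dog"]
index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

A query for a single word is then a dictionary lookup, and a conjunctive query is an intersection of the returned id lists.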
Inverted Indices
• Essential for Web Queries
• Uses indirect buckets for space efficiency
[Figure: the inverted-index entries for “cat” and “dog” point to buckets, which in turn point to the documents “... the cat is fat ...”, “... was raining cats and dogs ...” and “... Fido the dog ...”]
Storing more information in the inverted index
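[Figure: bucket entries for the terms “cat” and “dog”. Each entry records a type, a position and a pointer to a document (Doc 1, Doc 2, Doc 3). Entries shown (type, position): title 5, header 10, anchor 3, text 57, title 100, title 12. Document text shown: “Dogs compared with cats”]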
Map-Reduce Parallelism Framework
• Large-scale parallel machines share high load operations such as joins
• Distributed architectures
• Grid, networks and corporate DBs
• The Map-Reduce Parallelism (MRP) paradigm expresses large-scale computations
Input key-value pairs → Map → sort intermediate key-value pairs by key → Reduce → output lists
Figure: Execution of map and reduce functions
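The map, sort and reduce stages can be imitated on a single machine to show the data flow. The sketch below is not a distributed implementation; the names `map_reduce`, `map_words` and `reduce_counts`, and the word-count task itself, are assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: each input key-value pair yields intermediate key-value pairs.
    intermediate = [pair for key, value in inputs for pair in map_fn(key, value)]
    # Sort intermediate pairs by key so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce: one call per distinct key, with the list of that key's values.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

def map_words(doc_id, text):
    """Emit (word, 1) for every word occurrence in the document."""
    return [(word, 1) for word in text.lower().split()]

def reduce_counts(word, counts):
    """Sum the occurrences collected for one word."""
    return (word, sum(counts))

inputs = [(0, "the cat is fat"), (2, "Fido the dog")]
counts = map_reduce(inputs, map_words, reduce_counts)
print(counts)
```

In a real framework the map calls and the reduce calls would run in parallel on different machines; only the sort/group step requires moving data between them.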
Jaccard Measure of Similarity
• Finding Similar Items
• Jaccard similarity is the ratio of the sizes of the intersection and the union of the sets S and T.
|S⋂T|/|S⋃T|
{1,2,3} and {1,3,4,5} have ratio 2/5
• A k-gram, or k-shingle, is a substring of length k of a document; a document is represented by its set of k-shingles.
For k = 3, “A number of …” yields “A n”, “ nu”, “num”, and so on.
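Both definitions are short enough to state directly in code. In this sketch the function names `jaccard` and `shingles` are illustrative, and returning 1.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(s, t):
    """|S ∩ T| / |S ∪ T|; two empty sets are treated as identical (assumption)."""
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 1.0

def shingles(text, k):
    """Set of all substrings of length k of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(jaccard({1, 2, 3}, {1, 3, 4, 5}))
print(sorted(shingles("A number", 3)))
```

Two documents can then be compared by taking the Jaccard similarity of their shingle sets.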
Minhashing
• It is a technique to form a short signature for each set
• Estimates the Jaccard similarity using signatures
• A minhash value of a set S is the first element, in a randomly permuted order of the universal set, that is a member of S
• Example: the universal set of elements is {1,2,3,4,5} and a permuted order is (3,5,4,2,1). Then the minhash value for the set {2,3,5} is 3.
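The example above, and the way signatures estimate Jaccard similarity, can be sketched with explicit permutations. The names `minhash`, `signature` and `estimate_similarity`, the fixed random seed and the choice of 200 permutations are assumptions for this illustration; practical systems use hash functions instead of materialised permutations.

```python
import random

def minhash(s, permutation):
    """First element of the permuted universal set that is a member of s."""
    for element in permutation:
        if element in s:
            return element
    raise ValueError("set shares no element with the universal set")

def signature(s, permutations):
    """One minhash value per permutation."""
    return [minhash(s, p) for p in permutations]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature components on which the two sets agree."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b))
    return agree / len(sig_a)

# The example from the slide.
print(minhash({2, 3, 5}, [3, 5, 4, 2, 1]))

rng = random.Random(0)              # fixed seed, only so the sketch is repeatable
universe = list(range(1, 6))
permutations = [rng.sample(universe, len(universe)) for _ in range(200)]
sig_s = signature({2, 3, 5}, permutations)
sig_t = signature({3, 5}, permutations)
estimate = estimate_similarity(sig_s, sig_t)
print(estimate)                     # an estimate of the true similarity 2/3
```

For a random permutation, two sets get the same minhash value with probability equal to their Jaccard similarity, which is why the fraction of agreeing components is an estimate of it.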
Locality-Sensitive Hashing (LSH)
• Minhashing is fast but there are still too many pairs of sets to compare
• LSH hashes sets to buckets so that “similar” elements are assigned to the same bucket
• Trades off the number of buckets (constrained by memory) against the chance of missing a pair of similar elements
[Figure: n signatures divided into b bands of r rows each; each band is hashed to buckets]
Dividing signatures into bands and hashing based on the values in a band
[Figure: plot of the probability of at least one bucket in common against similarity s, both ranging from 0 to 1, with the threshold at]
\[
s = (1/b)^{1/r}
\]
The probability that a pair of signatures will appear together in at least one bucket
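The curve in the figure follows from a standard result that is not spelled out on the slide: with b bands of r rows, two sets of similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates that expression and the threshold (1/b)^(1/r); the function names and the sample values r = 5, b = 20 are assumptions for illustration.

```python
def candidate_probability(s, r, b):
    """Probability that two signatures agree on all r rows of at least one of b bands."""
    return 1 - (1 - s ** r) ** b

def threshold(r, b):
    """Approximate similarity at which the curve rises most steeply."""
    return (1 / b) ** (1 / r)

r, b = 5, 20
print("threshold:", round(threshold(r, b), 4))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r, b), 4))
```

Increasing r pushes the threshold up (fewer false candidates), while increasing b pulls it down (fewer missed pairs), which is the trade-off the slide describes.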
Combining Minhashing and LSH
1. Compute minhash signatures, using as many hash functions as the desired accuracy requires
2. Perform LSH to get candidate pairs of signatures that hash to the same bucket for at least one band
3. For each candidate pair, compute the estimate of their Jaccard similarity by counting the number of components in which their signatures agree
4. Optionally, for each pair whose signatures are sufficiently similar, compute their true Jaccard similarity by examining the sets themselves
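The four steps can be strung together as follows. This is a sketch under several assumptions: set elements are already integers (for documents they would be hashed shingles), minhash permutations are simulated by random linear hash functions modulo a prime, and the names `make_hash_functions`, `minhash_signature` and `lsh_candidates` as well as the toy data are invented for the example.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_hash_functions(n, seed=0, prime=2_147_483_647):
    """n random linear hash functions x -> (a*x + b) mod prime."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in coeffs]

def minhash_signature(items, hash_functions):
    """Step 1: one minimum hash value per hash function."""
    return [min(h(x) for x in items) for h in hash_functions]

def lsh_candidates(signatures, bands, rows):
    """Step 2: pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

def jaccard(s, t):
    return len(s & t) / len(s | t)

sets = {
    "A": set(range(0, 100)),
    "B": set(range(0, 100)),        # exact mirror of A
    "C": set(range(1000, 1100)),    # disjoint from A and B
}
bands, rows = 10, 4
hashes = make_hash_functions(bands * rows)
signatures = {name: minhash_signature(s, hashes) for name, s in sets.items()}
candidates = lsh_candidates(signatures, bands, rows)

for x, y in sorted(candidates):
    # Step 3: estimate from signatures; step 4: verify on the sets themselves.
    agree = sum(p == q for p, q in zip(signatures[x], signatures[y]))
    print(x, y, "estimate:", agree / (bands * rows), "true:", jaccard(sets[x], sets[y]))
```

Only the mirror pair A, B becomes a candidate, so the expensive exact comparison in step 4 is run once instead of three times.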
Google Apps
Anatomy of a Google Search
• Uses links, PageRank, anchors, proximity and visual presentation (e.g. bold text is weighted higher) in search logic.
1. Search the index
2. Analyze the web pages for relevance
3. Evaluate the site’s reputation
4. Rank the web pages
Google’s System Anatomy
http://goo.gl/yYbb
Google particularities
• PageRank
• Anchor text
• Location information and use of proximity in search
• Visual presentations such as font, capitalization and size of words are weighted differently
References
• The Anatomy of a Large-Scale Hypertextual Web Search Engine
http://infolab.stanford.edu/~backrub/google.html
• Database Systems. The Complete Book. Second Edition. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom