Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Author: Monika Henzinger
Presenter: Chao Yan
Overview
Two near-duplicate detection algorithms (Broder's and Charikar's) are compared at very large scale (1.6 billion distinct web pages).
Goal: understand the pros and cons of each algorithm in different situations.
Goal: find a new approach that detects near-duplicates more accurately.
Finding Near-Duplicates in a Large Scale (3/28/2013)
Relation to course material
Discusses the two algorithms introduced in lecture in more detail, and draws conclusions by comparing their experimental results.
Broder's algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the minvalue vector.
Both algorithms follow the general paradigm for finding near-duplicates: generate a signature for each document, then compare signatures.
Broder's Algorithm
Begin by preprocessing each document: strip HTML tags and URLs (this preprocessing is also used for Charikar's algorithm).
Apply m fingerprint functions to the shingle sequence, and take the minvalue under each of the m functions, yielding m minvalues.
Broder's Algorithm
Divide the m minvalues into m' groups of l elements each (e.g. m = 84, m' = 6, l = 14).
Concatenate the minvalues in each group, reducing the vector from m entries to m' entries.
Fingerprint each of the m' concatenations to generate an m'-dimensional vector of supershingles.
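The construction above can be sketched as follows. This is a minimal illustration, not the paper's implementation: production systems use 64-bit Rabin fingerprints, whereas the function `h` here is a hypothetical SHA-256-based stand-in, and the shingle size k = 8 is an assumed parameter.

```python
import hashlib

def shingles(tokens, k=8):
    """All contiguous k-token shingles of a token sequence."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def h(seed, item):
    """Hypothetical stand-in for the paper's fingerprint functions."""
    data = repr((seed, item)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def supershingles(tokens, m=84, m_prime=6, k=8):
    """m minvalues, grouped into m' groups of l and re-fingerprinted."""
    l = m // m_prime  # group size, e.g. 84 / 6 = 14
    # One minvalue per fingerprint function over all shingles.
    mins = [min(h(i, s) for s in shingles(tokens, k)) for i in range(m)]
    # Concatenate each group of l minvalues and fingerprint the result.
    return [h(-1, tuple(mins[g * l:(g + 1) * l])) for g in range(m_prime)]
```

Two pages sharing most of their shingles agree on many minvalues, so with high probability some of their supershingles coincide.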
B-Similarity
Definition: the number of identical entries in the supershingle vectors of two pages.
Two pages are near-duplicates iff their B-similarity is at least 2.
e.g. with m' = 6, pairs agreeing in at least 2 of the 6 entries are near-duplicates.
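The definition translates directly to code; here toy integer entries stand in for real supershingle fingerprints, and the function names are illustrative.

```python
def b_similarity(u, v):
    """Number of positions where two supershingle vectors agree."""
    return sum(1 for a, b in zip(u, v) if a == b)

def is_near_duplicate(u, v, threshold=2):
    """Pages are near-duplicates iff B-similarity is at least 2."""
    return b_similarity(u, v) >= threshold
```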
Charikar's algorithm
Extract a set of features (meaningful tokens) from each web page; each feature is tagged with a weight.
Project each feature (token) to a b-dimensional vector whose entries take values in {-1, 1}.
Charikar's algorithm
Sum the b-dimensional projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector.
Generate the final b-dimensional bit vector by setting each positive entry to 1 and each non-positive entry to 0.
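The two steps above can be sketched as follows. This is a minimal illustration with an assumed b = 64 and a SHA-256-based stand-in for the random projections; the paper's setting uses b = 384.

```python
import hashlib

def projection(feature, b):
    """Deterministic pseudo-random {-1, +1}^b vector for a feature."""
    bits = int.from_bytes(hashlib.sha256(feature.encode()).digest(), "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(b)]

def simhash(weighted_features, b=64):
    """Weighted sum of feature projections, thresholded into bits."""
    sums = [0] * b
    for feature, weight in weighted_features.items():
        for i, s in enumerate(projection(feature, b)):
            sums[i] += weight * s
    # Positive entries become 1, non-positive entries become 0.
    return [1 if x > 0 else 0 for x in sums]
```

Heavily weighted features dominate the signed sums, so pages sharing their important tokens end up with signatures that agree on most bits.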
C-Similarity
Definition: the C-similarity of two pages is the number of bits on which their final projections agree.
Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold.
e.g. b = 384, threshold = 372.
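Sketched on 0/1 bit vectors, with the paper's b = 384 and threshold 372 as defaults; the comparison assumes "at least threshold" agreement, and the names are illustrative.

```python
def c_similarity(u, v):
    """Number of bit positions on which two b-bit signatures agree."""
    return sum(1 for a, b in zip(u, v) if a == b)

def is_near_duplicate(u, v, threshold=372):
    """Paper's setting: b = 384 bits, at least 372 must agree."""
    return c_similarity(u, v) >= threshold
```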
Comparison of two algorithms
Broder's algorithm: considers the order of the token sequence; ignores the frequency of shingles; running time O(Tm + Dm') = O(Tm).
Charikar's algorithm: ignores the order of the token sequence; considers the frequency of terms; running time O(Tb).
Note: T is the total number of tokens in all web pages; D is the number of web pages.
Comparison of experimental results
Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair.
A node is considered a near-duplicate page iff it is incident to at least one edge.
B-similarity graph: 27.4M near-duplicate pages out of 1.6B; average degree 135.
C-similarity graph: 35.5M near-duplicate pages out of 1.6B; average degree 92.
Comparison of experimental results
[Figure: distribution of node degree in the B-similarity and C-similarity graphs, log-log scale]
Comparison of experimental results: precision measurement
Precision for pairs from the same site is low because pages on the same site very often share boilerplate text and differ only in the main item in the center of the page.
Total precision: Broder's 0.38, Charikar's 0.50.
Precision on same-site pairs: Broder's 0.34, Charikar's 0.36.
Precision on different-site pairs: Broder's 0.86, Charikar's 0.90.
Comparison of experimental results: term differences
Broder's algorithm: average 24, median 11; 21% of pairs have a term difference of 2; 90% have a term difference below 42.
Charikar's algorithm: average 94, median 7; 24% of pairs have a term difference of 2; 90% have a term difference below 44.
Comparison of experimental results
[Figure: distribution of term differences for Broder's and Charikar's algorithms]
Comparison of experimental results: error cases
Broder's error case: an NIH database page and a Herefordshire database page on the web. The pages differ in about 20 consecutive tokens out of 1000-2000 tokens and are dominated by a large amount of boilerplate text. Charikar's algorithm handles this case because it ignores token order: the number of differing tokens is large enough to be detected.
Charikar's error case: pages on http://www.businessline.co.uk/, a UK business directory. The pages differ in only 1-5 non-consecutive tokens out of about 1000 tokens but share a large number of common tokens, though in different order. Broder's algorithm handles this case because the dispersed differing tokens generate a considerable number of distinct shingles.
A combined algorithm
Use Broder's algorithm to compute all B-similar pairs first; then use Charikar's algorithm to filter out those pairs whose C-similarity falls below a certain threshold.
Rationale: the false positives of Broder's algorithm (consecutive term differences in pages with a large amount of boilerplate text) can be filtered out by Charikar's algorithm.
Overall precision improves to 0.79.
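The combined pipeline can be sketched as follows, assuming signatures are precomputed. This is a brute-force illustration: in practice, B-similar candidate pairs are found by sorting and indexing supershingles rather than by comparing all pairs, and all names and thresholds here are taken from the slides or are illustrative.

```python
from itertools import combinations

def b_similarity(u, v):
    """Agreeing entries of two supershingle vectors."""
    return sum(1 for a, b in zip(u, v) if a == b)

def c_similarity(u, v):
    """Agreeing bits of two simhash bit vectors."""
    return sum(1 for a, b in zip(u, v) if a == b)

def combined_near_duplicates(pages, b_threshold=2, c_threshold=372):
    """pages: {page_id: (supershingle_vector, simhash_bits)}."""
    result = []
    for p, q in combinations(pages, 2):
        (sp, cp), (sq, cq) = pages[p], pages[q]
        # Step 1: Broder's algorithm proposes B-similar candidate pairs.
        if b_similarity(sp, sq) >= b_threshold:
            # Step 2: keep only pairs whose C-similarity is high enough.
            if c_similarity(cp, cq) >= c_threshold:
                result.append((p, q))
    return result
```

Because the C-similarity check runs only on the candidate pairs, the simhash signatures can be computed on the fly for those pairs instead of being stored for every page.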
Pros
The experiments are persuasive and reliable for establishing the pros and cons of the two algorithms: large data sample, human evaluation, and error-case analysis.
The combined approach inherits the advantages of both algorithms, avoiding a large number of false positives.
In the combined approach, Charikar's algorithm is computed on the fly, which saves considerable space.
Cons
The experiments focus on the precision of the two algorithms but collect no statistics on recall.
The combined algorithm incurs extra running time, because finding a near-duplicate pair requires running both algorithms.
Improvement
Consider token order in Charikar's algorithm by using shingles as features.
Consider token frequency in Broder's algorithm by weighting shingles according to their frequency.