Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Author: Monika Henzinger
Presenter: Chao Yan
Overview
Two near-duplicate detection algorithms (Broder's and Charikar's) are compared at very large scale (1.6 billion distinct web pages).
Goal: understand the pros and cons of each algorithm in different situations.
Goal: find a new approach that detects near-duplicates more accurately.
Finding Near-Duplicates in a Large Scale (3/28/2013)
Relation to course material
Discusses the two algorithms introduced in lecture in more detail, and draws conclusions by comparing their experimental results.
Broder's algorithm is essentially the minhashing algorithm discussed in lecture; the paper goes further and computes supershingles from the minvalue vector.
Both algorithms follow the general paradigm for finding near-duplicates: generate a signature for each document, then compare signatures.
Broder's Algorithm
Begin by preprocessing each document: strip HTML tags and URLs (this preprocessing is also used for Charikar's algorithm).
Apply m fingerprint functions to the shingle sequence, and take the minvalue under each of the m functions, yielding m minvalues.
Broder's Algorithm
Divide the m minvalues into m' groups of l elements each (e.g. m = 84, m' = 6, l = 14).
Concatenate the minvalues in each group, reducing the vector from m entries to m' entries.
Fingerprint each of the m' concatenations to generate an m'-dimensional vector of supershingles.
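The construction above can be sketched as follows. This is a minimal illustration, not the paper's implementation: production systems use 64-bit Rabin fingerprints, whereas the function `h` here is a hypothetical SHA-256-based stand-in, and the shingle size k = 8 is an assumed parameter.

```python
import hashlib

def shingles(tokens, k=8):
    """All contiguous k-token shingles of a token sequence."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def h(seed, item):
    """Hypothetical stand-in for the paper's fingerprint functions."""
    data = repr((seed, item)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def supershingles(tokens, m=84, m_prime=6, k=8):
    """m minvalues, grouped into m' groups of l and re-fingerprinted."""
    l = m // m_prime  # group size, e.g. 84 / 6 = 14
    # One minvalue per fingerprint function over all shingles.
    mins = [min(h(i, s) for s in shingles(tokens, k)) for i in range(m)]
    # Concatenate each group of l minvalues and fingerprint the result.
    return [h(-1, tuple(mins[g * l:(g + 1) * l])) for g in range(m_prime)]
```

Two pages sharing most of their shingles agree on many minvalues, so with high probability some of their supershingles coincide.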
B-Similarity
Definition: the number of identical entries in the supershingle vectors of two pages.
Two pages are near-duplicates iff their B-similarity is at least 2.
e.g. with m' = 6, pairs agreeing in at least 2 of the 6 entries are near-duplicates.
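The definition translates directly to code; here toy integer entries stand in for real supershingle fingerprints, and the function names are illustrative.

```python
def b_similarity(u, v):
    """Number of positions where two supershingle vectors agree."""
    return sum(1 for a, b in zip(u, v) if a == b)

def is_near_duplicate(u, v, threshold=2):
    """Pages are near-duplicates iff B-similarity is at least 2."""
    return b_similarity(u, v) >= threshold
```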
Charikar's algorithm
Extract a set of features (meaningful tokens) from each web page; each feature is tagged with a weight.
Project each feature (token) to a b-dimensional vector whose entries take values in {-1, 1}.
Charikar's algorithm
Sum the b-dimensional projections of all tokens, each multiplied by its weight, to form a new b-dimensional vector.
Generate the final b-dimensional bit vector by setting each positive entry to 1 and each non-positive entry to 0.
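The two steps above can be sketched as follows. This is a minimal illustration with an assumed b = 64 and a SHA-256-based stand-in for the random projections; the paper's setting uses b = 384.

```python
import hashlib

def projection(feature, b):
    """Deterministic pseudo-random {-1, +1}^b vector for a feature."""
    bits = int.from_bytes(hashlib.sha256(feature.encode()).digest(), "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(b)]

def simhash(weighted_features, b=64):
    """Weighted sum of feature projections, thresholded into bits."""
    sums = [0] * b
    for feature, weight in weighted_features.items():
        for i, s in enumerate(projection(feature, b)):
            sums[i] += weight * s
    # Positive entries become 1, non-positive entries become 0.
    return [1 if x > 0 else 0 for x in sums]
```

Heavily weighted features dominate the signed sums, so pages sharing their important tokens end up with signatures that agree on most bits.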
C-Similarity
Definition: the C-similarity of two pages is the number of bits on which their final projections agree.
Two pages are near-duplicates iff the number of agreeing bits in their projections lies above a fixed threshold.
e.g. b = 384, threshold = 372.
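Sketched on 0/1 bit vectors, with the paper's b = 384 and threshold 372 as defaults; the comparison assumes "at least threshold" agreement, and the names are illustrative.

```python
def c_similarity(u, v):
    """Number of bit positions on which two b-bit signatures agree."""
    return sum(1 for a, b in zip(u, v) if a == b)

def is_near_duplicate(u, v, threshold=372):
    """Paper's setting: b = 384 bits, at least 372 must agree."""
    return c_similarity(u, v) >= threshold
```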
Comparison of two algorithms
Broder's algorithm: considers the order of the token sequence; ignores the frequency of shingles; running time O(Tm + Dm') = O(Tm).
Charikar's algorithm: ignores the order of the token sequence; considers the frequency of terms; running time O(Tb).
Note: T is the total number of tokens in all web pages; D is the number of web pages.
Comparison of experimental results
Construct a similarity graph in which every page is a node and every edge denotes a near-duplicate pair.
A node is considered a near-duplicate page iff it is incident to at least one edge.
B-similarity graph: 27.4M near-duplicate pages out of 1.6B; average degree 135.
C-similarity graph: 35.5M near-duplicate pages out of 1.6B; average degree 92.
Comparison of experimental results
[Figure: distribution of node degree in the B-similarity and C-similarity graphs, log-log scale]
Comparison of experimental results: precision measurement
Precision for pairs from the same site is low because pages on the same site very often share boilerplate text and differ only in the main item in the center of the page.
Total precision: Broder's 0.38, Charikar's 0.50.
Precision on same-site pairs: Broder's 0.34, Charikar's 0.36.
Precision on different-site pairs: Broder's 0.86, Charikar's 0.90.
Comparison of experimental results: term differences
Broder's algorithm: average 24, median 11; 21% of pairs have a term difference of 2; 90% have a term difference below 42.
Charikar's algorithm: average 94, median 7; 24% of pairs have a term difference of 2; 90% have a term difference below 44.
Comparison of experimental results
[Figure: distribution of term differences for Broder's and Charikar's algorithms]
Comparison of experimental results: error cases
Broder's error case: an NIH database page and a Herefordshire database page on the web. The pages differ in about 20 consecutive tokens out of 1000-2000 tokens and are dominated by a large amount of boilerplate text. Charikar's algorithm handles this case because it ignores token order: the number of differing tokens is large enough to be detected.
Charikar's error case: pages on http://www.businessline.co.uk/, a UK business directory. The pages differ in only 1-5 non-consecutive tokens out of about 1000 tokens but share a large number of common tokens, though in different order. Broder's algorithm handles this case because the dispersed differing tokens generate a considerable number of distinct shingles.
A combined algorithm
Use Broder's algorithm to compute all B-similar pairs first; then use Charikar's algorithm to filter out those pairs whose C-similarity falls below a certain threshold.
Rationale: the false positives of Broder's algorithm (consecutive term differences in pages with a large amount of boilerplate text) can be filtered out by Charikar's algorithm.
Overall precision improves to 0.79.
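The combined pipeline can be sketched as follows, assuming signatures are precomputed. This is a brute-force illustration: in practice, B-similar candidate pairs are found by sorting and indexing supershingles rather than by comparing all pairs, and all names and thresholds here are taken from the slides or are illustrative.

```python
from itertools import combinations

def b_similarity(u, v):
    """Agreeing entries of two supershingle vectors."""
    return sum(1 for a, b in zip(u, v) if a == b)

def c_similarity(u, v):
    """Agreeing bits of two simhash bit vectors."""
    return sum(1 for a, b in zip(u, v) if a == b)

def combined_near_duplicates(pages, b_threshold=2, c_threshold=372):
    """pages: {page_id: (supershingle_vector, simhash_bits)}."""
    result = []
    for p, q in combinations(pages, 2):
        (sp, cp), (sq, cq) = pages[p], pages[q]
        # Step 1: Broder's algorithm proposes B-similar candidate pairs.
        if b_similarity(sp, sq) >= b_threshold:
            # Step 2: keep only pairs whose C-similarity is high enough.
            if c_similarity(cp, cq) >= c_threshold:
                result.append((p, q))
    return result
```

Because the C-similarity check runs only on the candidate pairs, the simhash signatures can be computed on the fly for those pairs instead of being stored for every page.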
Pros
The experiments are persuasive and reliable for establishing the pros and cons of the two algorithms: large data sample, human evaluation, and error-case analysis.
The combined approach inherits the advantages of both algorithms, avoiding a large number of false positives.
In the combined approach, Charikar's algorithm is computed on the fly, which saves considerable space.
Cons
The experiments focus on the precision of the two algorithms but collect no statistics on recall.
The combined algorithm incurs extra running time, because finding a near-duplicate pair requires running both algorithms.
Improvement
Consider token order in Charikar's algorithm by using shingles as features.
Consider token frequency in Broder's algorithm by weighting shingles according to their frequency.