Document sketching

Document sketching
• Problem: duplicate or near-duplicate identification in a collection of documents
• How to measure the similarity between documents?
• A reasonable (?) candidate: edit distance
• Computationally expensive

• Another measure: resemblance due to [Broder ‘97]

Resemblance of documents [Broder ‘97]
• $r(A, B)$: resemblance between documents $A$ and $B$
• $0 \le r(A, B) \le 1$. Similar means close to $1$
• Convert documents to a set of integers
• A contiguous sequence of length $w$ contained in document $D$ is called a $w$-shingle
• Example: $D$ = (a rose is a rose is a rose)
• $4$-shingles of $D$ are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)
• The set of $4$-shingles of $D$: {(a rose is a), (rose is a rose), (is a rose is)}
• Map shingles to integers in $[n] = \{1, \dots, n\}$ (for some fixed $n$)
• From now on, identify the documents with sets of integers in $[n]$
• Thus a document is represented as a set of integers $A \subseteq [n]$
• $r(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (also known as Jaccard similarity between sets $A$ and $B$)
• Thus, $r(A, A) = 1$, but $r(A, B) = 1$ does not mean $A = B$ as documents
• In practice, $r(A, B)$ is a reasonable approximation of the informal notion of similarity of $A$ and $B$
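The shingling step and the resemblance formula can be sketched in a few lines of Python. This is only an illustration: the helper names `shingles` and `resemblance` are made up here, words are split on whitespace, and shingles are kept as tuples of words instead of being mapped to integers. It also shows why resemblance 1 does not imply equal documents.

```python
from __future__ import annotations


def shingles(text: str, w: int) -> set[tuple[str, ...]]:
    """Return the set of w-shingles (contiguous runs of w words) of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a rose is a rose is a rose"  # a different, longer document
s1, s2 = shingles(doc1, 4), shingles(doc2, 4)

print(sorted(s1))           # the three distinct 4-shingles of doc1
print(resemblance(s1, s2))  # 1.0, although doc1 != doc2
```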

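Estimating resemblance
• Given: $A, B \subseteq [n]$
• Estimate: $r(A, B)$
• Exact computation of $r(A, B)$ requires $\Omega(|A| + |B|)$ time
• A basic estimator for $r(A, B)$
• $S_n$: set of permutations of $[n]$
• Choose a random $\pi \in S_n$
• $\Pr[\min \pi(A) = \min \pi(B)] = \frac{|A \cap B|}{|A \cup B|} = r(A, B)$
• Variance too high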
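A short derivation may help explain why a single random permutation gives an unbiased but noisy estimate. The notation is assumed here: $A, B \subseteq [n]$ are the two documents, $\pi$ is a uniformly random permutation of $[n]$, and $X$ is the indicator of the event that the two minima coincide.

```latex
% The smallest element of \pi(A \cup B) is the image of a uniformly random
% element of A \cup B. The minima of \pi(A) and \pi(B) coincide exactly when
% that element lies in A \cap B.
\[
  \Pr_{\pi}\bigl[\min \pi(A) = \min \pi(B)\bigr]
    = \frac{|A \cap B|}{|A \cup B|}
    = r(A, B).
\]
% With X = 1 if the minima coincide and X = 0 otherwise, X is a Bernoulli
% variable with parameter r(A, B):
\[
  \mathbb{E}[X] = r(A, B),
  \qquad
  \operatorname{Var}[X] = r(A, B)\,\bigl(1 - r(A, B)\bigr).
\]
```

Since $X$ only takes the values 0 and 1, a single sample says little about $r(A, B)$ unless the resemblance is very close to 0 or 1.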

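Reducing variance
First method
• Sample $k$ random permutations $\pi_1, \dots, \pi_k \in S_n$
• Sketch of document $A$ is $(\min \pi_1(A), \dots, \min \pi_k(A))$
• Resemblance can be estimated as $\frac{1}{k} \, |\{ i : \min \pi_i(A) = \min \pi_i(B) \}|$

(this is an unbiased estimator; proof follows from the previous slide)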
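The first method can be sketched as follows. This is a minimal illustration under stated assumptions: documents are non-empty sets of integers in `range(n)` (0-indexed), each permutation is stored explicitly as a shuffled list, and the names `make_permutations`, `sketch` and `estimate` are hypothetical.

```python
import random


def make_permutations(n, k, rng):
    """Draw k independent uniformly random permutations of range(n)."""
    perms = []
    for _ in range(k):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    return perms


def sketch(doc, perms):
    """Sketch of a document: its minimum image under each permutation."""
    return [min(p[x] for x in doc) for p in perms]


def estimate(sk_a, sk_b):
    """Fraction of permutations on which the two minima agree."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)


rng = random.Random(0)
perms = make_permutations(n=100, k=200, rng=rng)
A = set(range(0, 60))
B = set(range(30, 90))  # true resemblance is 30 / 90 = 1/3
est = estimate(sketch(A, perms), sketch(B, perms))
print(est)
```

Averaging over k permutations keeps the estimate unbiased and divides the variance of the single-permutation estimator by k.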

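Second method
• Let $\min_s(W)$ denote the set of $s$ smallest elements of $W$, and if $|W| < s$, then $\min_s(W) = W$
• For a constant $s$, and uniformly random $\pi \in S_n$, with $M(A) = \min_s(\pi(A))$ and $M(B) = \min_s(\pi(B))$,
  $\frac{|\min_s(M(A) \cup M(B)) \cap M(A) \cap M(B)|}{|\min_s(M(A) \cup M(B))|}$
  is an unbiased estimator of $r(A, B)$ (details on the board)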

• We can estimate $r(A, B)$ within multiplicative error $1 \pm \varepsilon$ by taking $k$ (first method) or $s$ (second method) sufficiently large (for both methods above)

• The second method above gives us a way of sketching the documents: fix a permutation $\pi$ and a constant $s$. For document $A$, its sketch is $\min_s(\pi(A))$

• Now given the sketches of documents $A_1, \dots, A_m$, using the same permutation $\pi$, we can estimate the resemblance of pairs $A_i, A_j$

• Sketch of a document takes $O(s)$ space and estimating resemblance takes $O(s)$ time

• (We can also do it with the first method, but we will need to store $k$ permutations)
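The single-permutation sketch can be illustrated as below. This is a sketch under assumptions, not a reference implementation: documents are sets of integers in `range(n)`, the permutation is an explicit shuffled list, and the names `bottom_s`, `sketch` and `estimate` are invented for the example. The estimate uses only the two sketches, relying on the fact that the s smallest elements of the union of two sketches are the s smallest images of the union of the two documents.

```python
import random


def bottom_s(values, s):
    """The s smallest elements of `values` (all of them if there are fewer)."""
    return set(sorted(values)[:s])


def sketch(doc, perm, s):
    """Sketch of a document: the s smallest images of its elements."""
    return bottom_s((perm[x] for x in doc), s)


def estimate(sk_a, sk_b, s):
    """Estimate resemblance from two sketches built with the same permutation."""
    merged = bottom_s(sk_a | sk_b, s)  # s smallest images of the union
    if not merged:
        return 1.0  # both documents are empty
    return len(merged & sk_a & sk_b) / len(merged)


rng = random.Random(1)
n, s = 1000, 100
perm = list(range(n))
rng.shuffle(perm)

A = set(range(0, 600))
B = set(range(300, 900))  # true resemblance is 300 / 900 = 1/3
sk_a, sk_b = sketch(A, perm, s), sketch(B, perm, s)
est = estimate(sk_a, sk_b, s)
print(est)
```

Each sketch holds at most s integers, and sorting the union of two sketches keeps the comparison cost independent of the document sizes.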

Document sketching in small space
• One problem with this: storing permutations is expensive (on the order of $n \log n$ bits for each permutation of $[n]$)

• Question: Can we work with a small set of permutations instead of $S_n$?

• Yes: Min-wise independent permutations [Broder et al. ‘98]
• Can also use 2-wise independent hash functions [Thorup 2013]

Sampling from data streams

Sampling from a data stream
• How to select a uniformly random size-$k$ subset of $[n]$?
• Choose the $i$th element with probability $\frac{k - j}{n - i + 1}$ if $j$ elements have already been selected

• What if the set is given via a stream and we don’t know its length in advance?

• There is a solution similar to the previous one, but the following is easier:
• For the $i$th item, sample $u_i \in [0, 1]$ uniformly at random, keep the $k$ items with highest value of the $u_i$
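Both sampling schemes can be sketched in Python. The function names `sample_known_length` and `sample_stream` are hypothetical, and the streaming version uses a min-heap of size k, which is one possible way to keep track of the k highest random values.

```python
import heapq
import random


def sample_known_length(n, k, rng):
    """Uniform size-k subset of {1, ..., n}, scanning the elements in order."""
    chosen = []
    for i in range(1, n + 1):
        j = len(chosen)  # number of elements selected so far
        if rng.random() < (k - j) / (n - i + 1):
            chosen.append(i)
    return chosen


def sample_stream(stream, k, rng):
    """Keep the k items with the highest random keys; length not needed."""
    heap = []  # min-heap of (key, position, item); heap[0] has the lowest key
    for pos, item in enumerate(stream):
        key = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, pos, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, pos, item))
    return [item for _, _, item in heap]


rng = random.Random(2)
fixed = sample_known_length(n=20, k=5, rng=rng)
streamed = sample_stream(iter("abcdefghijklmnopqrstuvwxyz"), k=5, rng=rng)
print(fixed)
print(streamed)
```

The first function always returns exactly k elements when k is at most n: once the number of remaining elements equals the number still needed, the selection probability becomes 1. The second stores only k items at any time, whatever the stream length.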

Sampling for subset sum estimation
• Given a stream of positive weights $w_1, \dots, w_n$, we want to keep a small amount of information so that later we can estimate the weight of any given subset (the weight of a subset is the sum of the weights in it)

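First solution (Poisson sampling)
• Choose any probabilities $p_i \in (0, 1]$ for each weight $w_i$
• On encountering $w_i$, include it in set $S$ with probability $p_i$ (independent of previous decisions)
• Given any set $T \subseteq [n]$ (chosen in advance before the selection of $S$), estimator for $w(T) = \sum_{i \in T} w_i$ is $\sum_{i \in T \cap S} \frac{w_i}{p_i}$
• This is an unbiased estimator for $w(T)$
• The expected number of samples is $\sum_{i} p_i$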

Poisson sampling
• Smaller sample set does not come for free: variance in the estimate of the weight of the $i$th item is $w_i^2 \left( \frac{1}{p_i} - 1 \right)$

• One issue with this solution: The sample size is not fixed (although it can be concentrated around the mean)

• Another issue: What should be the values of the $p_i$? If we want the sample to be of size $k$ in expectation, then a possible choice is $p_i = k/n$

• But
  – $n$ may not be known
  – this sampling is not weight-sensitive: we may want to choose $p_i$ to be larger for larger $w_i$ to reduce the variance
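Poisson sampling and its subset-sum estimator can be sketched as follows. The names `poisson_sample` and `estimate_subset` are hypothetical, items are identified by their 0-based position in the stream, and the weights and probabilities in the demo are made up for illustration.

```python
import random


def poisson_sample(weights, probs, rng):
    """Return {position: w_i / p_i} for the items that were sampled."""
    sample = {}
    for i, (w, p) in enumerate(zip(weights, probs)):
        if rng.random() < p:  # independent coin flip for each item
            sample[i] = w / p
    return sample


def estimate_subset(sample, subset):
    """Sum the adjusted weights of the sampled items that lie in `subset`."""
    return sum(sample.get(i, 0.0) for i in subset)


weights = [5.0, 1.0, 3.0, 8.0, 2.0, 7.0]
probs = [0.5] * len(weights)  # uniform choice p_i = k / n with k = 3, n = 6
subset = {0, 3, 5}            # true weight is 5 + 8 + 7 = 20

rng = random.Random(3)
trials = 4000
mean_est = sum(
    estimate_subset(poisson_sample(weights, probs, rng), subset)
    for _ in range(trials)
) / trials
print(mean_est)  # should be close to 20 on average
```

Dividing each sampled weight by its inclusion probability is what makes the estimator unbiased. The sample size varies from run to run, which is the first issue noted above.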

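Priority sampling [Duffield et al. 2007]
Second solution (priority sampling):
• For each item $i$, generate an independent uniform $\alpha_i \in (0, 1]$
• Priority of item $i$ is given by $q_i = w_i / \alpha_i$
• We assume all priorities are distinct (true with probability $1$)
• For a given $k$, the priority sample $S$ of size $k$ is given by the $k$ items of highest priority
• $\tau$ = the $(k+1)$th highest priority, thus $i \in S$ iff $q_i > \tau$
• For $i \in [n]$ let $\hat{w}_i = \max(w_i, \tau)$ if $i \in S$ and $\hat{w}_i = 0$ otherwise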

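Properties of priority sampling:
• Maintains sample of fixed size $k$
• For $i \in [n]$, $\mathbb{E}[\hat{w}_i] = w_i$
• And so, for $T \subseteq [n]$, $\mathbb{E}\left[\sum_{i \in T} \hat{w}_i\right] = \sum_{i \in T} w_i$ (proof on the board; also in the Duffield et al. paper)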

Priority sampling properties
We won’t prove the following:
• For distinct $i, j$, $\hat{w}_i$ and $\hat{w}_j$ have covariance $0$
• So the variance of the estimate of the weight of a set is the sum of the variances of the estimators for the items in the set

• The total variance (sum of variances of the estimators of all individual items) of priority sampling is near-minimal among unbiased estimators
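Priority sampling can be sketched as below. This is an offline illustration under assumptions: all weights are available in a list, the function name `priority_sample` is hypothetical, and a streaming version would instead maintain the k + 1 highest priorities seen so far. When fewer than k + 1 items exist, the threshold is taken to be 0 so that every item is kept with its exact weight.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return {position: max(w_i, tau)} for the k items of highest priority."""
    priorities = []
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()  # random() is in [0, 1), so alpha is in (0, 1]
        priorities.append((w / alpha, i))
    top = heapq.nlargest(k + 1, priorities)
    tau = top[k][0] if len(top) > k else 0.0  # (k+1)th highest priority
    return {i: max(weights[i], tau) for _, i in top[:k]}


weights = [5.0, 1.0, 3.0, 8.0, 2.0, 7.0, 4.0, 6.0]
subset = {0, 3, 5}  # true weight is 5 + 8 + 7 = 20

rng = random.Random(4)
sample = priority_sample(weights, k=4, rng=rng)
print(sample)

trials = 5000
mean_est = sum(
    sum(priority_sample(weights, 4, rng).get(i, 0.0) for i in subset)
    for _ in range(trials)
) / trials
print(mean_est)  # should be close to 20 on average
```

Unlike Poisson sampling, the sample always has exactly k items when the stream has at least k, and heavier items tend to get higher priorities, so they are more likely to be kept.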
