Document sketching
• Problem: duplicate or near-duplicate identification in a collection of documents
• How to measure the similarity between documents?
• A reasonable (?) candidate: edit distance
• Computationally expensive
• Another measure: resemblance, due to [Broder ‘97]
Resemblance of documents [Broder ‘97]
• $r(A,B)$: resemblance between documents $A$ and $B$
• $0 \le r(A,B) \le 1$. Similar means close to $1$
• Convert documents to a set of integers
• A contiguous sequence of length $k$ contained in document $D$ is called a $k$-shingle
• Example: $D$ = (a rose is a rose is a rose is a rose)
• $4$-shingles of $D$ are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)
• The set of $4$-shingles of $D$: {(a rose is a), (rose is a rose), (is a rose is)}
• Map shingles to integers in $[n] = \{1, \dots, n\}$ (for some fixed $n$)
• From now on, identify the documents with sets of integers in $[n]$
• Thus a document $A$ is represented as a set of integers $A \subseteq [n]$
• $r(A,B) = \frac{|A \cap B|}{|A \cup B|}$ (also known as Jaccard similarity between sets $A$ and $B$)
• Thus, $r(A,A) = 1$, but $r(A,B) = 1$ does not mean that the documents $A$ and $B$ are identical
• In practice, $r(A,B)$ is a reasonable approximation of the informal notion of similarity of $A$ and $B$
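The shingling and resemblance definitions above can be sketched in Python as follows. This is a minimal illustration, assuming whitespace tokenization and keeping shingles as word tuples instead of mapping them to integers; the function names `shingles` and `resemblance` are hypothetical.

```python
def shingles(text, k):
    """Return the set of k-shingles (as word tuples) of a document."""
    words = text.split()  # assumption: words are separated by whitespace
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty documents
    return len(a & b) / len(union)


d1 = shingles("a rose is a rose is a rose", 4)
d2 = shingles("a rose is a rose is a rose is a rose", 4)
print(len(d1))              # 3 distinct 4-shingles
print(resemblance(d1, d2))  # 1.0, although the two documents differ
```

The last line shows why resemblance $1$ does not imply identical documents: both texts produce the same three shingles.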
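Estimating resemblance
• Given: $A, B \subseteq [n]$
• Estimate: $r(A,B) = \frac{|A \cap B|}{|A \cup B|}$
• Exact computation of $r(A,B)$ requires time linear in $|A| + |B|$
• A basic estimator for $r(A,B)$
• $S_n$: set of permutations of $[n]$
• Choose a random $\pi \in S_n$
• $\Pr[\min \pi(A) = \min \pi(B)] = \frac{|A \cap B|}{|A \cup B|} = r(A,B)$, so the indicator of the event $\min \pi(A) = \min \pi(B)$ is an unbiased estimator of $r(A,B)$
• Variance too high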
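A small simulation can illustrate the basic estimator. The sketch below assumes elements are numbered $0, \dots, n-1$ rather than $1, \dots, n$, and draws a fresh uniformly random permutation per trial; `minhash_indicator` is a hypothetical helper name.

```python
import random


def minhash_indicator(a, b, n, rng):
    """Return 1 if min pi(a) == min pi(b) for a fresh random permutation pi."""
    perm = list(range(n))
    rng.shuffle(perm)  # perm[x] plays the role of pi(x)
    return int(min(perm[x] for x in a) == min(perm[x] for x in b))


rng = random.Random(0)
A, B = {0, 1, 2, 3}, {2, 3, 4, 5}  # r(A, B) = 2/6
trials = 20000
mean = sum(minhash_indicator(A, B, 10, rng) for _ in range(trials)) / trials
print(mean)  # averages out near 1/3; a single trial is only ever 0 or 1
```

A single trial returns 0 or 1, which is why one permutation alone is too noisy to be useful.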
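Reducing variance
First method
• Sample $k$ random permutations $\pi_1, \dots, \pi_k$ of $[n]$
• Sketch of document $A$ is $(\min \pi_1(A), \dots, \min \pi_k(A))$
• Resemblance can be estimated as $\frac{1}{k} \left| \{ i : \min \pi_i(A) = \min \pi_i(B) \} \right|$ (this is an unbiased estimator; proof follows from the previous slide)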
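The first method can be sketched as below: one minimum per permutation, and the estimate is the fraction of positions where two sketches agree. This is an illustrative sketch with elements numbered $0, \dots, n-1$ and explicitly stored permutations; the helper names are hypothetical.

```python
import random


def make_permutations(n, k, seed=0):
    """Draw k independent uniformly random permutations of range(n)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        p = list(range(n))
        rng.shuffle(p)
        perms.append(p)
    return perms


def sketch(doc, perms):
    """Sketch of a document: the minimum of pi_i(doc) for each permutation."""
    return [min(p[x] for x in doc) for p in perms]


def estimate(sk_a, sk_b):
    """Fraction of permutations on which the two minima coincide."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)


perms = make_permutations(n=50, k=400, seed=1)
A, B = set(range(0, 30)), set(range(10, 40))  # r(A, B) = 20/40
print(estimate(sketch(A, perms), sketch(B, perms)))
```

Averaging $k$ independent indicators divides the variance of the basic estimator by $k$.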
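Second method
• Let $\min_s(W)$ denote the set of $s$ smallest elements of $W$, and if $|W| < s$, then $\min_s(W) = W$
• For a constant $s$, and uniformly random permutation $\pi$ of $[n]$, $\frac{|\min_s(\pi(A \cup B)) \cap \min_s(\pi(A)) \cap \min_s(\pi(B))|}{|\min_s(\pi(A \cup B))|}$ is an unbiased estimator of $r(A,B)$ (details on the board)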
• We can estimate $r(A,B)$ within multiplicative error $1 \pm \epsilon$ by taking $k$ (first method) or $s$ (second method) large enough
• The second method above gives us a way of sketching the documents: fix a permutation $\pi$ and a constant $s$. For document $A$, its sketch is $\min_s(\pi(A))$
• Now given the sketches of documents $A_1, \dots, A_m$, using the same permutation $\pi$, we can estimate the resemblance $r(A_i, A_j)$ of any pair
• Sketch of a document takes $O(s)$ space and estimating resemblance takes $O(s)$ time
• (We can also do it with the first method but we will need to store $k$ permutations)
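The single-permutation sketch can be illustrated as follows. The estimate needs only the two sketches, because the $s$ smallest elements of $\pi(A \cup B)$ all lie in the union of the two sketches. This is a minimal sketch assuming non-empty documents with elements numbered $0, \dots, n-1$; the helper names are hypothetical.

```python
import random


def bottom_s(values, s):
    """min_s(W): the s smallest elements of W (all of W if |W| < s)."""
    return set(sorted(values)[:s])


def sketch(doc, perm, s):
    """Sketch of a document under a fixed permutation: min_s(pi(doc))."""
    return bottom_s((perm[x] for x in doc), s)


def estimate(sk_a, sk_b, s):
    """Estimate r(A, B) from the two sketches alone."""
    union_min = bottom_s(sk_a | sk_b, s)  # equals min_s(pi(A union B))
    return len(union_min & sk_a & sk_b) / len(union_min)


rng = random.Random(2)
perm = list(range(1000))
rng.shuffle(perm)  # one permutation shared by all documents
A, B = set(range(0, 600)), set(range(300, 900))  # r(A, B) = 300/900
sk_a, sk_b = sketch(A, perm, 200), sketch(B, perm, 200)
print(estimate(sk_a, sk_b, 200))
```

Each sketch holds at most $s$ integers, and with sorted sketches the estimate could be computed by a single merge pass; the version here re-sorts for brevity.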
Document sketching in small space
• One problem with this: storing permutations is expensive
• Question: Can we work with a small set of permutations instead of $S_n$?
• Yes: Min-wise independent permutations [Broder et al. ‘98]
• Can also use 2-wise independent hash functions [Thorup 2013]
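To illustrate the space saving, the sketch below replaces the stored permutation with a linear hash $h(x) = (ax + b) \bmod p$, which needs only the two integers $a$ and $b$. This is an assumption-laden illustration: the prime $p = 2^{61} - 1$, the restriction $a \ne 0$, and the helper names are choices made here, and whether this particular family meets the guarantees in the cited papers is not checked.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumes all item ids are below P


def make_hash(rng):
    """Draw h(x) = (a*x + b) mod P; only the pair (a, b) needs to be stored."""
    a = rng.randrange(1, P)  # a != 0 keeps h injective on [0, P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P


def sketch(doc, h, s):
    """Bottom-s sketch: the s smallest hash values of the document."""
    return set(sorted(h(x) for x in doc)[:s])


def estimate(sk_a, sk_b, s):
    """Same bottom-s estimator as with a permutation, with h in place of pi."""
    union_min = set(sorted(sk_a | sk_b)[:s])
    return len(union_min & sk_a & sk_b) / len(union_min)


h = make_hash(random.Random(3))
A, B = set(range(0, 600)), set(range(300, 900))
print(estimate(sketch(A, h, 64), sketch(B, h, 64), 64))
```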
Sampling from data streams
Sampling from a data stream
• How to select a uniformly random size-$k$ subset of $[n]$?
• Choose the $i$th element with probability $\frac{k - j}{n - i + 1}$ if $j$ elements have already been selected
• What if the set is given via a stream and we don’t know its length in advance?
• There is a solution similar to the previous one, but the following is easier:
• For the $i$th item, sample $\alpha_i \in [0,1]$ uniformly at random, keep the $k$ items with highest value of the $\alpha_i$
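Both selection schemes above can be sketched in a few lines. The code assumes 0-based positions, so the element at position `seen` is the $(\text{seen}+1)$th element and is kept with probability $(k - j)/(n - \text{seen})$; the function names are hypothetical, and the min-heap is just one convenient way to track the $k$ largest random tags.

```python
import heapq
import random


def select_known_length(n, k, rng):
    """Uniform size-k subset of range(n) when n is known in advance."""
    chosen = []
    for seen in range(n):  # 'seen' elements have been examined so far
        # keep this element with probability (k - j) / (n - seen)
        if rng.random() * (n - seen) < k - len(chosen):
            chosen.append(seen)
    return chosen


def select_from_stream(stream, k, rng):
    """Uniform size-k subset of a stream of unknown length via random tags."""
    heap = []  # min-heap of (tag, position, item) holding the k largest tags
    for pos, item in enumerate(stream):
        entry = (rng.random(), pos, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # evict the smallest tag
    return [item for _, _, item in heap]


rng = random.Random(4)
print(select_known_length(100, 10, rng))
print(select_from_stream(iter(range(1000)), 5, rng))
```

The streaming version stores only $k$ tagged items at any time and never needs the stream length.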
Sampling for subset sum estimation
• Given a stream of positive weights $w_1, \dots, w_n$, we want to keep a small amount of information so that later we can estimate the weight of any given subset (the weight of a subset is the sum of the weights in it)
First Solution (Poisson sampling)
• Choose any probabilities $p_i$ for each weight $w_i$
• On encountering $w_i$, include it in set $S$ with probability $p_i$ (independent of previous decisions)
• Given any set $T \subseteq [n]$ (chosen in advance before the selection of $S$), estimator for $w_T = \sum_{i \in T} w_i$: $\hat{w}_T = \sum_{i \in T \cap S} \frac{w_i}{p_i}$
• This is an unbiased estimator for $w_T$
• The expected number of samples: $E[|S|] = \sum_i p_i$
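A minimal sketch of Poisson sampling and its subset-sum estimator follows. It assumes 0-based item indices and strictly positive inclusion probabilities; the function names are hypothetical.

```python
import random


def poisson_sample(probs, rng):
    """Include item i independently with probability probs[i]."""
    return {i for i, p in enumerate(probs) if rng.random() < p}


def estimate_subset(weights, probs, sample, subset):
    """Sum of w_i / p_i over the sampled items that lie in the subset."""
    return sum(weights[i] / probs[i] for i in subset if i in sample)


rng = random.Random(6)
weights = [1, 2, 3, 4, 5, 6]
probs = [0.5] * len(weights)
subset = {0, 2, 5}  # true weight 1 + 3 + 6 = 10
trials = 20000
avg = sum(
    estimate_subset(weights, probs, poisson_sample(probs, rng), subset)
    for _ in range(trials)
) / trials
print(avg)  # averages out near the true subset weight
```

Dividing each sampled weight by its inclusion probability is what makes the estimator unbiased, at the cost of a sample whose size varies from run to run.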
Poisson sampling
• Smaller sample set does not come for free: variance in the estimate of the weight of the $i$th item is $\mathrm{Var}[\hat{w}_i] = w_i^2 \left( \frac{1}{p_i} - 1 \right)$
• One issue with this solution: the sample size is not fixed (although it can be concentrated around the mean)
• Another issue: what should be the values of the $p_i$? If we want the sample to be of size $k$ in expectation, then a possible choice is $p_i = k/n$
• But
  – $n$ may not be known
  – this sampling is not weight-sensitive: we may want to choose $p_i$ to be larger for larger $w_i$ to reduce the variance
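The variance of the single-item Poisson estimator can be worked out directly. The derivation below assumes the estimator $\hat{w}_i$ equals $w_i / p_i$ when item $i$ is sampled and $0$ otherwise.

```latex
\begin{align*}
\hat{w}_i &= \begin{cases} w_i / p_i & \text{with probability } p_i \\ 0 & \text{with probability } 1 - p_i \end{cases} \\
E[\hat{w}_i] &= p_i \cdot \frac{w_i}{p_i} = w_i \\
E[\hat{w}_i^2] &= p_i \cdot \frac{w_i^2}{p_i^2} = \frac{w_i^2}{p_i} \\
\mathrm{Var}[\hat{w}_i] &= \frac{w_i^2}{p_i} - w_i^2 = w_i^2 \left( \frac{1}{p_i} - 1 \right)
\end{align*}
```

For instance, with the uniform choice $p_i = k/n$, $n = 1000$ and $k = 100$, the variance is $9 w_i^2$ for every item, so heavy items dominate the error; this is the motivation for weight-sensitive probabilities.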
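Priority sampling [Duffield et al. 2007]
Second solution (priority sampling):
• For each item $i$, generate an independent uniform $\alpha_i \in (0,1]$
• Priority of item $i$ is given by $q_i = w_i / \alpha_i$
• We assume all priorities are distinct (true with probability $1$)
• For a given $k$, the priority sample $S$ of size $k$ is given by the $k$ items of highest priority
• $\tau$ = $(k+1)$th highest priority, thus $i \in S$ iff $q_i > \tau$
• For $i \in [n]$ let $\hat{w}_i = \max\{w_i, \tau\}$ if $i \in S$ and $\hat{w}_i = 0$ otherwise
Properties of priority sampling:
• Maintains sample of fixed size $k$
• For $i \in [n]$: $E[\hat{w}_i] = w_i$
• And so, for $T \subseteq [n]$: $E\left[\sum_{i \in T} \hat{w}_i\right] = \sum_{i \in T} w_i$ (proof on the board; also in the Duffield et al. paper)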
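Priority sampling can be sketched as a one-pass procedure that keeps the $k+1$ highest priorities in a min-heap. This is an illustrative sketch, not the implementation from the Duffield et al. paper: the function name, the 0-based indices, and the choice to return threshold $0$ when the stream has at most $k$ items are assumptions made here.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return ({index: estimated weight}, tau) for a priority sample of size k."""
    heap = []  # min-heap of (priority, index) holding the k+1 highest priorities
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()  # uniform in (0, 1]
        q = w / alpha               # priority of item i
        if len(heap) < k + 1:
            heapq.heappush(heap, (q, i))
        elif q > heap[0][0]:
            heapq.heapreplace(heap, (q, i))
    if len(heap) <= k:
        tau, kept = 0.0, heap   # at most k items: keep all of them exactly
    else:
        tau = heap[0][0]        # (k+1)th highest priority
        kept = heap[1:]         # the k items with priority above tau
    return {i: max(weights[i], tau) for _, i in kept}, tau


rng = random.Random(7)
weights = [1, 2, 3, 4, 5, 10, 20, 50]
estimates, tau = priority_sample(weights, 3, rng)
print(estimates, tau)
```

To estimate the weight of a subset $T$, sum `estimates.get(i, 0.0)` over $i \in T$; items outside the sample contribute $0$.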
Priority sampling properties
We won’t prove the following:
• For distinct $i, j$, the estimators $\hat{w}_i$ and $\hat{w}_j$ have covariance $0$
• So the variance of the estimate of the weight of a set is the sum of the variances of the estimators for the items in the set
• The total variance (sum of variances of the estimators of all individual items) of priority sampling is near-minimal among unbiased estimators