1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM...
Date posted: 21-Dec-2015
1
Edit Distance and
Large Data Sets
Ziv Bar-Yossef
Robert Krauthgamer
Ravi Kumar
T.S. Jayram
IBM Almaden, Technion
2
Motivating Example: Near-Duplicate Elimination
Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]
• Group pages into clusters of “similar” pages
• Keep one “representative” from each cluster
[Figure: pipeline with the components Web, Crawler, Page Repository, Duplicate elimination, Page Repository]
3
Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]
Challenges
• Corpus is huge (billions of pages, 10K/page)
• Streaming access
• Limited main memory
• Linear running time
p → h(p)
Locality Sensitive Hashes [Indyk, Motwani 98]
$\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$
Cluster: collection of pages that have a common sketch
• Can compute sketches in one pass
• Sketches can be stored and processed on a single machine
4
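Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
w-shingling: $S_w(p)$ = all substrings of p of length w
$\mathrm{resemblance}_w(p,q) = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
$\Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$, where $\pi$ is a random permutation
[Figure: the shingle sets $S_w(p)$ and $S_w(q)$]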
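The identity above can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, and the random permutation is drawn only over the union of the two shingle sets, which is enough to decide whether the two minima coincide.

```python
import random

def shingles(p: str, w: int) -> set[str]:
    """S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p: str, q: str, w: int) -> float:
    """|S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

def estimate_resemblance(p: str, q: str, w: int,
                         trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random orderings under which both sets share their minimum."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(universe)  # a fresh random permutation of the union
        rank = {s: i for i, s in enumerate(universe)}
        if min(sp, key=rank.get) == min(sq, key=rank.get):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    p, q = "abcdefgh", "abcdxfgh"
    print(resemblance(p, q, 2), estimate_resemblance(p, q, 2))
```

In a real system one would keep only the minimum (or a few minima) per page as the sketch instead of re-shuffling per pair; the loop here exists only to estimate the probability.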
5
The Sketching Model
[Figure: Alice holds x and Bob holds y. Using shared randomness, each sends a sketch of their input to a Referee, who decides between $d(x,y) \le k$ and $d(x,y) \ge r$.]
k vs. r Gap Problem
Promise: $d(x,y) \le k$ or $d(x,y) \ge r$
Goal: Decide which of the two holds.
Approximation
6
Applications of Sketching
Large data sets
• Clustering
• Nearest Neighbor schemes
• Data streams
Management of Files over the Network
• Differential backup
• Synchronization
Theory
• Low distortion embeddings
• Simultaneous messages communication complexity
7
Known Sketching Schemes
• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
• Cosine similarity [Charikar 02]
• Earth mover distance [Charikar 02]
In this talk: Edit Distance
8
Edit Distance
$x \in \Sigma^n$, $y \in \Sigma^m$
ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y.
Examples:
ED(00000, 1111) = 5
ED(01010, 10101) = 2
Applications
• Genomics
• Text processing
• Web search
For simplicity: m = n, $\Sigma = \{0,1\}$.
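The definition and the two examples can be reproduced with the textbook quadratic-time dynamic program. This is a minimal Python sketch; the function name is hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    print(edit_distance("00000", "1111"))   # 5
    print(edit_distance("01010", "10101"))  # 2
```

Only two rows of the table are kept, so memory is linear in the length of y while time stays proportional to the product of the two lengths.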
9
Computing Edit Distance
Exact Computation
• Dynamic programming (1970): $O(n^2)$
• Masek and Paterson (1980): $O(n^2/\log n)$
• Impractical for comparing two very long strings.
• Natural question 1: can we do it in linear time?
• Impractical for handling massive document repositories.
• Natural question 2: are there constant size sketches of edit distance?
Can we solve the above problems if we settle for approximation?
Focus of this talk
10
Sketching Schemes for Edit Distance
Algorithm                             Gap                              Sketch size
Batu et al                            $O(n^\alpha)$ vs. $\Omega(n)$    $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$
This paper                            k vs. $O((kn)^{2/3})$            O(1)
This paper (non-repetitive strings)   k vs. $O(k^2)$                   O(1)
Negative Indications
• No known embeddings of edit distance into a normed space.
• Every embedding of edit distance into $L_1$ incurs $\ge 3/2$ distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]
• Weak nearest neighbor schemes [Indyk 04]
11
Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]
Ham(x,y) = # of positions in which x,y differ
Gap: k vs. 2k    Sketch size: O(1)
Shared randomness:
$r_1,\ldots,r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$
Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$
$h(y) = (\sum_i y_i r_i) \bmod 2$
Analysis:
$\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2] = \tfrac{1}{2}\left(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\right)$
Sketch of x: $(h_1(x),\ldots,h_t(x))$; sketch of y: $(h_1(y),\ldots,h_t(y))$; t = O(1)
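The analysis can be sanity-checked by simulation. The following is a minimal Python sketch with hypothetical function names. It assumes each shared bit equals 1 with probability 1/(2k), the bias that makes the parity of the differing positions match the closed form ½(1 − (1 − 1/k)^Ham(x,y)).

```python
import random

def draw_mask(n: int, k: int, rng: random.Random) -> list[int]:
    """Shared randomness: n independent bits, each 1 with probability 1/(2k) (assumed bias)."""
    return [1 if rng.random() < 1 / (2 * k) else 0 for _ in range(n)]

def parity_hash(x: list[int], r: list[int]) -> int:
    """h(x) = (sum_i x_i r_i) mod 2."""
    return sum(xi & ri for xi, ri in zip(x, r)) % 2

def mismatch_probability(ham: int, k: int) -> float:
    """Closed form from the analysis: 1/2 * (1 - (1 - 1/k)^ham)."""
    return 0.5 * (1 - (1 - 1 / k) ** ham)

def empirical_mismatch(x: list[int], y: list[int], k: int,
                       trials: int = 5000, seed: int = 0) -> float:
    """Fraction of random masks under which h(x) != h(y)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        r = draw_mask(len(x), k, rng)
        bad += parity_hash(x, r) != parity_hash(y, r)
    return bad / trials

if __name__ == "__main__":
    k = 4
    x = [0] * 20
    y = [1] * 8 + [0] * 12  # Ham(x, y) = 8 = 2k
    print(mismatch_probability(8, k), empirical_mismatch(x, y, k))
```

Because the mismatch probability is strictly increasing in Ham(x,y), a constant number t of independent one-bit hashes is enough to separate the "at most k" case from the "at least 2k" case with constant confidence.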
12
Edit Distance Sketches: Basic Framework
Underlying Principle
ED(x,y) is small iff x and y share many common substrings at nearby positions.
$S_x$ = set of pairs of the form (s, h(i)), where
• s: a substring of x
• h(i): a “locality sensitive” encoding of the substring’s position
[Figure: x mapped to $S_x$ and y mapped to $S_y$; common substrings at nearby positions]
ED(x,y) small iff intersection $S_x \cap S_y$ large
13
Basic Framework (cont.)
• Need to estimate size of symmetric difference
• Hamming distance computation of characteristic vectors
• Use constant size sketches [KOR]
[Figure: x mapped to $S_x$ and y mapped to $S_y$]
ED(x,y) small iff symmetric difference $S_x \,\Delta\, S_y$ small
Reduced Edit Distance to Hamming Distance
14
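General Case: Encoding Scheme
Gap: k vs. $O((kn)^{2/3})$
$B = n^{2/3}/k^{1/3}$, $W = n/B$
B windows of size W each.
[Figure: x and y with positions 1 through 14 grouped into windows; substrings 1, 2, 3 marked]
$S_x = \{(s_1, 1), (s_2, 1), (s_3, 2), \ldots, (s_i, \mathrm{win}(i)), \ldots\}$, where $s_i$ is the substring of x starting at position i
$S_y = \{(s'_1, 1), (s'_2, 1), (s'_3, 2), \ldots, (s'_i, \mathrm{win}(i)), \ldots\}$, where $s'_i$ is the substring of y starting at position i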
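The encoding can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: substrings are taken to have length B (inferred from the later bound of "at most kB marked substrings"), win(i) is taken to be the 1-based index of the size-W window containing position i, and all function names are hypothetical.

```python
import math

def parameters(n: int, k: int) -> tuple[int, int]:
    """B = n^(2/3) / k^(1/3) and W = n / B, rounded to integers."""
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))
    W = max(1, math.ceil(n / B))
    return B, W

def encode(x: str, B: int, W: int) -> set[tuple[str, int]]:
    """S_x: each length-B substring (assumed length) paired with the window of its start."""
    return {(x[i:i + B], i // W + 1) for i in range(len(x) - B + 1)}

def set_distance(sx: set, sy: set) -> int:
    """|S_x Δ S_y|, i.e. the Hamming distance of the characteristic vectors."""
    return len(sx ^ sy)

if __name__ == "__main__":
    x = "0110100110010110" * 4         # n = 64
    y = x[:10] + "1" + x[10:]          # one insertion, so ED(x, y) = 1
    B, W = parameters(len(x), 1)
    print(B, W, set_distance(encode(x, B, W), encode(y, B, W)))
```

The characteristic vectors themselves are never materialized; the Hamming sketch only needs the set elements, which can be hashed to coordinates.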
15
Analysis
[Figure: x and y with positions 1 through 14; substring i of x and its companion j in y]
Case 1: $\mathrm{ED}(x,y) \le k$
• If i is “unmarked”, it has a matching “companion” j
• $(s_i, \mathrm{win}(i)) \in S_x \setminus S_y$, where $s_i$ is the substring of x at position i, only if:
• either i is “marked”
• or i is unmarked, but $\mathrm{win}(i) \ne \mathrm{win}(j)$
• At most kB marked substrings
• At most $k \cdot n/W = kB$ companions with mismatched windows
• Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$
16
Analysis (cont.)
[Figure: x and y with positions 1 through 14; labels 1, 2, B-1, B+1, 2B+1]
Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$
• If i has a “companion” j and win(i) = win(j), can align i with j using at most W operations
• Otherwise, substitute first character of i
• At most 8kB substrings of x have no companion
• Therefore, $\mathrm{ED}(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
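The final bound follows by substituting the parameter choice $B = n^{2/3}/k^{1/3}$, $W = n/B$ from the encoding scheme. The arithmetic, written out as a worked step that is not on the slide itself:

```latex
\begin{align*}
W &= \frac{n}{B} = \frac{n \, k^{1/3}}{n^{2/3}} = (kn)^{1/3}, \\
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= W^2 = (kn)^{2/3}, \\
\mathrm{ED}(x,y) &\le 8kB + W \cdot \frac{n}{B} = 9\,(kn)^{2/3} = O\!\left((kn)^{2/3}\right).
\end{align*}
```

This choice of B balances the two terms: both kB and $W \cdot n/B$ equal $(kn)^{2/3}$.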
17
Non-repetitive Case: Encoding Scheme
Gap: k vs. O(kW)
$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$; no substring of length t repeats within a window of size W
[Figure: x and y, each with windows of size W; $x_1, x_2$ and $y_1, y_2$ marked; positions 1 through 7]
Alice and Bob choose a sequence of “anchors” in a coordinated way
$\pi_1$: a random permutation on $\{0,1\}^t$
First anchor of x: minimal length-t substring of $x_1$ (under $\pi_1$)
First anchor of y: minimal length-t substring of $y_1$ (under $\pi_1$)
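The first anchor selection step can be illustrated as follows. This is a minimal Python sketch with hypothetical names. The shared seed stands in for the shared randomness, and the permutation is built by explicitly shuffling all of {0,1}^t, which is only practical for small t. How later anchors are chosen is not spelled out on the slide and is not modeled here.

```python
import itertools
import random

def random_ranking(t: int, seed: int) -> dict[str, int]:
    """A random permutation of {0,1}^t, stored as a map from string to rank."""
    words = ["".join(bits) for bits in itertools.product("01", repeat=t)]
    random.Random(seed).shuffle(words)  # seed plays the role of shared randomness
    return {word: rank for rank, word in enumerate(words)}

def anchor(window: str, t: int, rank: dict[str, int]) -> tuple[int, str]:
    """Return (start, substring) of the minimal length-t substring of window under rank."""
    starts = range(len(window) - t + 1)
    best = min(starts, key=lambda i: rank[window[i:i + t]])
    return best, window[best:best + t]

if __name__ == "__main__":
    rank = random_ranking(3, seed=7)
    x1 = "0001011100"   # no length-3 substring repeats in this window
    y1 = x1[1:]         # same window after deleting the first character
    print(anchor(x1, 3, rank), anchor(y1, 3, rank))
```

When the two windows contain the same minimal substring, Alice and Bob pick the same anchor without communicating, even though its position differs by the amount of the shift.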
18
Encoding scheme (cont.)
[Figure: x and y, each divided into substrings numbered 1 through 8]
$S_x$ = { (i-th substring of x, i) : i = 1,…,8 }
$S_y$ = { (i-th substring of y, i) : i = 1,…,8 }
19
Analysis
[Figure: x and y, each divided into substrings numbered 1 through 8]
Case 1: $\mathrm{ED}(x,y) \le k$.
• All anchors are “unmarked” with probability $1 - kt/W = \Omega(1)$
• If the i-th anchors of x and y are unmarked, they are aligned
• # of mismatching substrings $\le 2k$
• $\mathrm{Ham}(S_x,S_y) \le 2k$
20
Analysis (cont.)
[Figure: x and y, each divided into substrings numbered 1 through 8]
Case 2: $\mathrm{Ham}(S_x,S_y) \le 4k$
• # of mismatching substrings $\le 4k$
• $\mathrm{ED}(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.
21
Approximation in Linear Time
Arbitrary Strings
Algorithm             Gap                              Time                                     Approx. factor in O(n) time
Dynamic Programming   k vs. k+1                        O(kn)                                    None
Batu et al            $O(n^\alpha)$ vs. $\Omega(n)$    $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$    None
Cole, Hariharan       k vs. 2k                         $O(n + k^4)$                             $O(n^{3/4})$
This paper            k vs. $k^{7/4}$                  O(n)                                     $O(n^{3/7})$
Non-repetitive Strings
Algorithm             Gap                              Time                                     Approx. factor in O(n) time
Cole, Hariharan       k vs. 2k                         $O(n + k^3)$                             $O(n^{2/3})$
This paper            k vs. $k^{3/2}$                  O(n)                                     $O(n^{1/3})$
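One plausible reading of the last column, which the slide does not spell out, is the following. For the gap algorithms of this paper, the worst case is the largest k at which the upper end of the gap is still below the trivial bound n. For the Cole, Hariharan algorithms, the worst case is the largest k at which the running time is still O(n), beyond which only the trivial bound n is available. Under that assumption the arithmetic matches every entry:

```latex
\begin{align*}
\text{This paper, arbitrary: } & k^{7/4} = n \;\Rightarrow\; k = n^{4/7}, & \frac{k^{7/4}}{k} &= k^{3/4} = n^{3/7}, \\
\text{This paper, non-repetitive: } & k^{3/2} = n \;\Rightarrow\; k = n^{2/3}, & \frac{k^{3/2}}{k} &= k^{1/2} = n^{1/3}, \\
\text{Cole, Hariharan, arbitrary: } & k^{4} = n \;\Rightarrow\; k = n^{1/4}, & \frac{n}{k} &= n^{3/4}, \\
\text{Cole, Hariharan, non-repetitive: } & k^{3} = n \;\Rightarrow\; k = n^{1/3}, & \frac{n}{k} &= n^{2/3}.
\end{align*}
```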
22
Summary and Open Problems
• Designed efficient approximation schemes for edit distance.
– Best sketching and linear-time approximations to date
• Subsequent work:
– $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04] [Rabani 04]
– Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani, 05]
– Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer, 05]
• Open Problems
– Sketch size lower bounds
– Constant factor approximations in linear time
– Better embeddings of edit distance
– Sketching schemes for other distance measures
23
Thank You