1 Efficient Algorithms for Substring Near Neighbor Problem Alexandr Andoni Piotr Indyk MIT.


Efficient Algorithms for Substring Near Neighbor Problem

Alexandr Andoni

Piotr Indyk

MIT


What’s SNN?

SNN ≈ text indexing with mismatches

Text indexing: construct a data structure on a text T[1..n] such that, given a query P[1..m], it finds the occurrences of P in T.

Text indexing with mismatches: given P, find the substrings of T that are equal to P except in ≤R characters.

Motivation: e.g., computational bio (BLAST)

T= GAGTAACTCAATA

P= AGTA

T= GAGTAACTCAATA
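
As a baseline for the problem just defined, here is a minimal brute-force sketch in Python that scans every length-m substring of T and keeps those within R mismatches of P. The function names and the 0-based positions are choices made for this illustration, not part of the talk; the indexing algorithms discussed later exist precisely to avoid this linear scan.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurrences_with_mismatches(T: str, P: str, R: int) -> list[int]:
    """0-based starts i such that T[i:i+m] differs from P in at most R chars."""
    m = len(P)
    return [i for i in range(len(T) - m + 1) if hamming(T[i:i + m], P) <= R]


T = "GAGTAACTCAATA"
P = "AGTA"
print(occurrences_with_mismatches(T, P, 0))  # [1]     (exact match "AGTA")
print(occurrences_with_mismatches(T, P, 1))  # [1, 9]  ("AGTA" and "AATA")
```

The scan costs O(n * m) per query, which is the cost an index is meant to beat.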


Outline

General approach
  View: Near Neighbor in Hamming
  Focus: reducing space

Background
  Locality-Sensitive Hashing (LSH)

Solution

Reducing query & preprocessing
  Redesign LSH

Concluding remarks


Approach (Or, why SNN?)

SNN = a near neighbor (NN) problem in the Hamming metric with m dimensions:

Construct a data structure on D = {all substrings of T of length m} such that, given P, it finds a point in D at distance ≤R from P.

Use an NN data structure for Hamming.

D={GAGT, AGTA, GTAA, …. AATA}

T= GAGTAACTCAATA

P= AGTA


Approximate NN

Exact NN problem seems hard (i.e., hard w/o exponential space or O(n) query time)

Approximate NN is easier. It is defined for approximation c=1+ε as: OK to report a point at distance ≤cR (when there is a point at distance ≤R).

Algorithm        Query             Space
[KOR98, IM98]    poly(log n, m)    n^{O(1/ε^2)}
LSH [IM98]       n^{1/c} + m       n^{1+1/c}

[Figure: query point q with radii R and cR]


Our contribution

Problem: NN needs m in advance
  Have to construct a data structure for each m≤M

Here: approximate SNN data structure for unknown m
  Without degradation in space or query time

Our algorithm for SNN, based on LSH:
  Supports patterns of length m≤M
  Optimal* space: n^{1+1/c}
  Optimal* query time: n^{1/c}
  Slightly worse preprocessing time if c>3
  (* Optimal w.r.t. LSH, modulo subpolynomial factors)

Also extends to l1




Locality-Sensitive Hashing

Based on a family of hash functions {g}

For points P[1..m], Q[1..m]:
  If dist(P,Q) ≤ R, then Pr_g[g(P)=g(Q)] is “medium”
  If dist(P,Q) > cR, then Pr_g[g(P)=g(Q)] is “low”

Idea:
  Construct L hash tables with random g1, g2, …, gL
  For query P, look at buckets g1(P), g2(P), …, gL(P)

Space: L*n
Query time: L


LSH for Hamming

Hash function g: Projection on k random coordinates

E.g.: g1(“AGTA”)=“AA” (k=2)

L = #hash tables = n^{1/c}

k = |log n / log(1-cR/m)| < m * log n
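
The following short derivation is a sketch of why these choices of k and L are natural. It assumes the k coordinates of g are sampled independently and uniformly (with replacement) and that cR < m; the symbols d = dist(P,Q) and ρ are notation introduced for this note rather than taken from the slides.

```latex
\begin{align*}
\Pr_g[g(P)=g(Q)] &= \left(1-\frac{d}{m}\right)^{k},
  \qquad d=\mathrm{dist}(P,Q), \\
d > cR:\quad \Pr_g[g(P)=g(Q)] &\le \left(1-\frac{cR}{m}\right)^{k} = \frac{1}{n}
  \quad\text{for } k=\frac{\ln n}{\left|\ln(1-cR/m)\right|}, \\
d \le R:\quad \Pr_g[g(P)=g(Q)] &\ge \left(1-\frac{R}{m}\right)^{k} = n^{-\rho},
  \qquad \rho=\frac{\ln(1-R/m)}{\ln(1-cR/m)} \le \frac{1}{c}, \\
k &\le \frac{m}{cR}\,\ln n
  \quad\text{since } \left|\ln(1-x)\right| \ge x \text{ for } 0 \le x < 1 .
\end{align*}
```

So a far point collides with the query in a given table with probability at most 1/n ("low"), a near point collides with probability at least n^{-ρ} ("medium"), and on the order of n^{ρ} ≤ n^{1/c} tables are enough to catch a near point with constant probability.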

T= GAGTAACTCAATA
D={GAGT, AGTA, GTAA, …, AATA}

HT1:
  GT -> GAGT
  AA -> AGTA, AATA
  GA -> GTAA
  …

P= AGTA

R=1
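
The scheme on this slide can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the class name `HammingLSH`, the fixed seed, sampling coordinates with replacement, and the small values of k and L in the usage are all assumptions made for the example.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class HammingLSH:
    """L hash tables; table i is keyed by the projection g_i onto k random coordinates."""

    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        m = len(points[0])
        self.points = points
        # Each g_i is a list of k coordinates sampled with replacement (assumption).
        self.projections = [[rng.randrange(m) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for coords in self.projections:
            table = defaultdict(list)
            for idx, p in enumerate(points):
                table[self._g(p, coords)].append(idx)
            self.tables.append(table)

    @staticmethod
    def _g(p, coords):
        return "".join(p[j] for j in coords)

    def query(self, q, cR):
        """Return some stored point within distance cR of q, or None."""
        for coords, table in zip(self.projections, self.tables):
            for idx in table.get(self._g(q, coords), ()):
                if hamming(self.points[idx], q) <= cR:
                    return self.points[idx]
        return None


T = "GAGTAACTCAATA"
m = 4
D = [T[i:i + m] for i in range(len(T) - m + 1)]
lsh = HammingLSH(D, k=2, L=4)
hit = lsh.query("AGTA", cR=2)  # R=1 and c=2 in this toy run
```

Because "AGTA" itself is in D, it always shares a bucket with the query, so this particular query is guaranteed to return some point within distance 2; which one depends on the random projections.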




Unknown m

Bad news: k depends on m!
  Distinct m ⇒ distinct hash tables

T= GAGTAACTCAATA
D={GAG, AGT, …, ACT, …}

HT1:
  GG -> GAG
  AT -> AGT, ACT, …
  …

P= AGT

R=1

g1(“AGT”)=“AT”


Solution

Let’s just reuse the same data structure for all m
  g(“AGTA”)=“AA”
  On “AGT” we have to guess the last char: g(“AGT”)=g(“AGT?”)=“A?”
  Like in [exact] text indexing…

T= GAGTAACTCAATA
D={GAGT, AGTA, …, ACTC, …}

HT1:
  GT -> GAGT
  AA -> AGTA, AATA
  GA -> GTAA
  AC -> ACTC
  …

P= AGT

R=1


Tries*!

Replace HT1 with a trie on g1(suffixes)

Stop the search when outside P

Same analysis!

T= GAGTAACTCAATA
D={GAGT, AGTA, …, ACTC, …}

HT1:
  GT -> GAGT
  AA -> AGTA, AATA
  GA -> GTAA
  AC -> ACTC
  …

P= AGT

R=1

[Figure: trie on g1 applied to the suffixes of T]

* Tries have been used with LSH before in [MS02], but in a different context
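
One way to picture the trie idea is the Python sketch below. It is a simplified, uncompressed trie written for illustration only: the dictionary-based nodes, the `$` padding symbol, storing start positions at every node, and the 0-based coordinates `[0, 3]` (standing in for the slide's g1) are all assumptions. The key point it tries to show is that the sampled coordinates are visited in increasing order, so a pattern of length m simply stops descending at the first sampled coordinate that falls outside P.

```python
END = "$"  # padding symbol past the end of T (assumed not to occur in T)


def new_node():
    return {"children": {}, "starts": []}


def build_trie(T, coords):
    """Trie on g(suffix) for every suffix of T; coords must be sorted ascending."""
    root = new_node()
    for i in range(len(T)):
        node = root
        node["starts"].append(i)
        for j in coords:
            ch = T[i + j] if i + j < len(T) else END
            node = node["children"].setdefault(ch, new_node())
            node["starts"].append(i)
    return root


def candidates(root, coords, P, n):
    """Start positions whose projection agrees with P on the sampled coordinates inside P."""
    node = root
    for j in coords:
        if j >= len(P):  # next sampled coordinate lies outside P: stop here
            break
        node = node["children"].get(P[j])
        if node is None:
            return []
    # Drop suffixes that are too short to contain a length-|P| substring.
    return [i for i in node["starts"] if i + len(P) <= n]


T = "GAGTAACTCAATA"
g1 = [0, 3]  # 0-based version of the projection with g1("AGTA") = "AA"
trie = build_trie(T, g1)
print(candidates(trie, g1, "AGTA", len(T)))  # [1, 9]  -> AGTA, AATA
print(candidates(trie, g1, "AGT", len(T)))   # [1, 4, 5, 9, 10]
```

Each candidate would still be checked against P with an explicit Hamming-distance test, exactly as with a hash-table bucket. A compressed trie, as the talk goes on to use, would avoid the per-node position lists kept here for simplicity.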


Resulting performance

Space: n^{1+1/c} (using compressed tries, one trie takes n space)
  Optimal!

Query time: n^{1/c} * m (m = length of P)
  Not [yet] really optimal: originally, could do dimensionality reduction
  Can improve to n^{1/c} + m * n^{o(1)}

Preprocessing time: n^{1+1/c} * M (M = max m)
  Not optimal (optimal = n^{1+1/c})
  Can improve to n^{1+1/c} + M^{1/3} * n^{1+o(1)}
  Optimal for c<3
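
The threshold at c = 3 can be checked with a line of arithmetic. The sketch below assumes M ≤ n (a pattern is never longer than the text), which the slides do not state explicitly, and ignores the n^{o(1)} factor.

```latex
\begin{align*}
M \le n \;\Longrightarrow\; M^{1/3}\, n \;\le\; n^{1+1/3}, \qquad
c \le 3 \;\Longrightarrow\; n^{1+1/3} \;\le\; n^{1+1/c}.
\end{align*}
```

Under those assumptions the second term of the improved preprocessing bound is dominated by n^{1+1/c}, up to subpolynomial factors, which is the bound the slide calls optimal.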




Better query & preprocessing

Redesign LSH to improve query and preprocessing:
  Query: n^{1/c} * m → n^{1/c} + m * n^{o(1)}
  Preprocessing: n^{1+1/c} * M → n^{1+1/c} + n^{1+o(1)} * M

Idea for new LSH:
  Use the same # of hash tables/tries (# = L = n^{1/c})
  But use “less randomness” in choosing the hash functions g1, g2, …, gL
  Such that each gi looks random, but the g’s are not independent


New LSH scheme

Old scheme:
  Choose L hash functions gi
  Each gi = projection on k random coordinates

New scheme:
  Construct the L functions gi from a smaller number of “base” hash functions
  A “base” hash function = projection on k/2 random coordinates
  {gi, i=1..L} = all pairs of “base” hash functions
  Need only ~L^{1/2} “base” hash functions!


Example

k=4

w = #base fns = 4

L = (w choose 2) = (4 choose 2) = 6

Base functions u1, u2, u3, u4 [coordinate selections shown only in the original figure]

g1=<u1, u2>
g2=<u1, u3>
g3=<u1, u4>
...
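
The construction in this example can be mimicked with a small Python sketch. The actual coordinate choices from the slide are not recoverable here, so the base functions below are drawn at random with a fixed seed; the helper names, the dimension M=8, and sampling with replacement are assumptions made for illustration.

```python
import random
from itertools import combinations


def make_base_functions(M, k, w, seed=0):
    """w base functions, each a tuple of k/2 random coordinates in range(M)."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(M) for _ in range(k // 2)) for _ in range(w)]


def make_hash_functions(base):
    """Every g is a pair <u_a, u_b> of distinct base functions, as concatenated coordinates."""
    return [u + v for u, v in combinations(base, 2)]


base = make_base_functions(M=8, k=4, w=4)
gs = make_hash_functions(base)
print(len(base), len(gs))  # 4 base functions give (4 choose 2) = 6 functions g
```

Each g still projects onto k coordinates, but any two g's that share a base function are correlated, which is the "less randomness" the previous slide refers to.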


Saving time

Can save time since there are fewer “base” hash functions

E.g.: computing fingerprints
  Want to compute FP(gi(P)) for i=1..L
  FP(gi(P)) = (Σ_j P[j] * χ_j^i * 2^j) mod prime

Old way:
  Would take L * m time for L functions g

New way:
  Takes L^{1/2} * m time for L^{1/2} functions ui
  Need only L time to combine FP(u(P)) into FP(g(P))
  If g=<u1,u2>, then FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime
  Total: L + L^{1/2} * m
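
The saving can be illustrated with the Python sketch below. It follows the fingerprint formula on this slide, with a few assumptions labeled here and in the comments: the prime 2^61 - 1 is an arbitrary choice, characters are mapped to integers with `ord`, coordinates are 0-based, and a function is represented by its list of selected coordinates (so a coordinate selected by both halves of a pair is counted twice, which keeps the additive rule exact).

```python
from itertools import combinations

PRIME = (1 << 61) - 1  # a Mersenne prime; any large prime would do (assumption)


def fingerprint(P, coords):
    """(sum over selected coordinates j of P[j] * 2^j) mod PRIME."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME


def all_fingerprints(P, base):
    """Fingerprints of every pair g = <u_a, u_b>, built from the base fingerprints."""
    base_fp = [fingerprint(P, u) for u in base]      # one pass per base function
    return [(base_fp[a] + base_fp[b]) % PRIME        # O(1) work per g
            for a, b in combinations(range(len(base)), 2)]


P = "GAGTAACT"
base = [(0, 5), (2, 7), (1, 4), (3, 6)]  # hypothetical base functions u1..u4
fast = all_fingerprints(P, base)
slow = [fingerprint(P, u + v) for u, v in combinations(base, 2)]
print(fast == slow)  # True: combining base fingerprints matches the direct computation
```

Only the base fingerprints touch the pattern; each of the L combined fingerprints is then a single modular addition, which is where the L + L^{1/2} * m total comes from.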


Better query & preproc (2)

E.g., for query:
  Use fingerprints to leap faster in the trie
  Yields time n^{1/c} + n^{1/(2c)} * m (since L = n^{1/c})

To get n^{1/c} + n^{o(1)} * m, generalize:
  g = tuple of t base functions
  A base function = projection on k/t random coordinates

Other details similar to fingerprints
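
A rough count suggests why t-tuples help. This is a back-of-the-envelope sketch only: w denotes the number of base functions (notation from the earlier example), the bound on the binomial coefficient is the standard one, and the count ignores how the dependence between the g's affects the success probability, which a full analysis would have to handle.

```latex
\begin{align*}
\binom{w}{t} \;\ge\; \left(\frac{w}{t}\right)^{t} \;\ge\; L
  \quad &\text{whenever } w \ge t\, L^{1/t}, \\
\text{fingerprint work} \;\approx\; \underbrace{w \cdot m}_{\text{base functions}}
  + \underbrace{L \cdot t}_{\text{combining}}
  \;&=\; t\, L^{1/t}\, m + t\, L .
\end{align*}
```

For t = 2 this matches the ~L^{1/2} base functions of the pair construction; with L = n^{1/c} and t growing slowly with n, the factor t * L^{1/t} becomes n^{o(1)}.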


Better preprocessing (3)

Preprocessing: can get n^{1+1/c} + n^{1+o(1)} * M

Can get n^{1+1/c} + n^{1+o(1)} * M^{1/3}
  Can construct a trie in n * M^{1/3} time (instead of n * M)
  Using FFT, etc.


Outline

General approach
  View: Near Neighbor problem in Hamming metric
  Focus: reducing space

Background
  Locality-Sensitive Hashing (LSH)

Solution = LSH + Tries

Reducing query & preprocessing
  Redesign LSH

Concluding remarks


Conclusions

Problem:
  Substring Near Neighbor (a.k.a. text indexing with mismatches)

Approach:
  View as NN in m-dimensional Hamming
  Use LSH

Challenge:
  Variable-length pattern w/o degradation in performance

Solution:
  Space/query optimal (w.r.t. LSH)
  Preprocessing optimal (w.r.t. LSH) for c<3


Extensions

Extends to l1
  Nontrivial, since we need quite different LSH functions
  Preprocessing slightly worse: n^{1+1/c} + n^{1+o(1)} * M^{2/3}
  Using the “Less-than-matching” problem [Amir-Farach’95]


Remarks

Other approaches? Or, why LSH for SNN?
  Since better SNN ⇒ better NN…
  And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space)


Thanks!