Document Retrieval Problems


Page 1: Document Retrieval Problems

Document Retrieval Problems

S. Muthukrishnan

Page 2: Document Retrieval Problems

Storyline

• Zvi Galil gave a talk on the 13th on 13 open problems he posed 13 years ago in string matching …
  – Update on the status of open problems.

• Eric Allender invited me to give a string matching talk at Rutgers U.

• Gives me a chance to look through 30 years of history.

Fernand Braudel

History may be divided into three movements: what moves rapidly, what moves slowly, and what appears not to move at all.

Page 3: Document Retrieval Problems

The Key Problem

• Occurrence listing: given a set of documents D to be preprocessed, a query lists all the locations in the documents where a given pattern P occurs.

  D = { aabaa, abaaa, bc }, so d1 = aabaa, d2 = abaaa, d3 = bc
  P = aa
  O = { (1,1), (1,4), (2,3), (2,4) }

• Document listing: given a set of documents D to be preprocessed, a query lists all the documents in which a given pattern P occurs.

  D = { aabaa, abaaa, bc }, so d1 = aabaa, d2 = abaaa, d3 = bc
  P = aa
  O = { 1, 2 }
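As a reference point for the two definitions, here is a minimal brute-force sketch with no preprocessing. The function names are illustrative and not from the talk. Positions are 1-based, and overlapping occurrences are counted, as in the example above.

```python
def occurrence_listing(docs, pattern):
    """Return (document, position) pairs, both 1-based, for every occurrence."""
    out = []
    for d, text in enumerate(docs, start=1):
        pos = text.find(pattern)
        while pos != -1:
            out.append((d, pos + 1))
            # Restart one character later so overlapping matches are found.
            pos = text.find(pattern, pos + 1)
    return out


def document_listing(docs, pattern):
    """Return the 1-based numbers of the documents that contain the pattern."""
    return [d for d, text in enumerate(docs, start=1) if pattern in text]


D = ["aabaa", "abaaa", "bc"]
print(occurrence_listing(D, "aa"))
print(document_listing(D, "aa"))
```

Both functions scan every document on every query, which is the cost the preprocessing in the rest of the talk is meant to avoid.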

Muthu:

Use this problem to frame the discussion,


Page 4: Document Retrieval Problems

Occurrence Vs Document Listing

• Given n documents of total length N, occurrence listing can be solved with
  – O(N) preprocessing, and
  – O(m + output) time for a query pattern of size m.
  – Elegant 1973 paper by Weiner introduced suffix trees and solved this problem: optimal, output sensitive.

• No such optimal result for document listing.
  – O((m + output) log n) time query processing.
  – log n loglog n by fractional cascading.

muthu: assuming you don’t hastily give the answer without looking at the entire document or the pattern!


Page 5: Document Retrieval Problems

Other Document Listing Problems

• Find all documents that contain at least K occurrences of the given pattern. (mining)

• Find all documents that contain two occurrences of the pattern separated by at most distance d. (proximity repeat)

• Find all documents that do NOT contain the given pattern. (negative query)

• Find all documents that contain pattern P but not Q. (boolean query)

• Combinations thereof…
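To pin down what each query asks for, here is a brute-force sketch of the four variants. The function names are illustrative, and one assumption is made that the slide does not state: the "distance" in a proximity repeat is measured between the start positions of two occurrences.

```python
def positions(text, pattern):
    """1-based start positions of all (possibly overlapping) occurrences."""
    out, pos = [], text.find(pattern)
    while pos != -1:
        out.append(pos + 1)
        pos = text.find(pattern, pos + 1)
    return out


def mining(docs, pattern, k):
    """Documents with at least k occurrences of the pattern."""
    return [d for d, t in enumerate(docs, 1) if len(positions(t, pattern)) >= k]


def proximity_repeat(docs, pattern, dist):
    """Documents with two occurrences whose start positions differ by <= dist."""
    out = []
    for d, t in enumerate(docs, 1):
        occ = positions(t, pattern)
        # It is enough to compare neighbouring occurrences.
        if any(b - a <= dist for a, b in zip(occ, occ[1:])):
            out.append(d)
    return out


def negative(docs, pattern):
    """Documents that do NOT contain the pattern."""
    return [d for d, t in enumerate(docs, 1) if pattern not in t]


def boolean(docs, p, q):
    """Documents that contain p but not q."""
    return [d for d, t in enumerate(docs, 1) if p in t and q not in t]


D = ["aabaa", "abaaa", "bc"]
print(mining(D, "aa", 2), proximity_repeat(D, "aa", 1), negative(D, "aa"))
```

Each function rescans all documents per query, so these only serve as specifications to test faster structures against.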

Muthu:

Normally, negative queries are not selective, but they work within a selected subset or in conjunction with other patterns.


Page 6: Document Retrieval Problems

Nature of Document Retrieval Problems

• Document listing versions are natural.
  – Occurrence listing versions primarily studied in Computational Biology and Data Mining.

• No optimal algorithms previously known.
  – Bounds are off by factors of log n … n in the worst case depending on the problem.

• We will provide (near) optimal algorithms.
  – Optimal algorithm for key document listing problem.

Muthu:

Motivated the discussion with this problem,

It is also framed in history.


Theory following Practice? Inverted word index + variants, in IR.

Page 7: Document Retrieval Problems

Talk Overview

• Optimal algorithm for the document listing problem.
  – List all documents that contain the given pattern.

• Efficient algorithm for the document mining problem.
  – List all documents that contain at least K occurrences of the given pattern.

• Techniques.
  – Colored range query data structural problems.

Page 8: Document Retrieval Problems

Preamble: Occurrence Listing

• Construct a suffix tree (compressed trie) of all the documents.

D = { abaa, aabaa, bc }
S = { abaa#, baa#, aa#, a#, aabaa#, bc#, c# }

[Figure: suffix tree of S. Each leaf is labeled with the (document, position) pairs of its suffix: a# (1,4), (2,5); aa# (1,3), (2,4); aabaa# (2,1); abaa# (1,1), (2,2); baa# (1,2), (2,3); bc# (3,1); c# (3,2).]

http://commfaculty.fullerton.edu/lester/writings/1000_words.html

Page 9: Document Retrieval Problems

Preamble: Occurrence Listing

• Find all occurrences of pattern aa.
  – Trace down the path aa and report all the leaves [Weiner 73].

Input: D = { abaa, aabaa, bc }

Output: (1,3), (2,4), (2,1)

[Figure: the suffix tree of D, with the same edge and leaf labels as on the previous slide.]
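The "trace down and report the leaves" step can be imitated without a tree. The sketch below is a swapped-in simplification, not Weiner's construction: it sorts all suffixes naively and uses binary search to find the block of suffixes that start with the pattern, which corresponds to the leaves below the pattern's node. The function names are illustrative. Ties between equal suffixes are broken by document number, which is an assumption about the leaf order in the figure.

```python
from bisect import bisect_left


def build_suffix_list(docs):
    """Sorted list of (suffix, document, position), with 1-based numbering.

    Naive construction: it materialises every suffix, so it is far from the
    O(N) suffix tree construction the talk relies on.
    """
    suffixes = []
    for d, text in enumerate(docs, start=1):
        for i in range(len(text)):
            suffixes.append((text[i:], d, i + 1))
    suffixes.sort()
    return suffixes


def occurrences(suffixes, keys, pattern):
    """Report the leaves of all suffixes that start with the pattern."""
    i = bisect_left(keys, pattern)  # first suffix >= pattern
    out = []
    # Suffixes sharing the prefix are contiguous in sorted order.
    while i < len(keys) and keys[i].startswith(pattern):
        out.append((suffixes[i][1], suffixes[i][2]))
        i += 1
    return out


D = ["abaa", "aabaa", "bc"]
suffixes = build_suffix_list(D)
keys = [s for s, _, _ in suffixes]
print(occurrences(suffixes, keys, "aa"))
```

On this input the reported leaves come out in the same order as the slide's output, because plain string comparison orders a proper prefix first, just as the terminator # does.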

Page 10: Document Retrieval Problems

Document Listing

• Find all documents that contain pattern aa.
  – Trace down the path aa and report the distinct “colors” on leaves.

Input: D = { abaa, aabaa, bc }

Colors: 1, 2, 3

Output sought: 1, 2

[Figure: the suffix tree of D with each leaf labeled by its document numbers (colors): a# 1, 2; aa# 1, 2; aabaa# 2; abaa# 1, 2; baa# 1, 2; bc# 3; c# 3.]

Challenge: Avoid reporting duplicate colors.

muthu:

Use hot pink sparingly


Page 11: Document Retrieval Problems

Document Listing: Our Approach

[Figure: the suffix tree of D with colored leaves. Reading the leaf colors from left to right gives the array below.]

1 2 1 2 2 1 2 1 2 3 3

Colored range query: Return distinct colors in given range.

Mathematics is the art of giving the same name to different things. --- Jules Henri Poincare

Page 12: Document Retrieval Problems

Document Listing: Our Approach

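List distinct colors:

Position   1   2   3   4   5   6   7   8   9  10  11
Color      1   2   1   2   2   1   2   1   2   3   3

Chain each position to the previous position with the same color (-1 if there is none):

Position   1   2   3   4   5   6   7   8   9  10  11
Chain     -1  -1   1   2   4   3   5   6   7  -1  10

List numbers less than 3 (for a query range that starts at position 3). Colors do not matter anymore.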
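A small sketch of the chaining step, with illustrative function names. It builds the chain array from the color array and then answers a distinct-colors query by a plain scan of the range, keeping the positions whose chain value falls before the left end. For example, position 8 has color 1, and the previous color-1 position is 6, so its chain value is 6. The scan takes time proportional to the range length, not the output size.

```python
def chain_array(colors):
    """chain[i] = previous 1-based position with the same color, or -1."""
    last_seen = {}
    chain = []
    for pos, color in enumerate(colors, start=1):
        chain.append(last_seen.get(color, -1))
        last_seen[color] = pos
    return chain


def distinct_colors(colors, chain, l, r):
    """Distinct colors in positions l..r (1-based, inclusive).

    A position is the first occurrence of its color inside the range exactly
    when its chain value is smaller than l, so each color is reported once.
    """
    return [colors[i - 1] for i in range(l, r + 1) if chain[i - 1] < l]


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
chain = chain_array(colors)
print(chain)
print(distinct_colors(colors, chain, 3, 5))
```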

Page 13: Document Retrieval Problems

Document Listing: Our Approach

Position   1   2   3   4   5   6   7   8   9  10  11
A         -1  -1   1   2   4   3   5   6   7  -1  10

List numbers less than 3.

R = (l,r). Find all integers smaller than x in A[l,r]:

1. Perform rangemin(R) to determine i such that A[i] is smallest in A[l,r].

2. If A[i] is smaller than x, report A[i] and recurse on A[l,i-1] and A[i+1,r]. Otherwise stop, since no element of A[l,r] is smaller than x.

O(1) time per rangemin query, so O(output) time overall.
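The recursion can be sketched as follows. One substitution is made for brevity: rangemin is answered with a sparse table, which gives O(1) queries but needs O(N log N) preprocessing, whereas linear-preprocessing range-minimum structures also exist. The function names are illustrative. Positions in the code are 0-based, so the slide's range [3,5] becomes [2,4]. The recursion is unrolled with an explicit stack to avoid deep call chains.

```python
def build_sparse_table(a):
    """table[j][i] = index of the minimum of a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table


def range_min_index(a, table, l, r):
    """Index of a minimum of a[l..r] (inclusive), in O(1) time."""
    j = (r - l + 1).bit_length() - 1
    left, right = table[j][l], table[j][r - (1 << j) + 1]
    return left if a[left] <= a[right] else right


def list_smaller(a, table, l, r, x):
    """Indices i in [l, r] with a[i] < x, using O(1) work per reported index."""
    out, stack = [], [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        i = range_min_index(a, table, lo, hi)
        if a[i] < x:  # otherwise nothing in a[lo..hi] is smaller than x
            out.append(i)
            stack.append((lo, i - 1))
            stack.append((i + 1, hi))
    return out


A = [-1, -1, 1, 2, 4, 3, 5, 6, 7, -1, 10]
table = build_sparse_table(A)
print(sorted(list_smaller(A, table, 2, 4, 3)))
```

Every range that is examined either reports an index or ends its branch, so the number of rangemin calls is at most about twice the output size plus one.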

Page 14: Document Retrieval Problems

Document Listing: Summary

• Given a set of documents of total size N, the document listing problem can be solved in
  – O(N) time and space for preprocessing, and
  – O(m + output) time for a query of size m.
  – Uses Weiner’s O(N) time suffix tree construction.

• Overview of techniques
  – Reduce the problem to colored range searching.
  – “Chain” occurrences of suffixes from each document.

Necessity is not necessarily the mother of invention. Ruth Benedict in Patterns of Culture.

Muthu: Now, let us get started with fun stuff.


Page 15: Document Retrieval Problems

Document Mining

• Find all documents that contain at least K occurrences of given pattern.

Find colors that appear at least K times in this range.
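Stated on the leaf color array, the mining query is a colored range counting question. The brute-force sketch below only fixes the specification. The function name is illustrative, and the color array is assumed to be the leaf color array of D = { abaa, aabaa, bc } in sorted suffix order, where the suffixes starting with aa occupy positions 3 to 5.

```python
from collections import Counter


def colors_at_least_k(colors, l, r, k):
    """Colors that appear at least k times in positions l..r (1-based)."""
    counts = Counter(colors[l - 1:r])
    return sorted(c for c, n in counts.items() if n >= k)


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
# Positions 3..5 are the leaves below the pattern "aa".
print(colors_at_least_k(colors, 3, 5, 2))
```

This takes time proportional to the range length per query. The approaches on the following slides aim for O(m + output).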

Page 16: Document Retrieval Problems

Document Mining: First Approach

• Fix K.

Chain to the Kth occurrence of red to the left.

Given range [l,r], determine all numbers in A[l,r] that are less than l.

Does not work: output * K

Yesterday it worked
Today it is not working
Windows is like that.

Page 17: Document Retrieval Problems

Document Mining: Second Approach

• Given a set of colored intervals to be preprocessed, query is some interval I and we must determine the distinct colored intervals that are contained in I.

Chain to the Kth occurrence of red to the left. Replace by red intervals.

No optimal results known

Page 18: Document Retrieval Problems

Document Mining: Fixed K

Mark Least Common Ancestor (L,R) with red color.

[Figure: two red leaves L and R and their least common ancestor.]

Each query: Find the set of distinct colors in a subtree.

O(N) preprocessing, O(m + output) time per query.
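The marking idea can be tried out without building a tree, under a simplifying assumption: in a trie of suffixes, the least common ancestor of two leaves is the node spelled by their longest common prefix, so strings can stand in for nodes. The sketch below pairs each suffix of a document with the suffix K-1 places later in that document's sorted suffix order and stores their common prefix as a mark. The function names are illustrative. The query scans all marks linearly instead of running a colored query on a subtree, so it does not achieve the bounds stated above.

```python
from os.path import commonprefix


def build_marks(docs, k):
    """For a fixed k, return (lca_string, document) marks.

    For each document, leaf i is paired with leaf i + k - 1 in sorted order.
    Their longest common prefix plays the role of the marked LCA node.
    """
    marks = []
    for d, text in enumerate(docs, start=1):
        suffixes = sorted(text[i:] for i in range(len(text)))
        for i in range(len(suffixes) - k + 1):
            lca = commonprefix([suffixes[i], suffixes[i + k - 1]])
            marks.append((lca, d))
    return marks


def mine(marks, pattern):
    """Documents with a mark in the 'subtree' of the pattern.

    A mark lies in that subtree exactly when its string starts with the
    pattern. Collecting into a set removes duplicate colors.
    """
    return sorted({d for lca, d in marks if lca.startswith(pattern)})


D = ["abaa", "aabaa", "bc"]
marks = build_marks(D, 2)
print(mine(marks, "aa"))
```

The reason it works: the suffixes of one document that start with the pattern are contiguous in sorted order, so the document has at least K occurrences exactly when some pair of leaves K-1 apart both lie below the pattern.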

Page 19: Document Retrieval Problems

Document Mining: Variable K

• K is part of the query: o(NK) preprocessing?

[Figure: occurrences labeled 1, 2, 3, K, K+1, K+2 and 2K-1.]

• For a fixed K, all LCAs lie in paths separated by K occurrences.
• Suffices to keep the lowest in each path.

muthu: that deserves the hot pink.


Page 20: Document Retrieval Problems

Document Mining: Variable K

• For a fixed K, find the lowest LCA on each of the paths separated by O(K) occurrences of each document.

• Preprocessing time: binary searching paths.

• Query processing in O(m + output) time.

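[Equation: a bound written as a sum over values of K, with logarithmic factors in K and N. The formula is garbled in this transcript.]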

Page 21: Document Retrieval Problems

Summary

• Solving other document listing problems.
  – Optimal for negative query: list absent colors.
  – (Near) optimal for proximity repeats: structural properties of “gaps.”
  – Best known for two patterns: breaking the quadratic preprocessing bottleneck.

• Techniques: Chaining, Colored range queries (7+ such problems in the paper), Combinatorial structure.

Muthu: Solving these colored range searching problems is of independent interest….


muthu:

Hope that whetted your appetite for algorithmics.


Page 22: Document Retrieval Problems

Discussion

• “Non” local chaining?
  – Find documents in which no two occurrences of the pattern are within distance K. OPEN

• Try it in IPScope: Interactive Patents Analysis System.