Fast Indexes and Algorithms for Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
Strings as sets
s1 = “Main St. Maine”:
• ‘Main’ ‘St.’ ‘Maine’
• ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …
s2 = “Main St. Main”:
• ‘Main’ ‘St.’ ‘Main’
How similar are s1 and s2?
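The two decompositions of s1 listed on this slide (whitespace-separated words and overlapping 3-character substrings) can be sketched as follows. The function names `word_tokens` and `qgrams` are illustrative and not taken from the slides; the gram length of 3 is inferred from the examples shown.

```python
def word_tokens(s):
    """Split a string on whitespace into word tokens (duplicates are kept)."""
    return s.split()


def qgrams(s, q=3):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

print(word_tokens(s1))   # word-level view of s1
print(qgrams(s1)[:7])    # first seven 3-grams of s1
print(word_tokens(s2))   # 'Main' occurs twice in s2
```

Either token list can then be treated as a (multi)set, which is the representation the rest of the deck works with.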
TF/IDF weighted similarity
Inverse Document Frequency (idf):
• ‘Main’ is common
• ‘Maine’ is not
• $\mathrm{idf}(t) = \log_2[1 + N / \mathrm{df}(t)]$
Term Frequency (tf):
• ‘Main’ appears twice in s2
Similarity:
• Inner Product
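As a small illustration of the idf formula on this slide, the sketch below computes idf values over a made-up four-string collection, taking N as the number of sets and df(t) as the number of sets that contain token t (the usual reading of these symbols, assumed here). The collection and the name `idf_table` are hypothetical.

```python
import math
from collections import Counter


def idf_table(sets):
    """Map each token t to log2(1 + N / df(t)) over a collection of token lists."""
    n = len(sets)
    # df counts each token at most once per set, hence set(s).
    df = Counter(t for s in sets for t in set(s))
    return {t: math.log2(1 + n / df[t]) for t in df}


# Hypothetical toy collection, not data from the talk.
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
    ["Elm", "St."],
]
idf = idf_table(collection)
tf_s2 = Counter(collection[1])  # term frequencies of the second string

print(idf["Main"], idf["Maine"])
print(tf_s2["Main"])
```

In this toy collection ‘Main’ appears in three of four sets and ‘Maine’ in one, so ‘Maine’ receives the larger idf, matching the intuition on the slide.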
Is TF important?
Information retrieval:
• Given a query string, retrieve relevant documents
Relational databases:
• Given a query string, retrieve relevant strings
In practice TF is small in many applications
IDF similarity
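Query $q = \{t_1, \dots, t_n\}$
Set $s = \{r_1, \dots, r_m\}$
Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
$I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 \,/\, \bigl(\mathrm{len}(s)\,\mathrm{len}(q)\bigr)$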
IDF is as good as TF/IDF in practice!
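A minimal sketch of the IDF similarity defined on this slide, assuming tokens are deduplicated (TF is ignored) and that an idf value is available for every token. The idf weights below are invented for the example, and the function names are illustrative.

```python
import math


def length(s, idf):
    """len(s): square root of the sum of squared idf values of the tokens in s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of the shared tokens, normalized by both lengths."""
    common = set(q) & set(s)
    return sum(idf[t] ** 2 for t in common) / (length(s, idf) * length(q, idf))


# Hypothetical idf weights, chosen only to keep the arithmetic simple.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}
q = ["Main", "St.", "Maine"]
s = ["Main", "St."]

print(idf_similarity(q, q, idf))
print(idf_similarity(q, s, idf))
```

A set compared with itself scores 1, and dropping the rare token ‘Maine’ costs far more similarity than dropping a common one would.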
How can I build an index?
Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
So
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token
Sort lists by string id: Do a merge join
Sort lists by w: Run TA/NRA
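The sort-by-id option can be sketched as below: one inverted list per token holding (set id, w(t, s)) pairs in id order, and a multiway merge that sums w(t, s) · w(t, q) per id. This is an illustrative sketch rather than the authors' implementation; `build_index`, `merge_join`, the toy sets, and the idf weights are all assumptions, and every query token is assumed to have an idf value.

```python
import heapq
import math
from collections import defaultdict
from itertools import groupby


def build_index(sets, idf):
    """One inverted list per token, holding (set id, w(t, s)) in id order."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        tokens = set(s)
        s_len = math.sqrt(sum(idf[t] ** 2 for t in tokens))
        for t in tokens:
            index[t].append((sid, idf[t] / s_len))
    return index  # ids are appended in increasing order, so lists are id-sorted


def merge_join(q, index, idf, tau):
    """Merge the query's id-sorted lists and keep sets with I(q, s) >= tau."""
    q_tokens = set(q)
    q_len = math.sqrt(sum(idf[t] ** 2 for t in q_tokens))
    # Pre-multiply each entry by w(t, q) so products can be summed per set id.
    streams = [
        [(sid, w * idf[t] / q_len) for sid, w in index.get(t, [])]
        for t in q_tokens
    ]
    results = []
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(product for _, product in group)
        if score >= tau:
            results.append((sid, score))
    return results


# Hypothetical data for illustration only.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0}
sets = [["Main", "St.", "Maine"], ["Main", "St."], ["Elm", "St."]]
index = build_index(sets, idf)
matches = merge_join(["Main", "St.", "Maine"], index, idf, tau=0.5)
print(matches)
```

The merge touches every entry of every query list, which is the cost the weight-sorted (TA/NRA) alternatives try to avoid.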
Example: Sort by id
Example: Sort by w
NRA:
• Round-robin list accesses
• Main memory hash table
• Computes lower and upper bounds per entry
Semantic properties of IDF
Order Preservation:
• For all $t_1 \neq t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$
Length Boundedness:
• Query $q$, set $s$, threshold $\tau$
– $I(q, s) \geq \tau \Rightarrow \tau\,\mathrm{len}(q) \leq \mathrm{len}(s) \leq \mathrm{len}(q) / \tau$
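Both properties lend themselves to a short sketch. Because w(t, s) = idf(t) / len(s), ordering a list by decreasing w is the same as ordering it by increasing len(s), so every list shares one order; Length Boundedness then confines qualifying sets to a contiguous window of that order. The code below locates the window with binary search, writing the threshold as `tau`. The helper names and the sample lengths are hypothetical.

```python
import bisect


def length_window(q_len, tau):
    """Lengths a set must have for I(q, s) >= tau to be possible."""
    return tau * q_len, q_len / tau


def window_positions(sorted_lengths, q_len, tau):
    """Half-open index range [start, stop) of entries inside the length window."""
    lo, hi = length_window(q_len, tau)
    start = bisect.bisect_left(sorted_lengths, lo)
    stop = bisect.bisect_right(sorted_lengths, hi)
    return start, stop


# Hypothetical set lengths of one inverted list, in increasing order.
lengths = [0.5, 1.0, 1.5, 2.0, 3.0, 4.5, 6.0]
start, stop = window_positions(lengths, q_len=2.0, tau=0.5)
print(lengths[start:stop])
```

Entries outside the returned range cannot reach the threshold and never need to be read, and the window narrows as the threshold approaches 1.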
Improved NRA
Order Preservation determines if a given set appears in a list or not
• ti: encounter s1, then s2
• tk: encounter s2 first, so s1 does not appear in tk
Length Boundedness restricts the search to a small portion of each list
Something surprising
Lemma: NRA can read arbitrarily more elements than iNRA
Lemma: NRA can read arbitrarily more elements than any algorithm that uses the Length Boundedness property
Any other strategies?
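NRA style is breadth-first
Try depth-first:
• Sort query lists in decreasing idf order
– Let $q = \{t_1, \dots, t_n\}$ and $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \dots > \mathrm{idf}(t_n)$
• Let $\lambda_i$ be the maximum length a set $s$ in $t_i$ can have s.t. $I(q, s) \geq \tau$ for threshold $\tau$, assuming that $s$ appears in every list $t_k$ with $k > i$
– $\lambda_i = \sum_{i \leq k \leq n} \mathrm{idf}(t_k)^2 \,/\, \bigl(\tau\,\mathrm{len}(q)\bigr)$
• $\lambda_i$ is a natural cutoff point
• $\lambda_1 > \lambda_2 > \dots > \lambda_n$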
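The per-list cutoffs described above can be computed with one backward pass over the query's idf values: the cutoff for list i is the suffix sum of squared idf values from position i onward, divided by the threshold times len(q). In this sketch `tau` stands for the threshold; the function name `cutoffs` and the sample idf values are illustrative.

```python
import math


def cutoffs(query_idfs, tau):
    """Cutoff length per query list; query_idfs must be in decreasing order."""
    q_len = math.sqrt(sum(v ** 2 for v in query_idfs))
    result = []
    suffix = 0.0
    # Walk from the lowest-idf list backwards, accumulating the suffix sum.
    for v in reversed(query_idfs):
        suffix += v ** 2
        result.append(suffix / (tau * q_len))
    result.reverse()
    return result


# Hypothetical idf values of a three-token query.
lams = cutoffs([3.0, 2.0, 1.0], tau=0.5)
print(lams)
```

The first cutoff equals len(q) divided by the threshold, which is the Length Boundedness upper bound, and each later cutoff is strictly smaller because the suffix sum loses a positive term.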
Shortest-First
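Sort $q = \{t_1, \dots, t_n\}$ in decreasing idf order
Let candidate set $C$
For $1 \leq i \leq n$
• Skip to first entry with $\mathrm{len}(s) \geq \tau\,\mathrm{len}(q)$, for threshold $\tau$
• Compute the cutoff $\lambda_i$
• Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
• Repeat
– $s$ = pop next element from $t_i$
– Maintain lower/upper bounds of entries in $C$
• Until $\mathrm{len}(s) > \max(\max_{c \in C} \mathrm{len}(c), \lambda_i)$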
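The loop above can be sketched in Python under some simplifying assumptions: each inverted list stores (len(s), set id) pairs in increasing length order, the threshold is `tau`, and only the accumulated partial score of each candidate is tracked, so the upper-bound pruning mentioned on the slide is left out. The name `shortest_first` and the toy data are hypothetical; this is not the authors' code.

```python
import bisect
import math


def shortest_first(q_tokens, lists, idf, tau):
    """Return {set id: I(q, s)} for all sets with I(q, s) >= tau."""
    tokens = sorted(set(q_tokens), key=lambda t: (-idf[t], t))  # decreasing idf
    q_len = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    suffix = sum(idf[t] ** 2 for t in tokens)  # squared idf mass of lists i..n

    partial = {}   # candidate set C: id -> accumulated sum of idf(t)^2
    cand_len = {}  # id -> len(s)
    for t in tokens:
        lam = min(suffix / (tau * q_len), q_len / tau)  # cutoff for this list
        limit = max(max(cand_len.values(), default=0.0), lam)
        entries = lists.get(t, [])
        # Skip to the first entry with len(s) >= tau * len(q).
        pos = bisect.bisect_left(entries, (tau * q_len,))
        while pos < len(entries):
            s_len, sid = entries[pos]
            if s_len > limit:
                break  # no candidate, old or new, can lie deeper in this list
            if sid in partial:
                partial[sid] += idf[t] ** 2
            elif s_len <= lam:
                partial[sid] = idf[t] ** 2  # short enough to be a new candidate
                cand_len[sid] = s_len
            pos += 1
        suffix -= idf[t] ** 2

    scores = {sid: p / (cand_len[sid] * q_len) for sid, p in partial.items()}
    return {sid: sc for sid, sc in scores.items() if sc >= tau}


# Hypothetical data for illustration only.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0}
sets = {0: {"Main", "St.", "Maine"}, 1: {"Main", "St."}, 2: {"Elm", "St."}}
lists = {}
for sid, s in sets.items():
    s_len = math.sqrt(sum(idf[t] ** 2 for t in s))
    for t in s:
        lists.setdefault(t, []).append((s_len, sid))
for entries in lists.values():
    entries.sort()

result = shortest_first(["Main", "St.", "Maine"], lists, idf, tau=0.5)
print(result)
```

In this sketch a set first seen in list i is admitted only if its length is at most the cutoff for that list; longer sets are skipped because they could not reach the threshold even if they appeared in every remaining list.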
Comparison with NRA
Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case, iNRA reads (d – 1)(n – 1) more elements than SF
But surprisingly
A hybrid strategy
Run iNRA normally
Use the cutoff $\lambda_i$ and $\max_{c \in C} \mathrm{len}(c)$ to stop reading from a particular list
• This guarantees that iNRA stops with or before SF
Drawback of NRA variants:
• Very high bookkeeping cost compared to SF
Experiments
DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based
Wall-clock time vs. Threshold
Wall-clock time vs. Query size
[Plots not reproduced; legend entries: TA, NRA, Sort-by-id, iTA, SF]
Space
Conclusion
Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient algorithms
SF works best overall in practice
• Achieves sub-second answers in most practical cases
Q&A
Pruning power vs. Threshold
Pruning power vs. Query size
[Plots not reproduced; legend entries: NRA, TA, iTA]