Text Joins in an RDBMS for Web Data Integration
Luis Gravano
Panagiotis G. Ipeirotis
Columbia University
Nick Koudas
Divesh Srivastava
AT&T Labs - Research
04/19/23 Columbia University 2
Why Text Joins?
Problem:
Same entity has multiple textual representations.
Web Service A
EUROAFT CORP
HATRONIC INC
…
Web Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP
…
Matching Text Attributes
Many desirable properties:
Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP
Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport
Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.
…and combinations thereof
Need for a similarity metric!
Matching Text Attributes using Edit Distance
Edit Distance: Character insertions, deletions, and modifications to transform one string to the other
EUROAFT CORP - EURODRAFT CORP → 2
COMPUTER SCI. - COMPUTER → 3
KIA INTERNATIONAL - KIA → 13
Good for: spelling errors, short word insertions and deletions
Problems: word order variations, long word insertions and deletions
“Approximate String Joins” – VLDB 2001
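As a quick sketch, the metric above (standard Levenshtein distance) can be computed with dynamic programming:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute cs -> ct
        prev = cur
    return prev[-1]

print(edit_distance("EUROAFT CORP", "EURODRAFT CORP"))  # 2 (insert D, R)
```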
Matching Text Attributes using Cosine Similarity
Similar entries should share “infrequent” tokens:
EUROAFT CORP ≈ EUROAFT INC (EUROAFT: infrequent token, high weight)
EUROAFT CORP ≠ HATRONIC CORP (CORP: common token, low weight)
Different token choices result in similarity metrics with different properties.
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
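A minimal sketch of this similarity on token-weight vectors; the weight values below are illustrative numbers in the spirit of the slides, not real data:

```python
def cosine_similarity(w1, w2):
    """Cosine similarity of two token-weight dicts, assuming each
    weight vector is already normalized to unit length."""
    return sum(w * w2[tok] for tok, w in w1.items() if tok in w2)

# Illustrative weights: EUROAFT/HATRONIC are infrequent (high weight),
# CORP/INC are common (low weight).
euroaft_corp = {"EUROAFT": 0.98, "CORP": 0.02}
euroaft_inc = {"EUROAFT": 0.95, "INC": 0.05}
hatronic_corp = {"HATRONIC": 0.98, "CORP": 0.02}

print(cosine_similarity(euroaft_corp, euroaft_inc))    # high: shares the rare token
print(cosine_similarity(euroaft_corp, hatronic_corp))  # low: shares only the common token
```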
Using Words and Cosine Similarity
Using words as tokens:
Split each entry into words
Similar entries share infrequent words (infrequent tokens get high weight, common tokens low weight):
EUROAFT CORP ≈ EUROAFT INC
EUROAFT CORP ≠ HATRONIC CORP
Good for word order variations and common word insertions/deletions:
Computer Science Dept. ~ Dept. of Computer Science
Problems with misspellings:
Biotechnology Department ≠ Bioteknology Dept.
“WHIRL” – W.Cohen, SIGMOD’98
Using q-grams and Cosine Similarity
Using q-grams as tokens:
Split each string into small substrings of length q (q-grams)
Similar entries share many, infrequent q-grams
Biotechnology Department
Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent
Bioteknology Department
Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, ..., tme, men, ent
Naturally handles misspellings, word order variations, and insertions and deletions of common or short words
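A minimal q-gram tokenizer (unpadded, as in the slide's example; some q-gram schemes also pad string ends with special characters):

```python
def qgrams(s, q=3):
    """All overlapping substrings of s of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# Misspelled variants still share most of their q-grams.
a = set(qgrams("Biotechnology Department"))
b = set(qgrams("Bioteknology Department"))
shared = a & b
```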
Problem
Problem that we address:
Given two relations, report all pairs with cosine similarity above threshold φ
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
For two entries t1, t2: 0 ≤ Similarity ≤ 1
Computing Text Joins in an RDBMS

R1:
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R2:
tid  Name
1    HATRONIC CORP
2    EUROAFT INC
3    EUROAFT CORP
…

Desired join result:
R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   1.00
EUROAFT CORP  HATRONIC CORP  0.01
HATRONIC INC  HATRONIC CORP  0.98
HATRONIC INC  EUROAFT INC    0.02

Create in SQL relations RiWeights (token weights from Ri):

R1Weights:
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01
…

R2Weights:
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03
…

Compute the similarity of each tuple pair:
Computes similarity for many useless pairs. Expensive operation!
SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight*r2w.weight) ≥ φ
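As a sketch of this baseline, the toy example below builds the RiWeights relations with an illustrative tf.idf-style weighting normalized to unit length (the exact weighting formula here is an assumption, not taken from the slides) and runs the GROUP BY/HAVING join in SQLite:

```python
import math
import sqlite3
from collections import Counter

# Toy relations mirroring the slides.
R1 = {1: "EUROAFT CORP", 2: "HATRONIC INC"}
R2 = {1: "HATRONIC CORP", 2: "EUROAFT INC", 3: "EUROAFT CORP"}

def weight_rows(rel):
    """(tid, token, weight) rows: tf.idf-style weights, normalized so
    each tuple's weight vector has unit length (illustrative formula)."""
    docs = {tid: name.split() for tid, name in rel.items()}
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    rows = []
    for tid, toks in docs.items():
        tf = Counter(toks)
        w = {tok: tf[tok] * math.log(1 + n / df[tok]) for tok in tf}
        norm = math.sqrt(sum(v * v for v in w.values()))
        rows.extend((tid, tok, v / norm) for tok, v in w.items())
    return rows

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R1Weights(tid INT, token TEXT, weight REAL)")
con.execute("CREATE TABLE R2Weights(tid INT, token TEXT, weight REAL)")
con.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", weight_rows(R1))
con.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", weight_rows(R2))

phi = 0.5  # similarity threshold
pairs = con.execute("""
    SELECT r1w.tid AS tid1, r2w.tid AS tid2,
           SUM(r1w.weight * r2w.weight) AS sim
    FROM R1Weights r1w, R2Weights r2w
    WHERE r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING SUM(r1w.weight * r2w.weight) >= ?
""", (phi,)).fetchall()
```

Note that the GROUP BY still touches every pair of tuples sharing any token, which is exactly the expense the sampling step addresses.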
Sampling Step for Text Joins
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
Similarity is a sum of products.
Products cannot be high when a weight is small.
Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS).

RiWeights:
Token     W
EUROAFT   0.9144
HATRONIC  0.8419
…
CORP      0.01247
INC       0.00504

→ Sampling 20 times →

RiSample:
Token     #TIMES SAMPLED
EUROAFT   18 (18/20 = 0.90)
HATRONIC  17 (17/20 = 0.85)

Eliminates low-similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
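The intuition can be sketched as per-token coin flipping: for each token of a tuple, flip S coins with success probability equal to the token's weight, and keep only tokens with a nonzero count. This is a simplification for illustration; the actual sampling scheme follows [Cohen & Lewis, SODA97].

```python
import random

def sample_counts(token_weights, S, rng=None):
    """For each token, count successes in S independent coin flips with
    success probability equal to the token's weight; tokens that are
    never sampled are dropped entirely."""
    rng = rng or random.Random(42)
    out = {}
    for tok, w in token_weights.items():
        c = sum(rng.random() < w for _ in range(S))
        if c:
            out[tok] = c
    return out

# Weights from the slide: high-weight tokens survive with counts near
# S * weight; low-weight tokens (CORP, INC) are usually dropped,
# shrinking the input to the join.
weights = {"EUROAFT": 0.9144, "HATRONIC": 0.8419,
           "CORP": 0.01247, "INC": 0.00504}
sample = sample_counts(weights, S=20)
```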
Sampling-Based Text Joins in SQL

R1:
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R1Weights:
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01
…

R2Weights (sampled into R2Sample):
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03

Approximate join result:
R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   0.9
HATRONIC INC  HATRONIC CORP  0.98

Fully implemented in pure SQL!
SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight*r2sum.total*r2s.c) ≥ S*φ
Experimental Setup
[Figure: number of tuple pairs (log scale, 1 to 100,000,000) vs. similarity (0.1 to 0.9), for q-grams with q=2, q-grams with q=3, and words]
40,000 entries from AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries)
Tokenizations: words; q-grams with q = 2 and q = 3
Methods compared: variations of sample-based joins; baseline in SQL; WHIRL [SIGMOD98], adapted for handling q-grams
Metrics
Execute the (approximate) join for similarity > φ
Precision (measures accuracy): fraction of the pairs in the answer with real similarity > φ
Recall (measures completeness): fraction of the pairs with real similarity > φ that are also in the answer
Execution time
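These two metrics can be sketched directly on sets of reported pairs vs. truly-qualifying pairs (the pair IDs below are hypothetical):

```python
def precision_recall(reported, actual):
    """reported: set of pairs returned by the approximate join;
    actual: set of pairs whose real similarity exceeds the threshold."""
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall

# Hypothetical example: the approximate join misses one true pair
# and reports one spurious pair.
reported = {(1, 3), (2, 1), (4, 4)}
actual = {(1, 3), (2, 1), (2, 2)}
p, r = precision_recall(reported, actual)  # p = 2/3, r = 2/3
```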
Comparing WHIRL and Sample-based Joins
[Figure: recall (left) and precision (right) vs. similarity threshold (0 to 1), for R1R2, sR1R2, R1sR2, sR1sR2, and WHIRL]
Sample-based joins: good recall across similarity thresholds
WHIRL: very low recall (almost 0 for thresholds below 0.7)
Changing Sample Size
[Figure: recall (left) and precision (right) vs. similarity threshold (0 to 1), for sample sizes S = 2, 32, 64, 128]
Increased sample size → Better recall, precision
Drawback: Increased execution time
Execution Time
[Figure: execution time in seconds (log scale, 0.1 to 10,000) vs. sample size (S = 1 to S = 256), for R1R2, WHIRL, sR1R2, and sR1sR2]
WHIRL and sample-based text joins break even at sample sizes around S = 64 to 128
Contributions
“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing
Different token choices:
Words: captures word swaps, deletion of common words
Q-grams: all of the above, plus spelling mistakes, but slower
SQL statements tested in MS SQL Server and available for download at:
http://www.cs.columbia.edu/~pirot/DataCleaning/
Questions?