Text Joins in an RDBMS for Web Data Integration
Luis Gravano
Panagiotis G. Ipeirotis
Columbia University
Nick Koudas
Divesh Srivastava
AT&T Labs - Research
04/19/23 Columbia University 2
Why Text Joins?
Problem:
Same entity has multiple textual representations.
Web Service A
EUROAFT CORP
HATRONIC INC
…
Web Service B
HATRONIC CORP
EUROAFT INC
EUROAFT CORP
…
Matching Text Attributes
Many desirable properties:
Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP
Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport
Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.
…and combinations thereof
Need for a similarity metric!
Matching Text Attributes using Edit Distance
Edit Distance: Character insertions, deletions, and modifications to transform one string to the other
EUROAFT CORP - EURODRAFT CORP → 2
COMPUTER SCI. - COMPUTER → 3
KIA INTERNATIONAL - KIA → 13
Good for: spelling errors, short word insertions and deletions
Problems: word order variations, long word insertions and deletions
“Approximate String Joins” – VLDB 2001
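As a quick sketch, the metric above (standard Levenshtein distance) can be computed with dynamic programming:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute cs -> ct
        prev = cur
    return prev[-1]

print(edit_distance("EUROAFT CORP", "EURODRAFT CORP"))  # 2 (insert D, R)
```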
Matching Text Attributes using Cosine Similarity
Similar entries should share “infrequent” tokens:
EUROAFT CORP ≈ EUROAFT INC (EUROAFT: infrequent token, high weight)
EUROAFT CORP ≠ HATRONIC CORP (CORP: common token, low weight)
Different token choices result in similarity metrics with different properties.
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
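A minimal sketch of this similarity on token-weight vectors; the weight values below are illustrative numbers in the spirit of the slides, not real data:

```python
def cosine_similarity(w1, w2):
    """Cosine similarity of two token-weight dicts, assuming each
    weight vector is already normalized to unit length."""
    return sum(w * w2[tok] for tok, w in w1.items() if tok in w2)

# Illustrative weights: EUROAFT/HATRONIC are infrequent (high weight),
# CORP/INC are common (low weight).
euroaft_corp = {"EUROAFT": 0.98, "CORP": 0.02}
euroaft_inc = {"EUROAFT": 0.95, "INC": 0.05}
hatronic_corp = {"HATRONIC": 0.98, "CORP": 0.02}

print(cosine_similarity(euroaft_corp, euroaft_inc))    # high: shares the rare token
print(cosine_similarity(euroaft_corp, hatronic_corp))  # low: shares only the common token
```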
Using Words and Cosine Similarity
Using words as tokens:
Split each entry into words
Similar entries share infrequent words (infrequent tokens get high weight, common tokens low weight):
EUROAFT CORP ≈ EUROAFT INC
EUROAFT CORP ≠ HATRONIC CORP
Good for word order variations and common word insertions/deletions:
Computer Science Dept. ~ Dept. of Computer Science
Problems with misspellings:
Biotechnology Department ≠ Bioteknology Dept.
“WHIRL” – W.Cohen, SIGMOD’98
Using q-grams and Cosine Similarity
Using q-grams as tokens:
Split each string into small substrings of length q (q-grams)
Similar entries share many, infrequent q-grams
Biotechnology Department
Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent
Bioteknology Department
Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, ..., tme, men, ent
Naturally handles misspellings, word order variations, and insertions and deletions of common or short words
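A minimal q-gram tokenizer (unpadded, as in the slide's example; some q-gram schemes also pad string ends with special characters):

```python
def qgrams(s, q=3):
    """All overlapping substrings of s of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# Misspelled variants still share most of their q-grams.
a = set(qgrams("Biotechnology Department"))
b = set(qgrams("Bioteknology Department"))
shared = a & b
```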
Problem
Problem that we address:
Given two relations, report all pairs with cosine similarity above threshold φ
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
For two entries t1, t2: 0 ≤ Similarity ≤ 1
Computing Text Joins in an RDBMS

R1:
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R2:
tid  Name
1    HATRONIC CORP
2    EUROAFT INC
3    EUROAFT CORP
…

Desired join result:
R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   1.00
EUROAFT CORP  HATRONIC CORP  0.01
HATRONIC INC  HATRONIC CORP  0.98
HATRONIC INC  EUROAFT INC    0.02

Create in SQL relations RiWeights (token weights from Ri):

R1Weights:
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01
…

R2Weights:
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03
…

Compute the similarity of each tuple pair:
Computes similarity for many useless pairs. Expensive operation!
SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight*r2w.weight) ≥ φ
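As a sketch of this baseline, the toy example below builds the RiWeights relations with an illustrative tf.idf-style weighting normalized to unit length (the exact weighting formula here is an assumption, not taken from the slides) and runs the GROUP BY/HAVING join in SQLite:

```python
import math
import sqlite3
from collections import Counter

# Toy relations mirroring the slides.
R1 = {1: "EUROAFT CORP", 2: "HATRONIC INC"}
R2 = {1: "HATRONIC CORP", 2: "EUROAFT INC", 3: "EUROAFT CORP"}

def weight_rows(rel):
    """(tid, token, weight) rows: tf.idf-style weights, normalized so
    each tuple's weight vector has unit length (illustrative formula)."""
    docs = {tid: name.split() for tid, name in rel.items()}
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    rows = []
    for tid, toks in docs.items():
        tf = Counter(toks)
        w = {tok: tf[tok] * math.log(1 + n / df[tok]) for tok in tf}
        norm = math.sqrt(sum(v * v for v in w.values()))
        rows.extend((tid, tok, v / norm) for tok, v in w.items())
    return rows

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R1Weights(tid INT, token TEXT, weight REAL)")
con.execute("CREATE TABLE R2Weights(tid INT, token TEXT, weight REAL)")
con.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", weight_rows(R1))
con.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", weight_rows(R2))

phi = 0.5  # similarity threshold
pairs = con.execute("""
    SELECT r1w.tid AS tid1, r2w.tid AS tid2,
           SUM(r1w.weight * r2w.weight) AS sim
    FROM R1Weights r1w, R2Weights r2w
    WHERE r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING SUM(r1w.weight * r2w.weight) >= ?
""", (phi,)).fetchall()
```

Note that the GROUP BY still touches every pair of tuples sharing any token, which is exactly the expense the sampling step addresses.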
Sampling Step for Text Joins
Similarity(t1, t2) = Σ_token weight(token, t1) · weight(token, t2)
Similarity is a sum of products.
Products cannot be high when a weight is small.
Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS).

RiWeights:
Token     W
EUROAFT   0.9144
HATRONIC  0.8419
…
CORP      0.01247
INC       0.00504

→ Sampling 20 times →

RiSample:
Token     #TIMES SAMPLED
EUROAFT   18 (18/20 = 0.90)
HATRONIC  17 (17/20 = 0.85)

Eliminates low-similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
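The intuition can be sketched as per-token coin flipping: for each token of a tuple, flip S coins with success probability equal to the token's weight, and keep only tokens with a nonzero count. This is a simplification for illustration; the actual sampling scheme follows [Cohen & Lewis, SODA97].

```python
import random

def sample_counts(token_weights, S, rng=None):
    """For each token, count successes in S independent coin flips with
    success probability equal to the token's weight; tokens that are
    never sampled are dropped entirely."""
    rng = rng or random.Random(42)
    out = {}
    for tok, w in token_weights.items():
        c = sum(rng.random() < w for _ in range(S))
        if c:
            out[tok] = c
    return out

# Weights from the slide: high-weight tokens survive with counts near
# S * weight; low-weight tokens (CORP, INC) are usually dropped,
# shrinking the input to the join.
weights = {"EUROAFT": 0.9144, "HATRONIC": 0.8419,
           "CORP": 0.01247, "INC": 0.00504}
sample = sample_counts(weights, S=20)
```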
Sampling-Based Text Joins in SQL

R1:
tid  Name
1    EUROAFT CORP
2    HATRONIC INC
…

R1Weights:
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01
…

R2Weights (sampled into R2Sample):
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03

Approximate join result:
R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   0.9
HATRONIC INC  HATRONIC CORP  0.98

Fully implemented in pure SQL!
SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight*r2sum.total*r2s.c) ≥ S*φ
Experimental Setup
[Figure: number of tuple pairs (log scale, 1 to 100,000,000) vs. similarity (0.1 to 0.9), for q-grams with q=2, q-grams with q=3, and words]
40,000 entries from AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries)
Tokenizations: words; q-grams with q = 2 and q = 3
Methods compared: variations of sample-based joins; baseline in SQL; WHIRL [SIGMOD98], adapted for handling q-grams
Metrics
Execute the (approximate) join for similarity > φ
Precision (measures accuracy): fraction of the pairs in the answer with real similarity > φ
Recall (measures completeness): fraction of the pairs with real similarity > φ that are also in the answer
Execution time
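These two metrics can be sketched directly on sets of reported pairs vs. truly-qualifying pairs (the pair IDs below are hypothetical):

```python
def precision_recall(reported, actual):
    """reported: set of pairs returned by the approximate join;
    actual: set of pairs whose real similarity exceeds the threshold."""
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall

# Hypothetical example: the approximate join misses one true pair
# and reports one spurious pair.
reported = {(1, 3), (2, 1), (4, 4)}
actual = {(1, 3), (2, 1), (2, 2)}
p, r = precision_recall(reported, actual)  # p = 2/3, r = 2/3
```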
Comparing WHIRL and Sample-based Joins
[Figure: recall (left) and precision (right) vs. similarity threshold (0 to 1), for R1R2, sR1R2, R1sR2, sR1sR2, and WHIRL]
Sample-based joins: good recall across similarity thresholds
WHIRL: very low recall (almost 0 for thresholds below 0.7)
Changing Sample Size
[Figure: recall (left) and precision (right) vs. similarity threshold (0 to 1), for sample sizes S = 2, 32, 64, 128]
Increased sample size → Better recall, precision
Drawback: Increased execution time
Execution Time
[Figure: execution time in seconds (log scale, 0.1 to 10,000) vs. sample size (S = 1 to S = 256), for R1R2, WHIRL, sR1R2, and sR1sR2]
WHIRL and sample-based text joins break even at sample sizes around S = 64 to 128
Contributions
“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing
Different token choices:
Words: captures word swaps, deletion of common words
Q-grams: all of the above, plus spelling mistakes, but slower
SQL statements tested in MS SQL Server and available for download at:
http://www.cs.columbia.edu/~pirot/DataCleaning/
Questions?