Text Joins in an RDBMS for Web Data Integration


Page 1: Text Joins in an RDBMS  for Web Data Integration

Text Joins in an RDBMS for Web Data Integration

Luis Gravano

Panagiotis G. Ipeirotis

Columbia University

Nick Koudas

Divesh Srivastava

AT&T Labs - Research

Page 2: Text Joins in an RDBMS  for Web Data Integration

04/19/23 Columbia University 2

Why Text Joins?

Problem: Same entity has multiple textual representations.

Web Service A: EUROAFT CORP, HATRONIC INC

Web Service B: HATRONIC CORP, EUROAFT INC, EUROAFT CORP

Page 3: Text Joins in an RDBMS  for Web Data Integration


Matching Text Attributes

Many desirable properties:

Match entries with typing mistakes: Microsoft Windpws XP vs. Microsoft Windows XP

Match entries with abbreviated information: Zurich International Airport vs. Zurich Intl. Airport

Match entries with different formatting conventions: Dept. of Computer Science vs. Computer Science Dept.

…and combinations thereof

Need for a similarity metric!

Page 4: Text Joins in an RDBMS  for Web Data Integration


Matching Text Attributes using Edit Distance

Edit Distance: Number of character insertions, deletions, and modifications needed to transform one string into the other

EUROAFT CORP - EURODRAFT CORP → 2

COMPUTER SCI. - COMPUTER → 3

KIA INTERNATIONAL - KIA → 13

Good for: spelling errors, short word insertions and deletions

Problems: word order variations, long word insertions and deletions

“Approximate String Joins” – VLDB 2001
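A minimal sketch of the edit distance described on this slide, assuming the standard Levenshtein formulation in which each insertion, deletion, or modification costs 1. The function name `edit_distance` is illustrative and does not come from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and modifications needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # modify ca into cb, or match
        prev = curr
    return prev[-1]


print(edit_distance("EUROAFT CORP", "EURODRAFT CORP"))  # 2
```

Because this version counts the space and the period as characters, it returns 5 for COMPUTER SCI. / COMPUTER and 14 for KIA INTERNATIONAL / KIA. The slide's figures of 3 and 13 appear to count letters only.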

Page 5: Text Joins in an RDBMS  for Web Data Integration


Matching Text Attributes using Cosine Similarity

Similar entries should share “infrequent” tokens

EUROAFT CORP ≈ EUROAFT INC (shared token EUROAFT is infrequent: high weight)

EUROAFT CORP ≠ HATRONIC CORP (shared token CORP is common: low weight)

Different token choices result in similarity metrics with different properties

$$\mathrm{Similarity}(t_1, t_2) = \sum_{\mathit{token}} \mathrm{weight}(\mathit{token}, t_1) \cdot \mathrm{weight}(\mathit{token}, t_2)$$

Page 6: Text Joins in an RDBMS  for Web Data Integration


Using Words and Cosine Similarity

Using words as tokens:

Split each entry into words

Similar entries share infrequent words

EUROAFT CORP ≈ EUROAFT INC (shared word EUROAFT is infrequent: high weight)

EUROAFT CORP ≠ HATRONIC CORP (shared word CORP is common: low weight)

Good for word order variations and insertions/deletions of common words: Computer Science Dept. ~ Dept. of Computer Science

Problems with misspellings: Biotechnology Department ≠ Bioteknology Dept.

“WHIRL” – W. Cohen, SIGMOD’98
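The strengths and weaknesses of word tokens can be seen with a small sketch. The tokenizer below is an assumption made for illustration (upper-casing, dropping periods, splitting on whitespace), not the tokenization used in the talk.

```python
def words(entry: str) -> set:
    """Split an entry into a set of word tokens (hypothetical tokenizer:
    case-folded, periods removed, split on whitespace)."""
    return set(entry.upper().replace(".", "").split())


# Word order does not matter: the two entries share three words.
print(words("Computer Science Dept.") & words("Dept. of Computer Science"))

# A misspelling changes the whole word, so nothing is shared.
print(words("Biotechnology Department") & words("Bioteknology Dept."))
```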

Page 7: Text Joins in an RDBMS  for Web Data Integration


Using q-grams and Cosine Similarity

Using q-grams as tokens:

Split each string into small substrings of length q (q-grams)

Similar entries share many, infrequent q-grams

Biotechnology Department: Bio, iot, ote, tec, ech, chn, hno, nol, olo, log, ogy, …, tme, men, ent

Bioteknology Department: Bio, iot, ote, tek, ekn, kno, nol, olo, log, ogy, …, tme, men, ent

Naturally handles misspellings, word order variations, and insertions and deletions of common or short words
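A short sketch of q-gram tokenization as shown on this slide: slide a window of length q over the string. The helper name `qgrams` is illustrative, and no padding characters are added at the string boundaries, which is an assumption based on the slide's example.

```python
def qgrams(text: str, q: int = 3) -> list:
    """Return all substrings of length q, in order of appearance."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


correct = qgrams("Biotechnology")
misspelled = qgrams("Bioteknology")
print(correct)
print(misspelled)

# Despite the misspelling, most q-grams survive in both strings.
print(sorted(set(correct) & set(misspelled)))
```

The single wrong letter only disturbs the q-grams that overlap it, so the two strings still share 7 of their 3-grams.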

Page 8: Text Joins in an RDBMS  for Web Data Integration


Problem

Problem that we address:

Given two relations, report all pairs with cosine similarity above threshold φ

For two entries t1, t2:

$$\mathrm{Similarity}(t_1, t_2) = \sum_{\mathit{token}} \mathrm{weight}(\mathit{token}, t_1) \cdot \mathrm{weight}(\mathit{token}, t_2)$$

0 ≤ Similarity ≤ 1
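The similarity formula can be sketched as follows. The slides do not spell out the weighting scheme, so this example assumes tf-idf weights with a smoothed idf of log(1 + N/df), scaled so that each entry's weight vector has unit length (which keeps the similarity between 0 and 1). The company names other than EUROAFT and HATRONIC are made up to give CORP and INC a realistic frequency.

```python
import math
from collections import Counter


def build_weights(entries, tokenize=str.split):
    """Return one {token: weight} dict per entry, scaled to unit length.
    Assumed weighting: tf * log(1 + N/df)."""
    token_lists = [tokenize(e) for e in entries]
    n = len(entries)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    weighted = []
    for toks in token_lists:
        tf = Counter(toks)
        raw = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted


def similarity(w1, w2):
    """Sum of weight products over the tokens the two entries share."""
    return sum(w * w2[t] for t, w in w1.items() if t in w2)


names = ["EUROAFT CORP", "HATRONIC INC", "HATRONIC CORP", "EUROAFT INC",
         "ACME CORP", "GLOBEX INC", "INITECH CORP", "UMBRELLA INC"]
w = build_weights(names)
print(similarity(w[0], w[3]))  # EUROAFT CORP vs. EUROAFT INC
print(similarity(w[0], w[2]))  # EUROAFT CORP vs. HATRONIC CORP
```

With these assumed weights, the pair sharing the infrequent token EUROAFT scores higher than the pair sharing only the common token CORP.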

Page 9: Text Joins in an RDBMS  for Web Data Integration


Computing Text Joins in an RDBMS

R1
tid  Name
1    EUROAFT CORP
2    HATRONIC INC

R2
tid  Name
1    HATRONIC CORP
2    EUROAFT INC
3    EUROAFT CORP

Create in SQL relations RiWeights (token weights from Ri)

R1Weights
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01

R2Weights
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03

Compute similarity of each tuple pair

SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight*r2w.weight) >= φ

R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   1.00
EUROAFT CORP  HATRONIC CORP  0.01
HATRONIC INC  HATRONIC CORP  0.98
HATRONIC INC  EUROAFT INC    0.02

Computes similarity for many useless pairs

Expensive operation!
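The baseline query from this slide can be tried on the example weights with a small, hedged sketch. It uses Python's built-in sqlite3 module rather than the MS SQL Server setup mentioned in the talk. The threshold φ is passed as a bind parameter, and the extra `similarity` output column and `ORDER BY` clause are additions for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
    CREATE TABLE R2Weights (tid INTEGER, token TEXT, weight REAL);
""")
conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", [
    (1, "EUROAFT", 0.98), (1, "CORP", 0.02),
    (2, "HATRONIC", 0.98), (2, "INC", 0.01),
])
conn.executemany("INSERT INTO R2Weights VALUES (?, ?, ?)", [
    (1, "HATRONIC", 0.98), (1, "CORP", 0.02),
    (2, "EUROAFT", 0.95), (2, "INC", 0.05),
    (3, "EUROAFT", 0.97), (3, "CORP", 0.03),
])

BASELINE = """
    SELECT r1w.tid AS tid1, r2w.tid AS tid2,
           SUM(r1w.weight * r2w.weight) AS similarity
    FROM R1Weights r1w, R2Weights r2w
    WHERE r1w.token = r2w.token
    GROUP BY r1w.tid, r2w.tid
    HAVING SUM(r1w.weight * r2w.weight) >= ?
    ORDER BY tid1, tid2
"""


def baseline_join(phi):
    """Return (tid1, tid2, similarity) for pairs at or above phi."""
    return conn.execute(BASELINE, (phi,)).fetchall()


for row in baseline_join(0.5):
    print(row)
```

Calling `baseline_join(0.0)` shows the cost the slide warns about: every pair that shares any token is joined and scored, including the low-similarity pairs that share only CORP or INC.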

Page 10: Text Joins in an RDBMS  for Web Data Integration


Sampling Step for Text Joins

$$\mathrm{Similarity}(t_1, t_2) = \sum_{\mathit{token}} \mathrm{weight}(\mathit{token}, t_1) \cdot \mathrm{weight}(\mathit{token}, t_2)$$

Similarity is a sum of products

Products cannot be high when weight is small

Can (safely) drop low weights from RiWeights (adapted from [Cohen & Lewis, SODA97] for efficient execution inside an RDBMS)

RiWeights
Token     W
EUROAFT   0.9144
HATRONIC  0.8419
CORP      0.01247
INC       0.00504

→ Sampling 20 times →

RiSample
Token     #TIMES SAMPLED
EUROAFT   18 (18/20=0.90)
HATRONIC  17 (17/20=0.85)

Eliminates low similarity pairs (e.g., “EUROAFT INC” with “HATRONIC INC”)
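The sampling idea can be illustrated with a simplified sketch that mirrors the slide's example, where a token with weight w turns up in roughly w·S of S trials. This is an assumption made for illustration: it draws each token independently with probability equal to its weight and omits the normalization by per-token weight totals that the SQL on the following slide uses. The function name `sample_tokens` is hypothetical.

```python
import random


def sample_tokens(weights, s, rng):
    """Try each token s times, succeeding with probability equal to its
    weight; return {token: times sampled} for tokens drawn at least once."""
    sample = {}
    for token, w in weights.items():
        count = sum(1 for _ in range(s) if rng.random() < w)
        if count:
            sample[token] = count
    return sample


ri_weights = {"EUROAFT": 0.9144, "HATRONIC": 0.8419,
              "CORP": 0.01247, "INC": 0.00504}
ri_sample = sample_tokens(ri_weights, 20, random.Random(0))
for token, count in ri_sample.items():
    print(token, count, count / 20)
```

High-weight tokens are sampled in most trials, so the ratio count/S approximates their weight, while low-weight tokens such as CORP and INC usually vanish from the sample.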

Page 11: Text Joins in an RDBMS  for Web Data Integration


Sampling-Based Text Joins in SQL

R1
tid  Name
1    EUROAFT CORP
2    HATRONIC INC

R1Weights
tid  Token     W
1    EUROAFT   0.98
1    CORP      0.02
2    HATRONIC  0.98
2    INC       0.01

R2Sample
tid  Token     W
1    HATRONIC  0.98
1    CORP      0.02
2    EUROAFT   0.95
2    INC       0.05
3    EUROAFT   0.97
3    CORP      0.03

SELECT r1w.tid AS tid1, r2s.tid AS tid2
FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
GROUP BY r1w.tid, r2s.tid
HAVING SUM(r1w.weight*r2sum.total*r2s.c) >= S*φ

R1            R2             Similarity
EUROAFT CORP  EUROAFT INC    0.98
EUROAFT CORP  EUROAFT CORP   0.9
HATRONIC INC  HATRONIC CORP  0.98

Fully implemented in pure SQL!
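The slide's approach runs entirely in SQL. For a quick experiment, the sketch below fills R2Sample and R2sum from Python and then runs the slide's query in sqlite3. How the two tables are populated is an assumption chosen to be consistent with the HAVING clause: for each token, `total` is the sum of its weights in R2, and S draws are made among the R2 tuples containing it, with probability weight/total, so that `c` counts how often a tuple was drawn. The function name `sampled_join` and the fixed seed are illustrative.

```python
import random
import sqlite3

R1_WEIGHTS = [(1, "EUROAFT", 0.98), (1, "CORP", 0.02),
              (2, "HATRONIC", 0.98), (2, "INC", 0.01)]
R2_WEIGHTS = [(1, "HATRONIC", 0.98), (1, "CORP", 0.02),
              (2, "EUROAFT", 0.95), (2, "INC", 0.05),
              (3, "EUROAFT", 0.97), (3, "CORP", 0.03)]


def sampled_join(phi, s, seed=0):
    """Approximate text join: return (tid1, tid2) pairs whose estimated
    similarity is at least phi, using s samples per token."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R1Weights (tid INTEGER, token TEXT, weight REAL);
        CREATE TABLE R2Sample  (tid INTEGER, token TEXT, c INTEGER);
        CREATE TABLE R2sum     (token TEXT, total REAL);
    """)
    conn.executemany("INSERT INTO R1Weights VALUES (?, ?, ?)", R1_WEIGHTS)

    by_token = {}
    for tid, token, weight in R2_WEIGHTS:
        by_token.setdefault(token, []).append((tid, weight))
    for token, rows in by_token.items():
        total = sum(weight for _, weight in rows)
        conn.execute("INSERT INTO R2sum VALUES (?, ?)", (token, total))
        tids = [tid for tid, _ in rows]
        draws = rng.choices(tids, weights=[wt for _, wt in rows], k=s)
        for tid in set(draws):
            conn.execute("INSERT INTO R2Sample VALUES (?, ?, ?)",
                         (tid, token, draws.count(tid)))

    query = """
        SELECT r1w.tid AS tid1, r2s.tid AS tid2
        FROM R1Weights r1w, R2Sample r2s, R2sum r2sum
        WHERE r1w.token = r2s.token AND r1w.token = r2sum.token
        GROUP BY r1w.tid, r2s.tid
        HAVING SUM(r1w.weight * r2sum.total * r2s.c) >= ? * ?
        ORDER BY tid1, tid2
    """
    return conn.execute(query, (s, phi)).fetchall()


print(sampled_join(phi=0.5, s=128))
```

Under this reading, `total * c / S` estimates a tuple's weight for a token, which is why the threshold on the right-hand side is scaled by S instead of dividing every product on the left.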

Page 12: Text Joins in an RDBMS  for Web Data Integration


Experimental Setup

[Figure: number of tuple pairs (log scale, 1 to 100,000,000) vs. similarity (0.1 to 0.9); series: Q-grams, q=2; Q-grams, q=3; Words]

40,000 entries from AT&T customer database, split into R1 (26,000 entries) and R2 (14,000 entries)

Tokenizations:
  Words
  Q-grams, q=2 & q=3

Methods compared:
  Variations of sample-based joins
  Baseline in SQL
  WHIRL [SIGMOD98], adapted for handling q-grams

Page 13: Text Joins in an RDBMS  for Web Data Integration


Metrics

Execute the (approximate) join for similarity > φ

Precision (measures accuracy): Fraction of the pairs in the answer with real similarity > φ

Recall (measures completeness): Fraction of the pairs with real similarity > φ that are also in the answer

Execution time
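The two accuracy metrics above can be computed from sets of tuple-id pairs, as in this minimal sketch. The function name and the convention of returning 1.0 when a set is empty are assumptions, not part of the talk.

```python
def precision_recall(answer, truth):
    """answer: pairs returned by the approximate join.
    truth: pairs whose real similarity exceeds the threshold."""
    answer, truth = set(answer), set(truth)
    hits = len(answer & truth)
    precision = hits / len(answer) if answer else 1.0  # assumed convention
    recall = hits / len(truth) if truth else 1.0       # assumed convention
    return precision, recall


approximate = [(1, 2), (1, 3), (2, 2)]
exact = [(1, 2), (1, 3), (2, 1)]
print(precision_recall(approximate, exact))
```

Here two of the three returned pairs are correct, and two of the three correct pairs are returned, so both metrics equal 2/3.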

Page 14: Text Joins in an RDBMS  for Web Data Integration


Comparing WHIRL and Sample-based Joins

[Figure, two panels: recall vs. similarity and precision vs. similarity (both axes 0 to 1); series: R1R2, sR1R2, R1sR2, sR1sR2, WHIRL]

Sample-based Joins: Good recall across similarity thresholds

WHIRL: Very low recall (almost 0 recall for thresholds below 0.7)

Page 15: Text Joins in an RDBMS  for Web Data Integration


Changing Sample Size

[Figure, two panels: recall vs. similarity and precision vs. similarity (both axes 0 to 1); series: S=2, S=32, S=64, S=128]

Increased sample size → Better recall, precision

Drawback: Increased execution time

Page 16: Text Joins in an RDBMS  for Web Data Integration


Execution Time

[Figure: execution time in seconds (log scale, 0.1 to 10,000) vs. sample size (S=1 to S=256); series: R1R2, WHIRL, sR1R2, sR1sR2]

WHIRL and sample-based text joins ‘break even’ at S ≈ 64, 128

Page 17: Text Joins in an RDBMS  for Web Data Integration


Contributions

“WHIRL [Cohen, SIGMOD98] inside an RDBMS”: Scalability, no data exporting/importing

Different token choices:
  Words: Captures word swaps, deletion of common words
  Q-grams: All the above, plus spelling mistakes, but slower

SQL statements tested in MS SQL Server and available for download at:

http://www.cs.columbia.edu/~pirot/DataCleaning/

Page 18: Text Joins in an RDBMS  for Web Data Integration


Questions?