Data Cleansing Exercise: Duplicate Detection

Thorsten Papenbrock

PhD Candidate

Hasso-Plattner-Institute

Advanced Profiling

Three important types of metadata

Chart 2

Thorsten Papenbrock, PhD Candidate, 17th November, 2014

Data Profiling with Metanome

Normalization criterion

Atmosphere Rings

Foreign key candidates

Domicile ⊆ Name

Key candidates

|Name|

Name Type Equatorial diameter Mass

Mercury Terrestrial 0.382 0.06

Venus Terrestrial 0.949 0.82

Earth Terrestrial 1.000 1.00

Mars Terrestrial 0.532 0.11

Jupiter Giant 11.209 317.8

Saturn Giant 9.449 95.2

Uranus Giant 4.007 14.6

... ... ... ...

Name Type

Mercury Terrestrial

Venus Terrestrial

Earth Terrestrial

Mars Terrestrial

Jupiter Giant

Saturn Giant

Uranus Giant

... ...

Sign Domicile

Aries Mars

Taurus Venus

Gemini Mercury

Cancer Moon

Leo Sun

Virgo Mercury

Libra Venus

Scorpio Pluto

Sagittarius Jupiter

Capricorn Saturn

Aquarius Uranus

... ...

Name Atmosphere Rings

Mercury minimal no

Venus CO2, N2 no

Earth N2, O2, Ar no

Mars CO2, N2, Ar no

Jupiter H2, He yes

Saturn H2, He yes

Uranus H2, He yes

... ... ...
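The three kinds of metadata on this slide can be checked mechanically. The following is a minimal sketch over the visible rows only; the helper names `is_unique`, `holds_fd` and `missing_values` are made up for this illustration, and `Atmosphere -> Rings` is used merely as an example dependency, since the slide does not spell one out.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]

# Visible rows of the slide's (Name, Atmosphere, Rings) table; "..." rows are omitted.
planets: List[Row] = [
    {"Name": "Mercury", "Atmosphere": "minimal", "Rings": "no"},
    {"Name": "Venus", "Atmosphere": "CO2, N2", "Rings": "no"},
    {"Name": "Earth", "Atmosphere": "N2, O2, Ar", "Rings": "no"},
    {"Name": "Mars", "Atmosphere": "CO2, N2, Ar", "Rings": "no"},
    {"Name": "Jupiter", "Atmosphere": "H2, He", "Rings": "yes"},
    {"Name": "Saturn", "Atmosphere": "H2, He", "Rings": "yes"},
    {"Name": "Uranus", "Atmosphere": "H2, He", "Rings": "yes"},
]

# Visible rows of the slide's (Sign, Domicile) table; "..." rows are omitted.
signs: List[Row] = [
    {"Sign": "Aries", "Domicile": "Mars"},
    {"Sign": "Taurus", "Domicile": "Venus"},
    {"Sign": "Gemini", "Domicile": "Mercury"},
    {"Sign": "Cancer", "Domicile": "Moon"},
    {"Sign": "Leo", "Domicile": "Sun"},
    {"Sign": "Virgo", "Domicile": "Mercury"},
    {"Sign": "Libra", "Domicile": "Venus"},
    {"Sign": "Scorpio", "Domicile": "Pluto"},
    {"Sign": "Sagittarius", "Domicile": "Jupiter"},
    {"Sign": "Capricorn", "Domicile": "Saturn"},
    {"Sign": "Aquarius", "Domicile": "Uranus"},
]


def is_unique(rows: List[Row], columns: Sequence[str]) -> bool:
    """Key candidate check: no two rows agree on all given columns."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(set(projected)) == len(projected)


def holds_fd(rows: List[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """FD check lhs -> rhs: rows with equal lhs values must have equal rhs values."""
    seen: Dict[Tuple[str, ...], Tuple[str, ...]] = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def missing_values(dependent: List[Row], dep_col: str,
                   referenced: List[Row], ref_col: str) -> List[str]:
    """Inclusion check dep_col ⊆ ref_col: returns the violating values (empty if it holds)."""
    referenced_values = {row[ref_col] for row in referenced}
    return sorted({row[dep_col] for row in dependent} - referenced_values)


print(is_unique(planets, ["Name"]))                        # key candidate |Name|
print(holds_fd(planets, ["Atmosphere"], ["Rings"]))        # example FD
print(missing_values(signs, "Domicile", planets, "Name"))  # Domicile ⊆ Name
```

On the visible rows, `Name` is unique and `Atmosphere -> Rings` holds, while the inclusion check reports Moon, Pluto and Sun as missing. That is an effect of the truncated planet table on the slide and says nothing about the full data.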

Exercise 3

Discovery of functional dependencies

Chart 3

■ All teams have passed the exercise:

□ 34 submissions

□ No duplicate algorithm names!

□ Still a few incorrect results (even after correction round)

□ No import errors in Metanome (apart from execution errors)
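For orientation, the task behind these submissions (finding all minimal functional dependencies of a table) can be sketched as a brute-force search. This is an illustrative baseline only, not TANE, FUN, fdep or any team's algorithm; the function names and the three-column example table are assumptions made for this sketch.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

FD = Tuple[Tuple[int, ...], int]  # (left-hand side column indices, right-hand side index)


def holds(rows: Sequence[Sequence[str]], lhs: Tuple[int, ...], rhs: int) -> bool:
    """True if rows that agree on all lhs columns also agree on the rhs column."""
    seen: Dict[Tuple[str, ...], str] = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_minimal_fds(rows: Sequence[Sequence[str]], num_columns: int) -> List[FD]:
    """Enumerate left-hand sides by increasing size and keep only minimal ones."""
    result: List[FD] = []
    for rhs in range(num_columns):
        others = [c for c in range(num_columns) if c != rhs]
        minimal_lhs: List[Tuple[int, ...]] = []
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                # A superset of a known left-hand side holds too, but is not minimal.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    result.append((lhs, rhs))
    return result


# Hypothetical example table with columns 0 = Name, 1 = Atmosphere, 2 = Rings.
rows = [
    ("Mercury", "minimal", "no"),
    ("Venus", "CO2, N2", "no"),
    ("Earth", "N2, O2, Ar", "no"),
    ("Mars", "CO2, N2, Ar", "no"),
    ("Jupiter", "H2, He", "yes"),
    ("Saturn", "H2, He", "yes"),
    ("Uranus", "H2, He", "yes"),
]

for lhs, rhs in discover_minimal_fds(rows, 3):
    print(lhs, "->", rhs)
```

The search visits every column combination that is not pruned, so it is only practical for very narrow tables.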

Exercise 3

Short presentations – Part 1

Chart 4

Functional Dependencies

Exercise 3

Our evaluation

Chart 5

■ DELL Optiplex 9010

□ CPU: Intel i5 3.2 GHz

□ RAM: 8 GB (2 GB for Metanome JVM)

□ OS: Debian 64-bit

□ JVM: Java 1.8

Exercise 3

Correctness for abalone.csv

Chart 6

aiwendil

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd

DreamteamFd

FanctionalDepundancy

fastTane

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Correctness for abalone.csv

Chart 7

aiwendil

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd

DreamteamFd

FanctionalDepundancy

fastTane

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency incorrect

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Runtime for abalone.csv

Chart 8

[Bar chart: runtime in ms (y-axis, 0 to 1,400,000) per algorithm. X-axis labels in the order extracted: fd_finke_dullweber, FD_schaeffer_zo…, FUN, fastTane, Tsun12Fd, FD_grundke_wiese, MMFUncDep, aiwendil, HorLehTane, Tane, klinger_marten_fd, DreamteamFd, FuncDep, RT_FD, PutePute, YuckFunc, FanctionalDepun…, PCFD, AlexoFredFunctio…, smart_data_cat-FD, FrohnOttoFuncDe…, FDFMJR, GottaCatchAllFD, FdPerchykSchmidt, dpdc-cnms-fd, dennis_marius.fd, fdep, SBMMFD, LucieKerstinFD, MyFd, DJ_FD, FD_SPIRO, FD_JungRohloff, FdBotheJoerkeRe…, FD_Kirsten_Zwerg, FunctionalDepen…]

■ Columns: 9

■ Rows: 4,177

■ FDs: 137

Exercise 3

Runtime for abalone.csv (<10s)

Chart 9

[Bar chart: runtime in ms per algorithm, y-axis limited to 0 to 10,000. X-axis labels identical to those of the Chart 8 runtime chart.]

■ Columns: 9

■ Rows: 4,177

■ FDs: 137

Exercise 3

Runtime for abalone.csv (<1s)

Chart 10

[Bar chart: runtime in ms per algorithm, y-axis limited to 0 to 1,000. X-axis labels identical to those of the Chart 8 runtime chart.]

■ Columns: 9

■ Rows: 4,177

■ FDs: 137

Exercise 3

Correctness for bridges.csv

Chart 11

aiwendil

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd

DreamteamFd

FanctionalDepundancy

fastTane

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency incorrect

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Correctness for bridges.csv

Chart 12

aiwendil incorrect

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency incorrect

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Runtime for bridges.csv

Chart 13

[Bar chart: runtime in ms (y-axis, 0 to 35,000) per algorithm. X-axis labels in the order extracted: fdep, FuncDep, Tsun12Fd, fd_finke_dullweber, FD_SPIRO, MMFUncDep, FD_schaeffer_zo…, FUN, HorLehTane, FDFMJR, FanctionalDepun…, fastTane, RT_FD, smart_data_cat-FD, PutePute, Tane, klinger_marten_fd, DreamteamFd, PCFD, FD_grundke_wiese, FdBotheJoerkeRe…, AlexoFredFunctio…, FrohnOttoFuncDe…, YuckFunc, MyFd, FD_JungRohloff, FD_Kirsten_Zwerg, GottaCatchAllFD, DJ_FD, LucieKerstinFD, dennis_marius.fd, SBMMFD, FunctionalDepen…]

■ Columns: 13

■ Rows: 108

■ FDs: 142

Exercise 3

Runtime for bridges.csv (<1s)

Chart 14

[Bar chart: runtime in ms per algorithm, y-axis limited to 0 to 1,000. X-axis labels identical to those of the Chart 13 runtime chart.]

■ Columns: 13

■ Rows: 108

■ FDs: 142

Exercise 3

Correctness for hepatitis.csv

Chart 15

aiwendil incorrect

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency incorrect

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Correctness for hepatitis.csv

Chart 16

aiwendil incorrect

AlexoFredFunctionals

dennis_marius.fd

DJ_FD

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane SerializationError

fd_finke_dullweber

FD_grundke_wiese

FD_JungRohloff

FD_Kirsten_Zwerg

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus

FDFMJR

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE

FuncDep

FunctionalDependencyDetector

FunctionalDerpendency incorrect

GottaCatchAllFD

HorLehTane

klinger_marten_fd

LucieKerstinFD

MMFUncDep

MyFd

PCFD

PutePute

RT_FD

SBMMFD

smart_data_cat-FD

Tsun12Fd

YuckFunc

Exercise 3

Correctness for hepatitis.csv

Chart 17

aiwendil incorrect

AlexoFredFunctionals > 1h ?

dennis_marius.fd > 1h ?

DJ_FD > 1h ?

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane SerializationError

fd_finke_dullweber

FD_grundke_wiese > 1h

FD_JungRohloff > 1h ?

FD_Kirsten_Zwerg > 1h ?

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus > 1h ?

FDFMJR > 1h

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE > 1h ?

FuncDep

FunctionalDependencyDetector > 1h ?

FunctionalDerpendency incorrect

GottaCatchAllFD > 1h ?

HorLehTane

klinger_marten_fd

LucieKerstinFD > 1h ?

MMFUncDep

MyFd > 1h ?

PCFD > 1h

PutePute

RT_FD

SBMMFD > 1h ?

smart_data_cat-FD

Tsun12Fd

YuckFunc > 1h

Exercise 3

Runtime for hepatitis.csv

Chart 18

[Bar chart: runtime in ms per algorithm (y-axis, 0 to 1,400,000); x-axis labels not recoverable.]

■ Columns: 20

■ Rows: 155

■ FDs: 8,250

Exercise 3

Runtime for hepatitis.csv (<10s)

Chart 19

[Bar chart: runtime in ms per algorithm, y-axis limited to 0 to 10,000; x-axis labels not recoverable.]

■ Columns: 20

■ Rows: 155

■ FDs: 8,250

Exercise 3

Correctness for fd-reduced-15.csv

Chart 20

aiwendil incorrect

AlexoFredFunctionals > 1h ?

dennis_marius.fd > 1h ?

DJ_FD > 1h ?

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane SerializationError

fd_finke_dullweber

FD_grundke_wiese > 1h

FD_JungRohloff > 1h ?

FD_Kirsten_Zwerg > 1h ?

FD_schaeffer_zoellner

FD_SPIRO

FdBotheJoerkeReissaus > 1h ?

FDFMJR > 1h

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE > 1h ?

FuncDep

FunctionalDependencyDetector > 1h ?

FunctionalDerpendency incorrect

GottaCatchAllFD > 1h ?

HorLehTane

klinger_marten_fd

LucieKerstinFD > 1h ?

MMFUncDep

MyFd > 1h ?

PCFD > 1h

PutePute

RT_FD

SBMMFD > 1h ?

smart_data_cat-FD

Tsun12Fd

YuckFunc > 1h

Exercise 3

Correctness for fd-reduced-15.csv

Chart 21

aiwendil incorrect

AlexoFredFunctionals > 1h ?

dennis_marius.fd > 1h ?

DJ_FD > 1h ?

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane SerializationError

fd_finke_dullweber incorrect

FD_grundke_wiese > 1h

FD_JungRohloff > 1h ?

FD_Kirsten_Zwerg > 1h ?

FD_schaeffer_zoellner

FD_SPIRO > 30min

FdBotheJoerkeReissaus > 1h ?

FDFMJR > 1h

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE > 1h ?

FuncDep

FunctionalDependencyDetector > 1h ?

FunctionalDerpendency incorrect

GottaCatchAllFD > 1h ?

HorLehTane > 30min

klinger_marten_fd

LucieKerstinFD > 1h ?

MMFUncDep

MyFd > 1h ?

PCFD > 1h

PutePute

RT_FD > 30min

SBMMFD > 1h ?

smart_data_cat-FD

Tsun12Fd

YuckFunc > 1h

Exercise 3

Runtime for fd-reduced-15.csv

Chart 22

■ Columns: 30

■ Rows: 250,000

■ FDs: 89,571

[Bar chart: runtime in seconds per algorithm (y-axis, 0 to 800); x-axis labels not recoverable.]

Exercise 3

Runtime for fd-reduced-15.csv (<30sec)

Chart 23

■ Columns: 30

■ Rows: 250,000

■ FDs: 89,571

[Bar chart: runtime in seconds per algorithm, y-axis limited to 0 to 30; x-axis labels not recoverable.]
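The column counts on the runtime slides (9, 13, 20 and 30) may partly explain why the wider tables were so much harder. As a rough sketch, assuming candidates of the form X → A where A is not in X and X may be empty, a table with n columns has n · 2^(n−1) candidate FDs. The function name `fd_candidate_count` is made up for this illustration, and actual discovery algorithms prune most of this space.

```python
def fd_candidate_count(num_columns: int) -> int:
    """Candidates X -> A: n choices for A, any subset of the other n-1 columns for X."""
    return num_columns * 2 ** (num_columns - 1)


# Column counts as stated on the runtime slides.
datasets = {
    "abalone.csv": 9,
    "bridges.csv": 13,
    "hepatitis.csv": 20,
    "fd-reduced-15.csv": 30,
}

for name, columns in datasets.items():
    print(f"{name}: {columns} columns, {fd_candidate_count(columns):,} candidate FDs")
```

The count depends only on the number of columns; the number of rows affects the cost of validating each candidate, not the size of the candidate space.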

Exercise 3

Correctness for plista1k.csv

Chart 24

aiwendil incorrect

AlexoFredFunctionals > 1h ?

dennis_marius.fd > 1h ?

DJ_FD > 1h ?

dpdc-cnms-fd SerializationError

DreamteamFd

FanctionalDepundancy

fastTane SerializationError

fd_finke_dullweber incorrect

FD_grundke_wiese > 1h

FD_JungRohloff > 1h ?

FD_Kirsten_Zwerg > 1h ?

FD_schaeffer_zoellner

FD_SPIRO > 30min

FdBotheJoerkeReissaus > 1h ?

FDFMJR > 1h

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE > 1h ?

FuncDep

FunctionalDependencyDetector > 1h ?

FunctionalDerpendency incorrect

GottaCatchAllFD > 1h ?

HorLehTane > 30min

klinger_marten_fd

LucieKerstinFD > 1h ?

MMFUncDep

MyFd > 1h ?

PCFD > 1h

PutePute

RT_FD > 30min

SBMMFD > 1h ?

smart_data_cat-FD

Tsun12Fd

YuckFunc > 1h

Exercise 3

Correctness for plista1k.csv

Chart 25

aiwendil incorrect

AlexoFredFunctionals > 1h ?

dennis_marius.fd > 1h ?

DJ_FD > 1h ?

dpdc-cnms-fd SerializationError

DreamteamFd > 1h ?

FanctionalDepundancy > 1h ?

fastTane SerializationError

fd_finke_dullweber incorrect

FD_grundke_wiese > 1h

FD_JungRohloff > 1h ?

FD_Kirsten_Zwerg > 1h ?

FD_schaeffer_zoellner OutOfMemory

FD_SPIRO > 30min

FdBotheJoerkeReissaus > 1h ?

FDFMJR > 1h

FdPerchykSchmidt SerializationError

FrohnOttoFuncDep-LAME-TANE > 1h ?

FuncDep

FunctionalDependencyDetector > 1h ?

FunctionalDerpendency incorrect

GottaCatchAllFD > 1h ?

HorLehTane > 30min

klinger_marten_fd OutOfMemory

LucieKerstinFD > 1h ?

MMFUncDep OutOfMemory

MyFd > 1h ?

PCFD > 1h

PutePute ArrayIndexOutOfBounds

RT_FD > 30min

SBMMFD > 1h ?

smart_data_cat-FD > 1h

Tsun12Fd > 1h

YuckFunc > 1h

Exercise 3

Correctness for plista1k.csv

Chart 26

Algorithm Runtime [ms]

FuncDep 7,043

fdep 18,492

TANE OutOfMemory

FUN OutOfMemory

Exercise 3

Short presentations – Part 2

Chart 27

Functional Dependencies

Data Cleansing

Duplicate Detection

Chart 28

Exercise 4

Duplicate Detection

Chart 29
