![Page 1: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/1.jpg)
Qualitative Data Cleaning
Xu Chu Ihab Ilyas
![Page 2: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/2.jpg)
Many Definitions and One Goal
“Extract Value from Data”
For that we:
- Remove errors
- Fill missing info
- Transform units and formats
- Map and align columns
- Remove duplicate records
- Fix integrity constraint violations
![Page 3: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/3.jpg)
For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
NYtimes August, 2014
Yes big data is a big business opportunity, but the business value won’t be realized if the information isn’t governed
Forbes Business
![Page 4: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/4.jpg)
Many Technical Challenges
- Record Linkage and Deduplication
  - Similarity measures
  - Machine learning for classifying pairs as duplicates or not (unsupervised, supervised, and active)
  - Clustering and handling of transitivity
  - Merging and consolidation of records

A major firm spends 6 months on a single deduplication project of a subset of their data sources
![Page 5: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/5.jpg)
Example: Data Deduplication
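Unclean Relation

| ID | name | ZIP | Income |
|---|---|---|---|
| P1 | Green | 51519 | 30k |
| P2 | Green | 51518 | 32k |
| P3 | Peter | 30528 | 40k |
| P4 | Peter | 30528 | 40k |
| P5 | Gree | 51519 | 55k |
| P6 | Chuck | 51519 | 30k |

Steps: Compute Pair-wise Similarity, Cluster Similar Records, Merge Clusters

[Figure: similarity graph over records P1 to P6 with edge weights 0.3, 0.5, 0.9, and 1.0; similar records are clustered and the clusters are merged into C1, C2, and C3]

Clean Relation

| ID | name | ZIP | Income |
|---|---|---|---|
| C1 | Green | 51519 | 39k |
| C2 | Peter | 30528 | 40k |
| C3 | Chuck | 51519 | 30k |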
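The three steps of this example (pairwise similarity, clustering, merging) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used on the slide: the similarity function (average of `difflib` ratios on name and ZIP), the `THRESHOLD` of 0.8, and the merge policy (most common name and ZIP, mean income) are all hypothetical choices.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Unclean relation from the slide: id -> (name, ZIP, income in thousands).
records = {
    "P1": ("Green", "51519", 30),
    "P2": ("Green", "51518", 32),
    "P3": ("Peter", "30528", 40),
    "P4": ("Peter", "30528", 40),
    "P5": ("Gree", "51519", 55),
    "P6": ("Chuck", "51519", 30),
}
THRESHOLD = 0.8  # assumed cut-off, not taken from the slide


def similarity(a, b):
    """Average string similarity of name and ZIP (an assumed measure)."""
    name = SequenceMatcher(None, a[0], b[0]).ratio()
    zip_code = SequenceMatcher(None, a[1], b[1]).ratio()
    return (name + zip_code) / 2


def cluster(recs, threshold):
    """Union-find clustering: similar pairs end up in the same cluster."""
    parent = {r: r for r in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(recs, 2):
        if similarity(recs[a], recs[b]) >= threshold:
            parent[find(b)] = find(a)
    groups = {}
    for r in recs:
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


def merge(recs, members):
    """Consolidate a cluster: majority name and ZIP, mean income."""
    name = Counter(recs[m][0] for m in members).most_common(1)[0][0]
    zip_code = Counter(recs[m][1] for m in members).most_common(1)[0][0]
    income = round(sum(recs[m][2] for m in members) / len(members))
    return (name, zip_code, income)


clusters = cluster(records, THRESHOLD)
clean = [merge(records, c) for c in clusters]
```

Using union-find means transitivity is handled implicitly: if P1 matches P2 and P1 matches P5, all three land in one cluster even when a direct pair scores lower.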
![Page 6: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/6.jpg)
Many Technical Challenges
- Missing Values
  - Interpreting different types of Nulls
  - Certain answer semantics on possible worlds (many, many papers)
  - Closed-world vs. open-world assumptions and multiple interesting hardness results

Most real data collected from sensors, surveys, and agents has a high percentage of N/A or nulls, special values (99999), etc.
![Page 7: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/7.jpg)
Many Technical Challenges
- More Complex Integrity Constraints
  - A declarative language to express data quality rules
  - Ad-hoc repair algorithm to repair violations for each data quality formalism under certain minimality requirements
  - Limited expressiveness (e.g., FD) to get tangible results

Unfortunately, such constraints are rarely expressed in practice. Most curation tools are rule-based and implemented in imperative languages.
![Page 8: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/8.jpg)
Example ICs
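Employee Table

| | ID | FN | LN | ROLE | CITY | ST | SAL |
|---|---|---|---|---|---|---|---|
| t1 | 105 | Anne | Nash | M | NYC | NY | 110 |
| t2 | 211 | Mark | White | E | SJ | CA | 80 |
| t3 | 386 | Mark | Lee | E | NYC | AZ | 75 |
| t4 | 235 | John | Smith | M | NYC | NY | 1200 |

Functional dependency: CITY → ST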
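To make the functional dependency concrete, the following minimal sketch checks CITY → ST on the Employee table by comparing every pair of tuples. The function name `fd_violations` and the dictionary layout are illustrative choices, not part of the original slides.

```python
from itertools import combinations

# Employee table from the slide, keyed by tuple id.
employees = {
    "t1": {"ID": 105, "FN": "Anne", "LN": "Nash", "ROLE": "M", "CITY": "NYC", "ST": "NY", "SAL": 110},
    "t2": {"ID": 211, "FN": "Mark", "LN": "White", "ROLE": "E", "CITY": "SJ", "ST": "CA", "SAL": 80},
    "t3": {"ID": 386, "FN": "Mark", "LN": "Lee", "ROLE": "E", "CITY": "NYC", "ST": "AZ", "SAL": 75},
    "t4": {"ID": 235, "FN": "John", "LN": "Smith", "ROLE": "M", "CITY": "NYC", "ST": "NY", "SAL": 1200},
}


def fd_violations(rows, lhs, rhs):
    """Return pairs of tuple ids that agree on all lhs attributes but differ on rhs."""
    violations = []
    for (i, a), (j, b) in combinations(rows.items(), 2):
        if all(a[x] == b[x] for x in lhs) and a[rhs] != b[rhs]:
            violations.append((i, j))
    return violations


city_st = fd_violations(employees, ["CITY"], "ST")
```

On this table the check flags the pairs that involve t3, whose state (AZ) disagrees with the other NYC tuples. The FD alone does not say which cell is wrong, only that the pair is inconsistent.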
![Page 9: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/9.jpg)
Example ICs
Employee Table

| | ID | FN | LN | ROLE | CITY | ST | SAL |
|---|---|---|---|---|---|---|---|
| t1 | 105 | Anne | Nash | M | NYC | NY | 110 |
| t2 | 211 | Mark | White | E | SJ | CA | 80 |
| t3 | 386 | Mark | Lee | E | NYC | AZ | 75 |
| t4 | 235 | John | Smith | M | NYC | NY | 1200 |

Business Rule: For two employees of the same role, the one who lives in NYC cannot earn less than the one who does not live in NYC
![Page 10: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/10.jpg)
Example ICs
Employee Table

| | ID | FN | LN | ROLE | CITY | ST | SAL |
|---|---|---|---|---|---|---|---|
| t1 | 105 | Anne | Nash | M | NYC | NY | 110 |
| t2 | 211 | Mark | White | E | SJ | CA | 80 |
| t3 | 386 | Mark | Lee | E | NYC | AZ | 75 |
| t4 | 235 | John | Smith | M | NYC | NY | 1200 |

Business Rule: For two employees of the same role in the same city, their salary difference cannot be greater than 100
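The two business rules on the Employee Table slides go beyond what an FD can express, but they are still easy to check as pairwise conditions. The sketch below is one possible encoding; the function names and the compact row layout are assumptions made for illustration.

```python
from itertools import combinations, permutations

# Employee table from the slides: tuple id -> (ROLE, CITY, SAL).
employees = {
    "t1": ("M", "NYC", 110),
    "t2": ("E", "SJ", 80),
    "t3": ("E", "NYC", 75),
    "t4": ("M", "NYC", 1200),
}


def nyc_pay_violations(rows):
    """Same role, a lives in NYC, b does not, yet a earns less than b."""
    return [
        (i, j)
        for (i, a), (j, b) in permutations(rows.items(), 2)
        if a[0] == b[0] and a[1] == "NYC" and b[1] != "NYC" and a[2] < b[2]
    ]


def salary_gap_violations(rows, max_gap=100):
    """Same role and same city, but salaries differ by more than max_gap."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(rows.items(), 2)
        if a[0] == b[0] and a[1] == b[1] and abs(a[2] - b[2]) > max_gap
    ]


rule_nyc = nyc_pay_violations(employees)
rule_gap = salary_gap_violations(employees)
```

The first rule is asymmetric, so it is evaluated over ordered pairs; the second is symmetric, so unordered pairs suffice.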
![Page 11: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/11.jpg)
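Common Data Quality Issues

| ID | Name | ZIP | City | State | Income |
|---|---|---|---|---|---|
| 1 | Green | 60610 | Chicago | IL | 30k |
| 2 | Green | 60611 | Chicago | IL | 32k |
| 3 | Peter | | New Yrk | NY | 40k |
| 4 | John | 11507 | New York | NY | 40k |
| 5 | Gree | 90057 | Los Angeles | CA | 55k |
| 6 | Chuck | 90057 | San Francisco | CA | 30k |

Issues annotated on the table: Duplicates, Syntactic Error, Integrity Constraint Violation, Missing Value

Corrected values shown on the slide: the merged record "1 Green 60610 Chicago IL 31k"; "11507" and "New York"; "Los Angeles"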
![Page 12: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/12.jpg)
Data Cleaning Process
- Error Detection
  - Qualitative
  - Quantitative (outlier detection)
- Error Repairing
  - Transformation scripts
  - Human involvement
![Page 13: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/13.jpg)
We Will Not Cover
- Details of Deduplication
  - Multiple surveys and tutorials
- Data Profiling: discovering FDs, INDs, etc.
  - Wenfei Fan and Floris Geerts synthesis lecture book
  - Ziawasch Abedjan et al. tutorial
- Consistent Query Answering
  - Leo Bertossi synthesis lecture book
![Page 14: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/14.jpg)
Error Detection Techniques Taxonomy
Qualitative Error Detection Techniques
- Error Type (What to detect?)
  - IC: FD, CFD, DC, Others
  - Data deduplication
- Automation (How to detect?)
  - Automatic
  - Human guided
- Analytics Layer (Where to detect?)
  - Source
  - Target

[Ilyas and Chu, Foundations and Trends in Database Systems, 2015]
![Page 15: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/15.jpg)
![Page 16: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/16.jpg)
FDs and CFDs [Bohannon et al, ICDE 2007]
- Functional Dependency (FD): X → Y
  - Example: City → ST or Name, Phone → ID
- Conditional Functional Dependency (CFD): (X → Y, Tp)
  - An FD defined on a subset of the data
  - Examples:
    - ZIP → Street is valid on the subset of the data where Country = "England"
    - AC = 020 → City = London
![Page 17: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/17.jpg)
Matching Dependencies (MDs) [Fan et al, VLDB 2009]
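Tran

| FN | LN | St | City | AC | Post | Phn | Item |
|---|---|---|---|---|---|---|---|
| Robrt | Brady | 5WrenSt | Ldn | 020 | WC1H9SE | 3887834 | watch |
| Robert | Brady | Null | Ldn | 020 | WC1E7HX | 3887644 | necklace |

Master: Card

| FN | LN | St | City | AC | Zip | Tel |
|---|---|---|---|---|---|---|
| Robert | Brady | 5WrenSt | Ldn | 020 | WC1H9SE | 3887644 |

MD: Tran[LN, City, St, Post] = Card[LN, City, St, Zip] ∧ Tran[FN] ≈ Card[FN] → Tran[FN, Phn] ⇌ Card[FN, Tel]

Values shown as the repair for the first Tran tuple: FN = Robert, Phn = 3887644

[Fan et al, SIGMOD 2011]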
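The matching dependency on this slide can be read operationally: when a Tran tuple agrees exactly with a master Card tuple on LN, City, St, and Post/Zip, and the first names are similar, FN and Phn are identified with the master's FN and Tel. The sketch below illustrates that reading. The similarity test (`difflib` ratio with a 0.8 cut-off) and the helper names are assumptions, since the slide does not say how ≈ is implemented.

```python
from difflib import SequenceMatcher

# Tran tuples and the master Card tuple from the slide.
tran = [
    {"FN": "Robrt", "LN": "Brady", "St": "5WrenSt", "City": "Ldn", "AC": "020",
     "Post": "WC1H9SE", "Phn": "3887834", "Item": "watch"},
    {"FN": "Robert", "LN": "Brady", "St": None, "City": "Ldn", "AC": "020",
     "Post": "WC1E7HX", "Phn": "3887644", "Item": "necklace"},
]
card = [
    {"FN": "Robert", "LN": "Brady", "St": "5WrenSt", "City": "Ldn", "AC": "020",
     "Zip": "WC1H9SE", "Tel": "3887644"},
]

# Attribute pairs that must match exactly: (Tran attribute, Card attribute).
EXACT = [("LN", "LN"), ("City", "City"), ("St", "St"), ("Post", "Zip")]


def similar(a, b, cutoff=0.8):
    """Assumed implementation of the similarity operator on first names."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def apply_md(tran_rows, master_rows):
    """Copy FN and Tel from the master tuple into every matching Tran tuple."""
    for t in tran_rows:
        for m in master_rows:
            exact = all(t[x] is not None and t[x] == m[y] for x, y in EXACT)
            if exact and similar(t["FN"], m["FN"]):
                t["FN"], t["Phn"] = m["FN"], m["Tel"]


apply_md(tran, card)
```

Only the first Tran tuple satisfies the premise; the second has a null street and a different postcode, so it is left untouched.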
![Page 18: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/18.jpg)
Denial Constraints (DCs) [Chu et al, VLDB 2013]
- A universal constraint that dictates that a set of predicates cannot be true together
- Each predicate expresses a relationship between two cells, or between a cell and a constant

Formal Definition: $\forall t_\alpha, t_\beta, \ldots \in R,\ \neg(P_1 \wedge \ldots \wedge P_m)$
![Page 19: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/19.jpg)
Denial Constraints (DCs)
Functional dependency: CITY → ST
Business Rule: Two employees of the same Role, the one who lives in NYC cannot earn less than the one who does not live in NYC
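The DC formulas for these two rules are not preserved in this transcript. A plausible rendering, written in the same notation the deck uses for its other DCs and assuming the attribute names of the Employee table (CITY, ST, ROLE, SAL), would be:

```latex
\begin{align*}
&\forall t_\alpha, t_\beta \in R,\ \neg\big(t_\alpha.\mathrm{CITY} = t_\beta.\mathrm{CITY} \wedge t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST}\big) \\
&\forall t_\alpha, t_\beta \in R,\ \neg\big(t_\alpha.\mathrm{ROLE} = t_\beta.\mathrm{ROLE} \wedge t_\alpha.\mathrm{CITY} = \text{NYC} \wedge t_\beta.\mathrm{CITY} \neq \text{NYC} \wedge t_\alpha.\mathrm{SAL} < t_\beta.\mathrm{SAL}\big)
\end{align*}
```

The first line restates the FD as "no two tuples agree on CITY and differ on ST"; the second forbids an NYC employee earning less than a non-NYC employee with the same role.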
![Page 20: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/20.jpg)
Other ICs
- CINDs [Ma et al, TCS 2014]
- Metric Functional Dependencies [Koudas et al, ICDE 2009]
- Dependable Fixes
  - Editing Rules [Fan et al, VLDB 2010]
  - Fixing Rules [Wang and Tang, SIGMOD 2014]
  - Sherlock Rules [Interlandi and Tang, ICDE 2015]
![Page 21: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/21.jpg)
Constraint Languages
[Figure: constraint languages placed along two axes, language expressiveness and reasoning and discovery complexity: FDs, CFDs, DCs, …, Programmatic Interface]
![Page 22: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/22.jpg)
Integrity Constraints Discovery
- Schema Driven
  - Usually sensitive to the size of the schema
  - Good for long thin tables!
- Instance Driven
  - Usually sensitive to the size of the data
  - Good for fat short tables!
- Hybrid
  - Try to get the best of both worlds
![Page 23: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/23.jpg)
Integrity Constraints Discovery
- FD Discovery:
  - TANE: Schema-driven [Huhtala et al, Computer Journal 1999]
  - FASTFD: Instance-driven [Wyss et al, DaWaK 2001]
  - Hybrid [Papenbrock et al, SIGMOD 2016]
- DC Discovery:
  - FASTDC: Instance-driven [Chu et al, VLDB 2013]
![Page 24: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/24.jpg)
![Page 25: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/25.jpg)
FD Discovery
- Given a relational instance I of schema R, where |R| = m, find (all) minimal, non-trivial FDs that are valid on I. An FD is
  - Valid on I if no two tuples in I violate the FD
  - Minimal if removing any attribute from its LHS makes it invalid
  - Trivial if the RHS is a subset of the LHS
- We want FDs with only one attribute in the RHS
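The problem statement above can be turned into a brute-force search, which is useful as a reference point before looking at TANE or FASTFD. The sketch below is a naive enumeration, not any of the published algorithms: it tries left-hand sides in order of increasing size and skips supersets of an already valid LHS, so every reported FD is minimal and non-trivial. Empty left-hand sides (constant columns) are ignored for simplicity, and the three-attribute sample is taken from the Employee table used earlier in the deck.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """X -> A is valid if no two rows agree on X but differ on A."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_fds(rows, attrs):
    """Return all minimal, non-trivial FDs (lhs tuple, rhs) with a single RHS attribute."""
    fds = []
    for rhs in attrs:
        others = [a for a in attrs if a != rhs]  # excluding rhs keeps FDs non-trivial
        minimal = []
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                if any(set(m) <= set(lhs) for m in minimal):
                    continue  # a subset already determines rhs, so lhs is not minimal
                if holds(rows, lhs, rhs):
                    minimal.append(lhs)
        fds.extend((lhs, rhs) for lhs in minimal)
    return fds


rows = [
    {"ROLE": "M", "CITY": "NYC", "ST": "NY"},
    {"ROLE": "E", "CITY": "SJ", "ST": "CA"},
    {"ROLE": "E", "CITY": "NYC", "ST": "AZ"},
    {"ROLE": "M", "CITY": "NYC", "ST": "NY"},
]
fds = discover_fds(rows, ["ROLE", "CITY", "ST"])
```

The number of candidate left-hand sides grows exponentially with the number of attributes, which is why schema-driven methods are sensitive to schema size. Note also that on dirty data the search reports whatever holds on the instance: here CITY → ST is not found, because the erroneous tuple breaks it.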
![Page 26: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/26.jpg)
TANE [Huhtala et al, Computer Journal 1999]
- Generate space of FDs
![Page 27: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/27.jpg)
TANE
- FD Validation
X→Y is a valid FD if and only if
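The condition itself is not preserved in this transcript. In the TANE paper, validity is tested with partitions: X → A holds if and only if the partition of the tuples by X has the same number of equivalence classes as the partition by X ∪ {A}, that is, |π_X| = |π_{X∪{A}}|. The sketch below illustrates that test; the function names and the sample rows are illustrative, and real implementations use stripped partitions rather than sets of projected values.

```python
def partition_size(rows, attrs):
    """Number of equivalence classes when tuples are grouped by their values on attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})


def fd_holds(rows, lhs, rhs):
    """Partition-based test: X -> A holds iff adding A does not split any class of X."""
    lhs = list(lhs)
    return partition_size(rows, lhs) == partition_size(rows, lhs + [rhs])


rows = [
    {"CITY": "NYC", "ST": "NY"},
    {"CITY": "SJ", "ST": "CA"},
    {"CITY": "NYC", "ST": "AZ"},
    {"CITY": "NYC", "ST": "NY"},
]
city_to_st = fd_holds(rows, ["CITY"], "ST")
st_to_city = fd_holds(rows, ["ST"], "CITY")
```

Grouping by CITY gives two classes while grouping by CITY and ST gives three, so CITY → ST fails on this instance; ST → CITY keeps three classes in both cases and therefore holds.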
![Page 28: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/28.jpg)
DC Discovery: Axioms
Triviality

$\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{SAL} = t_\beta.\mathrm{SAL} \wedge t_\alpha.\mathrm{SAL} > t_\beta.\mathrm{SAL})$ is a trivial DC
![Page 29: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/29.jpg)
DC Discovery: Axioms
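Augmentation

If $\neg(P_1 \wedge \ldots \wedge P_n)$ is valid, then $\neg(P_1 \wedge \ldots \wedge P_n \wedge Q)$ is also valid

- $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ZIP} = t_\beta.\mathrm{ZIP} \wedge t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST})$
- $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ZIP} = t_\beta.\mathrm{ZIP} \wedge t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST} \wedge t_\alpha.\mathrm{SAL} < t_\beta.\mathrm{SAL})$: Not Minimal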
![Page 30: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/30.jpg)
DC Discovery: Axioms
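Transitivity (more like composition)

If $\neg(P_1 \wedge \ldots \wedge P_n \wedge Q_1)$ and $\neg(R_1 \wedge \ldots \wedge R_m \wedge Q_2)$ are valid, and $Q_2 \in \mathrm{Imp}(\overline{Q_1})$, then $\neg(P_1 \wedge \ldots \wedge P_n \wedge R_1 \wedge \ldots \wedge R_m)$ is valid

- $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ZIP} = t_\beta.\mathrm{ZIP} \wedge t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST})$, with $Q_1$: $t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST}$
- $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ST} = t_\beta.\mathrm{ST} \wedge t_\alpha.\mathrm{SAL} < t_\beta.\mathrm{SAL} \wedge t_\alpha.\mathrm{TR} > t_\beta.\mathrm{TR})$, with $Q_2$: $t_\alpha.\mathrm{ST} = t_\beta.\mathrm{ST}$
- Therefore $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ZIP} = t_\beta.\mathrm{ZIP} \wedge t_\alpha.\mathrm{SAL} < t_\beta.\mathrm{SAL} \wedge t_\alpha.\mathrm{TR} > t_\beta.\mathrm{TR})$ is valid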
![Page 31: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/31.jpg)
DC Discovery
Given a relational schema R and an instance I, find all non-trivial, minimal DCs that hold on I
Focus on DCs involving at most two tuples
![Page 32: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/32.jpg)
FASTDC [Chu et al, VLDB 2013]
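- The space of predicates (Employee relation with attributes I, M, S)
  - $P_1: t_\alpha.I = t_\beta.I$, $P_2: t_\alpha.I \neq t_\beta.I$
  - $P_3: t_\alpha.M = t_\beta.M$, $P_4: t_\alpha.M \neq t_\beta.M$
  - $P_5: t_\alpha.S = t_\beta.S$, $P_6: t_\alpha.S \neq t_\beta.S$, $P_7: t_\alpha.S > t_\beta.S$, $P_8: t_\alpha.S \leq t_\beta.S$, $P_9: t_\alpha.S < t_\beta.S$, $P_{10}: t_\alpha.S \geq t_\beta.S$
  - $P_{11}: t_\alpha.I = t_\alpha.M$, $P_{12}: t_\alpha.I \neq t_\alpha.M$, $P_{13}: t_\alpha.I = t_\beta.M$, $P_{14}: t_\alpha.I \neq t_\beta.M$
- Any combination of predicates constitutes a candidate DC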
![Page 33: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/33.jpg)
FASTDC
$\neg(P_i \wedge P_j \wedge P_k)$ is a valid DC w.r.t. $I$

Equivalently:
- For every tuple pair in $I$, $P_i, P_j, P_k$ cannot be true together
- For every tuple pair in $I$, at least one of $P_i, P_j, P_k$ is false
- For every tuple pair in $I$, at least one of $\overline{P_i}, \overline{P_j}, \overline{P_k}$ is true
- $\{\overline{P_i}, \overline{P_j}, \overline{P_k}\}$ covers the set of true predicates for every tuple pair
![Page 34: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/34.jpg)
FASTDC
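Evidence set $Evi_I$ (the set of true predicates for each tuple pair):
- $\{P_2, P_3, P_5, P_8, P_{10}, P_{12}, P_{14}\}$
- $\{P_2, P_3, P_6, P_8, P_9, P_{12}, P_{14}\}$
- $\{P_2, P_3, P_6, P_7, P_{10}, P_{11}, P_{13}\}$

Tuple pairs: ⟨t2, t3⟩, ⟨t3, t2⟩; ⟨t1, t2⟩, ⟨t1, t3⟩; ⟨t2, t1⟩, ⟨t3, t1⟩

$\{P_{10}, P_{14}\}$ covers the set of true predicates for every tuple pair, so $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.S < t_\beta.S \wedge t_\alpha.I = t_\beta.M)$ is a valid DC

$\{P_5, P_{10}, P_{14}\}$ also covers the set of true predicates for every tuple pair, so $\neg(\overline{P_{10}} \wedge \overline{P_{14}} \wedge \overline{P_5})$ is a valid DC, but not minimal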
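The link between covers of the evidence set and valid DCs can be checked mechanically. The sketch below builds the evidence set for the predicates P1 to P14 and enumerates minimal covers by brute force. The Employee instance is a hypothetical one, chosen because it reproduces the three evidence sets listed on the slide (the slide's own table is not preserved here), and the exhaustive search stands in for FASTDC's depth-first search with pruning.

```python
from itertools import combinations, permutations

# Hypothetical Employee instance (assumption): (I = id, M = manager id, S = salary).
employee = {"t1": ("A1", "A1", 50), "t2": ("A2", "A1", 40), "t3": ("A3", "A1", 40)}

# Predicate space P1..P14 from the slide, over an ordered tuple pair (a, b).
PREDICATES = {
    1: lambda a, b: a[0] == b[0], 2: lambda a, b: a[0] != b[0],
    3: lambda a, b: a[1] == b[1], 4: lambda a, b: a[1] != b[1],
    5: lambda a, b: a[2] == b[2], 6: lambda a, b: a[2] != b[2],
    7: lambda a, b: a[2] > b[2], 8: lambda a, b: a[2] <= b[2],
    9: lambda a, b: a[2] < b[2], 10: lambda a, b: a[2] >= b[2],
    11: lambda a, b: a[0] == a[1], 12: lambda a, b: a[0] != a[1],
    13: lambda a, b: a[0] == b[1], 14: lambda a, b: a[0] != b[1],
}
# Each predicate's complement (for example P9 is the complement of P10).
COMPLEMENT = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7,
              9: 10, 10: 9, 11: 12, 12: 11, 13: 14, 14: 13}


def evidence_set(rows):
    """Set of true predicates for every ordered tuple pair."""
    return {
        frozenset(p for p, holds in PREDICATES.items() if holds(a, b))
        for (_, a), (_, b) in permutations(rows.items(), 2)
    }


def is_cover(preds, evidence):
    """A cover intersects the true-predicate set of every tuple pair."""
    return all(preds & e for e in evidence)


def minimal_covers(evidence, max_size=3):
    """Brute-force enumeration of minimal covers up to max_size predicates."""
    covers = []
    for size in range(1, max_size + 1):
        for cand in combinations(sorted(PREDICATES), size):
            cand = frozenset(cand)
            if any(m <= cand for m in covers):
                continue  # a smaller cover is contained in cand, so cand is not minimal
            if is_cover(cand, evidence):
                covers.append(cand)
    return covers


def cover_to_dc(cover):
    """The DC forbids the complements of the cover's predicates from holding together."""
    return sorted(COMPLEMENT[p] for p in cover)


evi = evidence_set(employee)
covers = minimal_covers(evi)
```

The cover {P10, P14} maps to the DC over P9 and P13, which is the constraint shown on the slide. This sketch does not filter trivial DCs: a cover such as {P8, P10} yields a DC over P7 and P9, which can never be violated, so a real discovery algorithm would discard it.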
![Page 35: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/35.jpg)
FASTDC
[Figure: search tree over predicates (P2, P3, P5, P6, P8, P10, P11, P12, P13, P14) used to find minimal covers of $Evi_I$; legend: valid DC, pruned branch, invalid DC]

$Evi_I$:
- $\{P_2, P_3, P_5, P_8, P_{10}, P_{12}, P_{14}\}$
- $\{P_2, P_3, P_6, P_8, P_9, P_{12}, P_{14}\}$
- $\{P_2, P_3, P_6, P_7, P_{10}, P_{11}, P_{13}\}$
![Page 36: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/36.jpg)
FASTDC
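- Key: {AC, PH}: $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{AC} = t_\beta.\mathrm{AC} \wedge t_\alpha.\mathrm{PH} = t_\beta.\mathrm{PH})$
- Domain: MS ∈ {S, M}: $\forall t_\alpha \in R,\ \neg(t_\alpha.\mathrm{MS} \neq \text{S} \wedge t_\alpha.\mathrm{MS} \neq \text{M})$
- FD: ZIP → ST: $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ZIP} = t_\beta.\mathrm{ZIP} \wedge t_\alpha.\mathrm{ST} \neq t_\beta.\mathrm{ST})$
- CFD: CT = Los Angeles → ST = CA: $\forall t_\alpha \in R,\ \neg(t_\alpha.\mathrm{CT} = \text{Los Angeles} \wedge t_\alpha.\mathrm{ST} \neq \text{CA})$
- Check: SAL ≥ STX: $\forall t_\alpha \in R,\ \neg(t_\alpha.\mathrm{SAL} < t_\alpha.\mathrm{STX})$
- Business logic: $\forall t_\alpha, t_\beta \in R,\ \neg(t_\alpha.\mathrm{ST} = t_\beta.\mathrm{ST} \wedge t_\alpha.\mathrm{SAL} < t_\beta.\mathrm{SAL} \wedge t_\alpha.\mathrm{TR} > t_\beta.\mathrm{TR})$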
![Page 37: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/37.jpg)
![Page 38: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/38.jpg)
![Page 39: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/39.jpg)
Holistic Error Detection
- Vertex: Cell in the database
- Hyperedge: A set of cells that violate a DC

Employee Table

| | ID | FN | LN | ROLE | ZIP | ST | SAL |
|---|---|---|---|---|---|---|---|
| t1 | 105 | Anne | Nash | E | 85376 | NY | 110 |
| t2 | 211 | Mark | White | M | 90012 | NY | 80 |
| t3 | 386 | Mark | Lee | E | 85376 | AZ | 75 |

Constraint: ZIP → ST

Hyperedge e1: {t1.ZIP, t1.ST, t3.ZIP, t3.ST}

[Chu et al, ICDE 2013] [Kolahi and Lakshmanan, ICDT 2009]
![Page 40: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/40.jpg)
Holistic Error Detection
Same Employee Table as on the previous slide.

Hyperedge e2: {t1.ROLE, t1.ST, t1.SAL, t2.ROLE, t2.ST, t2.SAL}
![Page 41: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/41.jpg)
Holistic Error Detection
Same Employee Table, with both hyperedges shown together:
- e1 (ZIP → ST): {t1.ZIP, t1.ST, t3.ZIP, t3.ST}
- e2: {t1.ROLE, t1.ST, t1.SAL, t2.ROLE, t2.ST, t2.SAL}

[Figure: conflict hypergraph in which e1 and e2 overlap on the cell t1.ST]
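A conflict hypergraph like the one on these slides can be built by evaluating each constraint on every ordered tuple pair and recording the cells involved in each violation. The sketch below does this for the Employee table. The FD ZIP → ST comes from the slide; the second constraint is not stated in this transcript, so `manager_pay` (a manager cannot earn less than an employee in the same state) is a hypothetical DC chosen only because it touches the cells listed for e2.

```python
from collections import Counter
from itertools import permutations

rows = {
    "t1": {"ROLE": "E", "ZIP": "85376", "ST": "NY", "SAL": 110},
    "t2": {"ROLE": "M", "ZIP": "90012", "ST": "NY", "SAL": 80},
    "t3": {"ROLE": "E", "ZIP": "85376", "ST": "AZ", "SAL": 75},
}


def fd_zip_st(a, b):
    """Violation of ZIP -> ST: same ZIP, different state."""
    return a["ZIP"] == b["ZIP"] and a["ST"] != b["ST"]


def manager_pay(a, b):
    """Hypothetical DC (assumption): in one state, a manager earns less than an employee."""
    return a["ROLE"] == "M" and b["ROLE"] == "E" and a["ST"] == b["ST"] and a["SAL"] < b["SAL"]


# Each constraint is paired with the attributes whose cells take part in a violation.
CONSTRAINTS = [(fd_zip_st, ("ZIP", "ST")), (manager_pay, ("ROLE", "ST", "SAL"))]


def hyperedges(table):
    """One hyperedge (a set of cells) per violating tuple pair and constraint."""
    edges = set()
    for (i, a), (j, b) in permutations(table.items(), 2):
        for violated, attrs in CONSTRAINTS:
            if violated(a, b):
                edges.add(frozenset((t, x) for t in (i, j) for x in attrs))
    return edges


edges = hyperedges(rows)
# Cells that appear in several hyperedges are natural first candidates for repair.
degree = Counter(cell for edge in edges for cell in edge)
```

Counting how often each cell occurs shows why a holistic view helps: t1.ST takes part in both violations, so changing that one cell could resolve both.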
![Page 42: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/42.jpg)
![Page 43: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/43.jpg)
CrowdER: [Wang et al, VLDB 2012]
- Human-Intelligence Task (HIT)

$O(n^2)$ ✗
![Page 44: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/44.jpg)
CrowdER: Batching Strategies
- Pair-based HIT

$O(n^2/k)$ ✗
![Page 45: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/45.jpg)
CrowdER: Batching Strategies
- Cluster-based HIT

$O(n^2/k^2)$ ✗
![Page 46: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/46.jpg)
CrowdER: Workflow
Candidate pairs with likelihoods:
(r1, r2, 0.90), (r4, r6, 0.85), (r1, r7, 0.82), (r3, r4, 0.76), (r4, r7, 0.70), (r8, r9, 0.55), (r2, r3, 0.45), (r2, r7, 0.35), (r3, r5, 0.31), (r4, r5, 0.20), (r3, r6, 0.15), (r1, r3, 0.10), ...

(a) Remove the pairs whose likelihoods < 0.2

(b) Generate HITs to verify the pairs of records. Each pair-based HIT shows two pairs with a YES/NO choice:
- (r1, r2), (r4, r6)
- (r1, r7), (r3, r4)
- (r4, r7), (r8, r9)
- (r2, r3), (r2, r7)
- (r3, r5), (r4, r5)

(c) Output matching pairs: (r1, r2), (r1, r7), (r3, r4), (r2, r7)
![Page 47: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/47.jpg)
CrowdER: Workflow

Candidate pairs with likelihoods (threshold 0.2):
(r1, r2, 0.90), (r4, r6, 0.85), (r1, r7, 0.82), (r3, r4, 0.76), (r4, r7, 0.70), (r8, r9, 0.55), (r2, r3, 0.45), (r2, r7, 0.35), (r3, r5, 0.31), (r4, r5, 0.20), (r3, r6, 0.15), (r1, r3, 0.10), ...

[Figure: graph over records r1 to r9 whose edges are the remaining candidate pairs]

Cluster-based HITs:
- HIT1: r1, r2, r4, r7
- HIT2: r3, r4, r5, r6
- HIT3: r2, r3, r8, r9

Given a cluster-size threshold k, minimize the number of HITs (NP-Hard)
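The workflow numbers on these slides can be replayed in a few lines: filter the candidate pairs by likelihood, batch the survivors into pair-based HITs, and check that the three cluster-based HITs shown on the slide cover every surviving pair. This is only a sanity check of the example, not CrowdER's HIT-generation algorithm; the variable names and the batch size of two pairs per HIT are read off the slide or assumed.

```python
# Candidate pairs with machine-computed likelihoods, as listed on the slide.
likelihood = {
    ("r1", "r2"): 0.90, ("r4", "r6"): 0.85, ("r1", "r7"): 0.82, ("r3", "r4"): 0.76,
    ("r4", "r7"): 0.70, ("r8", "r9"): 0.55, ("r2", "r3"): 0.45, ("r2", "r7"): 0.35,
    ("r3", "r5"): 0.31, ("r4", "r5"): 0.20, ("r3", "r6"): 0.15, ("r1", "r3"): 0.10,
}
THRESHOLD = 0.2
PAIRS_PER_HIT = 2  # pair-based batching, two pairs per HIT as drawn on the slide
CLUSTER_SIZE = 4   # cluster-size threshold k for cluster-based HITs

# (a) Remove the pairs whose likelihood is below the threshold.
candidates = [pair for pair, score in likelihood.items() if score >= THRESHOLD]

# (b) Pair-based HITs: fixed-size batches of candidate pairs.
pair_hits = [candidates[i:i + PAIRS_PER_HIT] for i in range(0, len(candidates), PAIRS_PER_HIT)]

# Cluster-based HITs from the slide: each HIT is a small group of records.
cluster_hits = [
    {"r1", "r2", "r4", "r7"},
    {"r3", "r4", "r5", "r6"},
    {"r2", "r3", "r8", "r9"},
]

# A set of cluster-based HITs is usable only if every candidate pair
# has both of its records inside at least one HIT.
all_covered = all(any(set(pair) <= hit for hit in cluster_hits) for pair in candidates)
within_size = all(len(hit) <= CLUSTER_SIZE for hit in cluster_hits)
```

In this example the same ten candidate pairs need five pair-based HITs but only three cluster-based HITs, which is the saving the batching strategy is after.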
![Page 48: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/48.jpg)
![Page 49: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/49.jpg)
Decoupled in Space and Time
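[Figure: documents (pdf, www, word) pass through Extraction into sources S1, S2, S3 (Employees, Shops), which pass through Transformation into the target T]

Rules:
- (1) In the same shop, the average salary for the managers (Grd=2) should be higher than the one for the staff (Grd=1): $\neg(t_\alpha.\mathrm{Shop} = t_\beta.\mathrm{Shop} \wedge t_\alpha.\mathrm{AvgSal} > t_\beta.\mathrm{AvgSal} \wedge t_\alpha.\mathrm{Grd} < t_\beta.\mathrm{Grd})$
- (2) A bigger shop cannot have a smaller number of staff: $\neg(t_\alpha.\mathrm{Size} > t_\beta.\mathrm{Size} \wedge t_\alpha.\mathrm{\#Emps} < t_\beta.\mathrm{\#Emps})$
- (3) Phone number must have country code and local number
- (4) S1.NAME is NOT NULL
- (5) length(S3.NAME) < 30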
![Page 50: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/50.jpg)
Calls for a New Solution
[Figure: matrix of Constraints Declaration (Source, Target) against Error Fixing (Source, Target), with cells labelled Traditional Data Repair Algorithms (twice), Dependency Propagation, and Descriptive and Prescriptive Data Cleaning]

- DBRx: [Chalamalla et al., SIGMOD 2014]
- DataXRay: [Wang et al., SIGMOD 2015]
- QOCO: [Bergman et al., VLDB 2015]
![Page 51: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/51.jpg)
DBRx Architecture [Chalamalla et al, SIGMOD 2014]
![Page 52: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/52.jpg)
Technical Challenges
- Error Propagation
  - Blowup (e.g., Aggregates)
  - Propagation Level (Violations vs. Fixes)
  - Distributing Responsibilities
- Source Error Identification
  - Assign Weights based on Query and Error Semantics
  - Accumulate Evidence (different Violation Semantics)
- Explain Errors
![Page 53: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/53.jpg)
Tracing the Sources of Errors
Emps EId Name Dept Sal Grd SId JoinYr
t1 e4 John S 91 1 NY1 2012
t2 e5 Anne D 99 2 NY1 2012
t3 e7 Mark S 93 1 NY1 2012
t4 e8 Claire S 116 1 NY1 2012
t5 e11 Ian R 89 1 NY2 2012
t6 e13 Laure R 94 2 NY2 2012
t7 e14 Mary E 91 1 NY2 2012
t8 e18 Bill D 98 2 NY2 2012
t9 e14 Mike R 94 2 LA1 2011
t10 e18 Claire E 116 2 LA1 2011
Shops SId City State Size Start
t11 NY1 NYC NY 46 ft2 2011
t12 NY2 NYC NY 62 ft2 2012
t13 LA1 LA CA 35 ft2 2011
T Shop Size Grd AvgSal #Emps Region
ta NY1 46 ft2 2 99 $ 1 US
tb NY1 46 ft2 1 100 $ 3 US
tc NY2 62 ft2 2 96 $ 2 US
td NY2 62 ft2 1 90 $ 2 US
te LA1 35 ft2 2 105 $ 2 US
tf LND 38 ft2 1 65 £ 2 EU
Average salary of higher grade in the same shop should be higher!

SELECT Shops.SId AS Shop, Size, Emps.Grd, AVG(Emps.Sal) AS AvgSal, COUNT(EId) AS "#Emps", 'US' AS Region
FROM US.Emps JOIN US.Shops ON Emps.SId = Shops.SId
GROUP BY Shops.SId, Size, Emps.Grd
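To see where the violation in T comes from, the aggregation query can be replayed over the Emps and Shops tables and the rule checked on the result. The sketch below is a plain-Python stand-in for the SQL, using only the columns the rule needs; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import permutations

# Emps from the slide: (EId, Dept, Sal, Grd, SId).
emps = [
    ("e4", "S", 91, 1, "NY1"), ("e5", "D", 99, 2, "NY1"),
    ("e7", "S", 93, 1, "NY1"), ("e8", "S", 116, 1, "NY1"),
    ("e11", "R", 89, 1, "NY2"), ("e13", "R", 94, 2, "NY2"),
    ("e14", "E", 91, 1, "NY2"), ("e18", "D", 98, 2, "NY2"),
    ("e14", "R", 94, 2, "LA1"), ("e18", "E", 116, 2, "LA1"),
]
# Shops from the slide: SId -> Size.
shops = {"NY1": 46, "NY2": 62, "LA1": 35}

# Equivalent of the JOIN plus GROUP BY SId, Size, Grd.
salaries = defaultdict(list)
for eid, dept, sal, grd, sid in emps:
    salaries[(sid, grd)].append(sal)

target = [
    {"Shop": sid, "Size": shops[sid], "Grd": grd,
     "AvgSal": sum(sals) / len(sals), "#Emps": len(sals)}
    for (sid, grd), sals in sorted(salaries.items())
]

# Rule (1) as a DC: same shop, lower grade, yet higher average salary.
violations = [
    (a["Shop"], a["Grd"], b["Grd"])
    for a, b in permutations(target, 2)
    if a["Shop"] == b["Shop"] and a["AvgSal"] > b["AvgSal"] and a["Grd"] < b["Grd"]
]
```

The only violation is in shop NY1, where grade 1 averages 100 against 99 for grade 2. The violation is visible only on the aggregated target, while the cause sits in the source tuples that feed the NY1 groups, which is what the following slides trace back.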
![Page 54: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/54.jpg)
Error Contribution Scores
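Emps

| | EId | [CSV] | Sal | [CSV] | Grd | [CSV] | SId | [CSV] | [RSV] |
|---|---|---|---|---|---|---|---|---|---|
| t1 | e4 | ['', 1/3] | 91 | [91/300, ''] | 1 | [1/3, 1/3] | NY1 | [1/3, ''] | [0, 1] |
| t2 | e5 | | 99 | [0, ''] | 2 | [1, ''] | NY1 | [1, ''] | [1, ''] |
| t3 | e7 | ['', 1/3] | 93 | [93/300, ''] | 1 | [1/3, 1/3] | NY1 | [1/3, ''] | [0, 1] |
| t4 | e8 | ['', 1/3] | 116 | [116/300, ''] | 1 | [1/3, 1/3] | NY1 | [1/3, ''] | [1, 1] |
| t5 | e11 | ['', 1/2] | 89 | | 1 | ['', 1/2] | NY2 | | ['', 0] |
| t6 | e13 | | 94 | | 2 | | NY2 | | [] |
| t7 | e14 | ['', 1/2] | 91 | | 1 | ['', 1/2] | NY2 | | ['', 0] |
| t8 | e18 | | 98 | | 2 | | NY2 | | [] |
| t9 | e14 | | 94 | | 2 | | LA1 | | [] |
| t10 | e18 | | 116 | | 2 | | LA1 | | [] |

Shops

| | SId | [CSV] | Size | [CSV] | [RSV] |
|---|---|---|---|---|---|
| t12 | NY1 | [2, ''] | 46 | ['', 1] | [1, 1] |
| t13 | NY2 | | 62 | ['', 1] | ['', 1] |
| t14 | LA1 | | 35 | | [] |

csv(c): Contribution of this cell to the aggregate

rsv(t): Removing t4 eliminates the violations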
![Page 55: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/55.jpg)
Identifying Likely Errors
p Maximize a gain function of adding more source errors
55
tid Score
s1 0.67
s2 0.54
s3 0.47
s4 0.08
s5 0.06
s6 0.05
Gain({s1, s2}) = 1.08, Gain({s1, s2, s3}) = 1.28, Gain({s1, s2, s3, s4}) = -0.08
$cv(s) = csv(s) + rsv(s)$
$Gain(H_v) = \sum_{s \in H_v} cv(s) - \sum_{1 \le j \le |H_v|} \sum_{j < k \le |H_v|} D(s_j, s_k)$
$D(s_j, s_k) = |cv(s_j) - cv(s_k)|$
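The gain function can be checked against the three gains on this slide with a short sketch. The greedy loop (add tuples in descending score order and stop when the gain drops) is an assumption made for illustration, not necessarily the search the authors use; `gain` and `select_likely_errors` are hypothetical names.

```python
from itertools import combinations

def gain(scores):
    """Gain(H) = sum of cv(s) minus the sum of pairwise |cv(sj) - cv(sk)|."""
    total = sum(scores)
    spread = sum(abs(a - b) for a, b in combinations(scores, 2))
    return total - spread

def select_likely_errors(cv):
    """Assumed greedy search: add tuples by descending score while gain grows."""
    ranked = sorted(cv.items(), key=lambda kv: kv[1], reverse=True)
    chosen, best = [], float("-inf")
    for tid, score in ranked:
        g = gain([s for _, s in chosen] + [score])
        if g <= best:
            break  # adding this tuple no longer increases the gain
        chosen.append((tid, score))
        best = g
    return [tid for tid, _ in chosen], best

# Scores taken from the slide's tid/Score table.
cv = {"s1": 0.67, "s2": 0.54, "s3": 0.47, "s4": 0.08, "s5": 0.06, "s6": 0.05}
likely, best_gain = select_likely_errors(cv)
print(likely, round(best_gain, 2))
```

Under these assumptions the loop stops after s3, because adding s4 introduces large pairwise distances that outweigh its small score.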
![Page 56: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/56.jpg)
Explanation Recall Precision Concise
Low High Concise
High High Verbose
High Low Concise
Error Explanation
56
Possible Explanations
eid = e4 ∨ eid = e7 ∨ eid = e8 ∨ eid = e14
Dept = S
Grd = 1
Likely Error Tuples
Emps EId Name Dept Sal Grd SId JoinYr
t1 e4 John S 91 1 NY1 2012
t2 e5 Anne D 99 2 NY1 2012
t3 e7 Mark S 93 1 NY1 2012
t4 e8 Claire S 116 1 NY1 2012
t5 e11 Ian R 89 1 NY2 2012
t6 e13 Laure R 94 2 NY2 2012
t7 e14 Mary E 91 1 NY2 2012
t8 e18 Bill D 98 2 NY2 2012
t9 e14 Mike R 94 2 LA1 2011
t10 e18 Claire E 116 2 LA1 2011
![Page 57: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/57.jpg)
Data Repairing
57
![Page 58: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/58.jpg)
Data Repairing Techniques Taxonomy
58
Data Repairing Techniques
Repair target (What to repair?): Data, Rules, Both
Automation (How to repair?): Automatic, Human guided
Update model (Where to repair?): In place, Model based
Holistic, Piece-meal
[Ilyas and Chu, Foundations and Trends in Database Systems, 2015]
![Page 60: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/60.jpg)
Repair Automation
p Most automatic repairing techniques adopt the “minimality” of repairs principle
p Repairing techniques in practice are predominantly manual and semi-automatic at best
p Will survey both
60
![Page 63: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/63.jpg)
63
Data Repair by Value Update
p I is a dirty database if I ⊭ Σ, and Ij is a repair for I if Ij ⊨ Σ
p For a repair Ij, Δ(Ij) is the set of changed cells in Ij
Σ = {A → B}
I A B
t1 1 2
t2 1 3
t3 1 3
t4 4 5
I1 A B
t1 1 3
t2 1 3
t3 1 3
t4 4 5
Δ(I1) = {t1[B]}
I2 A B
t1 1 2
t2 1 2
t3 1 2
t4 4 5
Δ(I2) = {t2[B], t3[B]}
![Page 64: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/64.jpg)
Data Only Repairing
p Cardinality-Minimal repairs
n Commonly used in obtaining a single repair automatically
n Repairs with the minimum number of changes
n I1 is Card-Min iff ∄I2 s.t. |Δ(I2)| < |Δ(I1)|
64
FD: {A → B}
I A B
t1 1 2
t2 1 3
t3 1 3
t4 4 5
I1 A B
t1 1 3
t2 1 3
t3 1 3
t4 4 5
I2 A B
t1 1 2
t2 1 2
t3 1 2
t4 4 5
I3 A B
t1 1 5
t2 1 5
t3 1 5
t4 4 5
I4 A B
t1 7 3
t2 1 3
t3 1 3
t4 4 5
![Page 66: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/66.jpg)
FD Repairing [Bohannon et al, SIGMOD 2005]
66
FD: {A → B}
I A B
t1 1 2
t2 1 3
t3 1 3
t4 4 5
Building Equivalence Classes
I A B
t1 1 2
t2 1 3
t3 1 3
t4 4 5
Resolving Equivalence Classes
I A B
t1 1 3
t2 1 3
t3 1 3
t4 4 5
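The two steps can be sketched for a single FD as follows. This is a simplification under stated assumptions: classes are built only from tuples that agree on the left-hand side, and each class is resolved by majority vote, whereas the cited work uses a cost model and handles several constraints at once. `repair_fd` is a hypothetical name.

```python
from collections import Counter, defaultdict

def repair_fd(rows, lhs, rhs):
    # Building equivalence classes: tuples that agree on lhs must share one
    # rhs value, so their rhs cells fall into the same class.
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[row[lhs]].append(i)
    # Resolving equivalence classes: assumed rule, take the most frequent
    # value in each class (the original algorithm minimizes a repair cost).
    repaired = [dict(r) for r in rows]
    for members in classes.values():
        target, _ = Counter(rows[i][rhs] for i in members).most_common(1)[0]
        for i in members:
            repaired[i][rhs] = target
    return repaired

rows = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 1, "B": 3}, {"A": 4, "B": 5}]
fixed = repair_fd(rows, "A", "B")
print(fixed)
```

On the slide's instance the class for A = 1 holds the values 2, 3, 3, so t1[B] is set to 3, matching the final table.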
![Page 68: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/68.jpg)
Holistic Data Repairing [Chu et al, ICDE 2013]
p Vertex: Cell in the database
p Hyperedge: A set of cells that violate a DC
68
Employee ID FN LN ROLE ZIP ST SAL
t1 105 Anne Nash E 85376 NY 110
t2 211 Mark White M 90012 NY 80
t3 386 Mark Lee E 85376 AZ 75
Zip → ST
[Figure: conflict hypergraph with hyperedges e1 and e2 over the cells t1.ROLE, t2.ROLE, t1.ST, t2.ST, t1.SAL, t2.SAL, t1.ZIP, t3.ZIP, t3.ST]
![Page 69: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/69.jpg)
Step1: Minimal Vertex Cover
69
Employee ID FN LN ROLE ZIP ST SAL
t1 105 Anne Nash E 85376 NY 110
t2 211 Mark White M 90012 NY 80
t3 386 Mark Lee E 85376 AZ 75
Zip → ST
[Figure: conflict hypergraph with hyperedges e1 and e2 over the cells t1.ROLE, t2.ROLE, t1.ST, t2.ST, t1.SAL, t2.SAL, t1.ZIP, t3.ZIP, t3.ST]
p A minimal set of vertices that intersects every hyperedge
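A brute-force sketch of this step is given below. The membership of each hyperedge is an assumption inferred from the cells listed in the figure and from the Zip → ST rule; the slide does not spell it out. Exhaustive search is exponential and only meant for this tiny example; a practical system would likely use an approximation.

```python
from itertools import combinations

def minimum_vertex_cover(hyperedges):
    """Smallest set of vertices that intersects every hyperedge (brute force)."""
    if not hyperedges:
        return set()
    vertices = sorted(set().union(*hyperedges))
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            if all(edge & set(subset) for edge in hyperedges):
                return set(subset)
    return set()

# Assumed hyperedges: one for the Zip -> ST violation between t1 and t3,
# one for the violation between t1 and t2 involving ROLE, ST, and SAL.
zip_edge = {"t1.ZIP", "t3.ZIP", "t1.ST", "t3.ST"}
role_edge = {"t1.ROLE", "t2.ROLE", "t1.ST", "t2.ST", "t1.SAL", "t2.SAL"}
cover = minimum_vertex_cover([zip_edge, role_edge])
print(cover)
```

Under these assumptions t1.ST is the only cell shared by both hyperedges, so it alone covers them, which is consistent with the later steps that repair t1.ST.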
![Page 70: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/70.jpg)
Step2: Collect Repair Requirements
70
[Figure: conflict hypergraph with hyperedges e1 and e2 over the cells t1.ROLE, t2.ROLE, t1.ST, t2.ST, t1.SAL, t2.SAL, t1.ZIP, t3.ZIP, t3.ST]
Zip → ST
p A set of conditions that need to be satisfied to resolve all violations
Condition to resolve e1 by changing t1.ST:
t1.ST = t3.ST
Condition to resolve e2 by changing t1.ST:
t1.ST != t2.ST
![Page 71: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/71.jpg)
Step3: Get Updates
71
p A set of assignments satisfying the conditions, with minimal number of changed cells
t1.ST = t3.ST
t1.ST != t2.ST
Employee ID FN LN ROLE ZIP ST SAL
t1 105 Anne Nash E 85376 NY 110
t2 211 Mark White M 90012 NY 80
t3 386 Mark Lee E 85376 AZ 75
Gradually increase the number of cells that are going to be changed, until a solution is reached
Suppose we only want to change t1.ST: with t2.ST = NY and t3.ST = AZ, the conditions give t1.ST = AZ
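The "gradually increase" idea can be sketched as a search over assignments that tries zero changed cells first, then one, then two, and so on. This sketch searches over all ST cells and a hand-picked value domain, both of which are simplifying assumptions; `get_updates` is a hypothetical name.

```python
from itertools import combinations, product

def get_updates(cells, conditions, domain):
    """Return the smallest set of cell updates that satisfies all conditions."""
    names = sorted(cells)
    for k in range(len(names) + 1):          # number of cells allowed to change
        for changed in combinations(names, k):
            for values in product(domain, repeat=k):
                candidate = dict(cells)
                candidate.update(zip(changed, values))
                really_changed = all(candidate[c] != cells[c] for c in changed)
                if really_changed and all(cond(candidate) for cond in conditions):
                    return {c: candidate[c] for c in changed}
    return None

cells = {"t1.ST": "NY", "t2.ST": "NY", "t3.ST": "AZ"}
conditions = [
    lambda c: c["t1.ST"] == c["t3.ST"],   # resolves e1
    lambda c: c["t1.ST"] != c["t2.ST"],   # resolves e2
]
updates = get_updates(cells, conditions, ["AZ", "NY"])
print(updates)
```

With no changes the first condition fails; allowing a single change finds t1.ST = AZ, the update shown on the slide.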
![Page 72: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/72.jpg)
More Holistic Data Repairing [Fan et al, SIGMOD 2011]
72
Tran
FN LN St City AC Post Phn Item
Bob Brady 5WrenSt Edi 020 WC1H9SE 3887834 watch
Robert Brady Null Ldn 020 WC1E7HX 3887644 necklace
Master: Card
FN LN St City AC Zip Tel
Robert Brady 5WrenSt Ldn 020 WC1H9SE 3887644
CFD: Tran(AC = 020 → City = Ldn), repaired value: Ldn
CFD: Tran(FN = Bob → FN = Robert), repaired value: Robert
MD: Tran[LN, City, St, Post] = Card[LN, City, St, Zip] ∧ Tran[FN] ≈ Card[FN] → Tran[FN, Phn] ⇔ Card[FN, Tel], repaired value: 3887644
FD: Tran(City, Phn → St, AC, Post), repaired value: 5 Wren St
![Page 74: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/74.jpg)
Data & Rules Repairing: Motivating Example
p Car Database
n Model → Make was satisfied by Car databases till Mazda 323 was introduced (Conflicting with BMW 323)
n Could be corrected to Model, Cylinders → Make
p US presidents Database
n LastName, FirstName → StartYear, EndYear was satisfied till the election of George W. Bush
n Should be corrected to LastName, MiddleInit, FirstName → StartYear, EndYear
74
[Chiang and Miller, ICDE 2011][Beskales et al, ICDE 2013]
![Page 75: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/75.jpg)
Relative Trust
p In reality, both data and constraints (FDs) can be wrong
p The relative trust in data vs. FDs determines how we should repair data and FDs
75
![Page 76: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/76.jpg)
Example
76
Surname, GivenName → Income
GivenName Surname BirthDate Gender Phone Income
t1 Danielle Blake 9 Dec 1970 Female 817-213-1211 120k
t2 Danielle Blake 9 Dec 1970 Female 817-988-9211 100k
t3 Hong Li 27 Oct 1972 Female 591-977-1244 90k
t4 Hong Li 8 Mar 1979 Female 498-214-5822 84k
t5 Ning Wu 3 Nov 1982 Male 313-134-9241 90k
t6 Ning Wu 8 Nov 1982 Male 323-456-3452 95k
![Page 77: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/77.jpg)
Example (Trusted FD)
77
Surname, GivenName → Income
GivenName Surname BirthDate Gender Phone Income
t1 Danielle Blake 9 Dec 1970 Female 817-213-1211 120k
t2 Danielle Blake 9 Dec 1970 Female 817-988-9211 120k
t3 Hong Li 27 Oct 1972 Female 591-977-1244 90k
t4 Hong Li 8 Mar 1979 Female 498-214-5822 90k
t5 Ning Wu 3 Nov 1982 Male 313-134-9241 95k
t6 Ning Wu 8 Nov 1982 Male 323-456-3452 95k
![Page 78: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/78.jpg)
Example (Trusted Data)
78
Surname, GivenName, BirthDate, Phone → Income
GivenName Surname BirthDate Gender Phone Income
t1 Danielle Blake 9 Dec 1970 Female 817-213-1211 120k
t2 Danielle Blake 9 Dec 1970 Female 817-988-9211 100k
t3 Hong Li 27 Oct 1972 Female 591-977-1244 90k
t4 Hong Li 8 Mar 1979 Female 498-214-5822 84k
t5 Ning Wu 3 Nov 1982 Male 313-134-9241 90k
t6 Ning Wu 8 Nov 1982 Male 323-456-3452 95k
![Page 79: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/79.jpg)
Example (Equally-trusted Data and FD)
79
Surname, GivenName, BirthDate → Income
GivenName Surname BirthDate Gender Phone Income
t1 Danielle Blake 9 Dec 1970 Female 817-213-1211 120k
t2 Danielle Blake 9 Dec 1970 Female 817-988-9211 120k
t3 Hong Li 27 Oct 1972 Female 591-977-1244 90k
t4 Hong Li 8 Mar 1979 Female 498-214-5822 84k
t5 Ning Wu 3 Nov 1982 Male 313-134-9241 90k
t6 Ning Wu 8 Nov 1982 Male 323-456-3452 95k
![Page 80: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/80.jpg)
Data Repair
p We repair instance I by modifying multiple cells and produce I’
p distd(I,I’) is the number of different cells between I and I’
80
![Page 81: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/81.jpg)
Repairing a set of FDs
p We repair an FD X → A by adding one or more attributes to the LHS
p Let w(Y) be a weight reflecting the penalty of adding attribute set Y to X
n E.g., the number of attributes in Y, distinct values of Y in I, entropy of Y
p Let distc(Σ,Σ’) be the sum of w(Y) across all changed FDs
81
![Page 82: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/82.jpg)
Relative Trust [Beskales et al, ICDE 2013]
82
[Figure: space of repairs (I’, Σ’) with I’ ⊨ Σ’, plotted with axes distc(Σ,Σ’) and distd(I,I’); 𝛕 marks the maximum number of cell changes]
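The trade-off controlled by 𝛕 can be sketched on the six-tuple example from the preceding slides. The sketch assumes w(Y) is the number of added attributes, takes the candidate LHS extensions from a hand-written list, and counts only right-hand-side changes when computing distd; the cited work searches the space of FD modifications more generally. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def min_cell_changes(rows, lhs, rhs):
    """Fewest rhs cells to change so that lhs -> rhs holds (single FD)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())

def repair_for_trust(rows, base_lhs, rhs, extensions, tau):
    """Cheapest FD extension (w(Y) = |Y|) needing at most tau cell changes."""
    for extra in sorted(extensions, key=len):
        changes = min_cell_changes(rows, base_lhs + extra, rhs)
        if changes <= tau:
            return base_lhs + extra, changes
    return None

columns = ["GivenName", "Surname", "BirthDate", "Phone", "Income"]
data = [
    ("Danielle", "Blake", "9 Dec 1970", "817-213-1211", "120k"),
    ("Danielle", "Blake", "9 Dec 1970", "817-988-9211", "100k"),
    ("Hong", "Li", "27 Oct 1972", "591-977-1244", "90k"),
    ("Hong", "Li", "8 Mar 1979", "498-214-5822", "84k"),
    ("Ning", "Wu", "3 Nov 1982", "313-134-9241", "90k"),
    ("Ning", "Wu", "8 Nov 1982", "323-456-3452", "95k"),
]
rows = [dict(zip(columns, r)) for r in data]
base = ["Surname", "GivenName"]
extensions = [[], ["BirthDate"], ["BirthDate", "Phone"]]
for tau in (3, 1, 0):
    print(tau, repair_for_trust(rows, base, "Income", extensions, tau))
```

With 𝛕 = 3 the original FD is kept and three incomes change (trusted FD); with 𝛕 = 1 BirthDate is added and one income changes (equal trust); with 𝛕 = 0 both BirthDate and Phone are added and the data is untouched (trusted data).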
![Page 83: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/83.jpg)
A Unified Cost Model [Chiang and Miller, ICDE 2011]
p Minimum Description Length Principle
n Find a model M w.r.t. Σ that can represent the data as much as possible
p DL(M) = L(M) + L(I|M)
n L(M): Length of the model
n L(I|M): Length of data given M
83
![Page 84: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/84.jpg)
A Unified Cost Model: Data Repair
p M: empty
n L(M) = 0
n L(I|M) = 27
n DL = 27
p M:
n L(M) = 3+2*6 = 15
n L(I|M) = 0
n DL = 15
84
FD: {District, Region → AC}
District Region AC
Brook Granville 412
Brook Granville 412
Brook Granville 412
Brook Granville 553
Brook Granville 553
Brook Granville 553
Brook Granville 725
Brook Granville 725
Brook Granville 725
M: Brook Granville 412
--- 412 --- 412 --- 412 --- 412 --- 412 --- 412
![Page 85: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/85.jpg)
A Unified Cost Model: FD Repair
p M: empty
n L(M) = 0
n L(I|M) = 36
n DL = 36
p M:
n L(M) = 12
n L(I|M) = 0
n DL = 12
85
FD: {Municipal, District, Region → AC}
Municipal District Region AC
Glendale Brook Granville 412
Glendale Brook Granville 412
Glendale Brook Granville 412
Guildwood Brook Granville 553
Guildwood Brook Granville 553
Guildwood Brook Granville 553
Moore Brook Granville 725
Moore Brook Granville 725
Moore Brook Granville 725
M:
Glendale Brook Granville 412
Guildwood Brook Granville 553
Moore Brook Granville 725
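The arithmetic on the two cost-model slides can be reproduced with a small sketch. It assumes one unit of description length per attribute value and a cost of 2 per repaired cell; the factor 2 is read off the slide's "3+2*6" and its interpretation is an assumption. The function names are hypothetical.

```python
def dl_empty_model(rows):
    """With no model, every attribute value of every tuple is encoded."""
    return sum(len(r) for r in rows)

def dl_data_repair(rows, signature, repair_cost=2):
    """Model = one signature plus a repair for each tuple that deviates."""
    repairs = sum(1 for r in rows if r != signature)
    return len(signature) + repair_cost * repairs

def dl_fd_repair(rows):
    """Model = one signature per distinct tuple under the extended FD."""
    return sum(len(s) for s in set(rows))

data_rows = ([("Brook", "Granville", "412")] * 3
             + [("Brook", "Granville", "553")] * 3
             + [("Brook", "Granville", "725")] * 3)
fd_rows = ([("Glendale", "Brook", "Granville", "412")] * 3
           + [("Guildwood", "Brook", "Granville", "553")] * 3
           + [("Moore", "Brook", "Granville", "725")] * 3)

print(dl_empty_model(data_rows), dl_data_repair(data_rows, ("Brook", "Granville", "412")))
print(dl_empty_model(fd_rows), dl_fd_repair(fd_rows))
```

Under these assumptions the data repair lowers DL from 27 to 15, while adding Municipal to the FD lowers DL from 36 to 12 without changing any data.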
![Page 87: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/87.jpg)
Guided Data Repair (GDR) [Yakout et al, VLDB 2011]
87
[Figure: GDR workflow with components DB, ICs, Generate Possible Repairs, Group Possible Repairs, Rank Groups, Verified Updates, and Learning Component]
![Page 88: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/88.jpg)
GDR: Generate Possible Repairs
88
CFD1:
CFD2:
Name SRC STR CT STT ZIP
t1: Jim H1 REDWOODDR MICHIGANCITY MI 46360
t2: Tom H1 REDWOODDR WESTVILLE IN 46360
t3: Jeff H2 BIRCHPARKWAY WESTVILLE IN 46360
t4: Rick H2 BIRCHPARKWAY WESTVILLE IN 46360
t5: Mrk H1 BELLAVENUE FORTWAYNE IN 46391
t6: Mark H1 BELLAVENUE FORTWAYNE IN 46825
t7: Cady H2 BELLAVENUE FORTWAYNE IN 46825
t8: Sindy H2 SHERDENRD FTWAYNE IN 46774
Suggested Update: replace City “FORT WAYNE” with “Westville” in t5
Suggested Update: replace Zip “46391” with “46825” in t5
![Page 89: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/89.jpg)
GDR: Group and Rank Repairs
89
Name SRC STR CT STT ZIP
t1: Jim H1 REDWOODDR MICHIGANCITY MI 46360
t2: Tom H1 REDWOODDR WESTVILLE IN 46360
t3: Jeff H2 BIRCHPARKWAY WESTVILLE IN 46360
t4: Rick H2 BIRCHPARKWAY WESTVILLE IN 46360
t5: Mrk H1 BELLAVENUE FORTWAYNE IN 46391
t6: Mark H1 BELLAVENUE FORTWAYNE IN 46825
t7: Cady H2 BELLAVENUE FORTWAYNE IN 46825
t8: Sindy H2 SHERDENRD FTWAYNE IN 46774
Update Group g1: The city should be “Michigan City” for {t2, t3, t4}.
Update Group g2: The zip should be “46825” for {t5, t8}.
…
Contextual grouping for the suggested updates
![Page 90: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/90.jpg)
KATARA [Chu et al, SIGMOD 2015]
90
A Table of Soccer Players
A B C D E F G
t1 Rossi Italy Rome Verona Italian Proto 1.78
t2 Klate South Africa Pretoria Pirates Afrikaans P. Eliz. 1.69
t3 Pirlo Italy Madrid Juve Italian Flero 1.77
FD: B → C
• Automatic: Produce heuristic repairs
• GDR:
• Rely on redundancy to detect errors
• Require heavy human involvement
Proposal: Use external trustworthy information!
• KBs
• Crowd experts
![Page 91: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/91.jpg)
KATARA Workflow
91
KATARA
INPUT
Trusted KB K
Table T
Pattern Discovery
Algorithm: rank-join
Return: candidate table patterns
Pattern Validation
Algorithm: entropy based scheduling
Return: a table pattern
Data Annotation
Algorithm: Inverted list based approach
Return: annotated data, new facts, top-k repairs
OUTPUT
Table T' (KB validated, Crowd validated)
Enriched KB K'
Possible repairs
![Page 92: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/92.jpg)
Pattern Discovery: Generate Candidates
92
Generate candidate types for every column:
type(B): economy, country, location, state, …
type(C): City, Capital, whole, artifact, Person, …
Generate candidate relationships for every column pair:
relationship(B, C): locatedIn, hasCapital
![Page 93: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/93.jpg)
Crowd Pattern Validation
93
![Page 94: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/94.jpg)
Data Annotation
94
A B C D E F G
t1 Rossi Italy Rome Verona Italian Proto 1.78
t2 Klate South Africa Pretoria Pirates Afrikaans P. Eliz. 1.69
t3 Pirlo Italy Madrid Juve Italian Flero 1.77
![Page 95: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/95.jpg)
Data Repairing
95
G1 has cost 1
G2 has cost 5
A B C D E F G
t1 Rossi Italy Rome Verona Italian Proto 1.78
t2 Klate South Africa Pretoria Pirates Afrikaans P. Eliz. 1.69
t3 Pirlo Italy Madrid Juve Italian Flero 1.77
![Page 97: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/97.jpg)
97
One-Shot Data Cleaning
[Figure: Unclean Database → Deterministic Data Cleaning → Deterministic Database (Single Clean Instance) → RDBMS, which answers Queries with Results]
p Generate a single “trustworthy” instance
![Page 98: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/98.jpg)
98
Model Based Approach
[Figure: Unclean Database → Probabilistic Data Cleaning → Uncertain Database (Multiple Possible Clean Instances, each a Single Clean Instance) → Uncertainty and Cleaning-aware RDBMS, which answers Queries with Probabilistic Results]
p Generate multiple possible clean instances
![Page 99: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/99.jpg)
99
Model Based Approach Challenges
1. The space of all possible repairs is huge
2. How to efficiently generate, store and query the possible repairs
![Page 100: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/100.jpg)
100
Two Example Model Based Approaches
p Duplicate Detection [Beskales et al, VLDB 2009]
n Spaces of Possible Repairs
n Generating and Storing Possible Repairs
n Query Possible Repairs
p Violations of Functional Dependencies [Beskales et al, VLDB 2010]
n Spaces of Possible Repairs
n Sampling from a Meaningful Space of Repairs
![Page 102: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/102.jpg)
102
Typical Data Deduplication
Unclean Relation
ID name ZIP Income
P1 Green 51519 30k
P2 Green 51518 32k
P3 Peter 30528 40k
P4 Peter 30528 40k
P5 Gree 51519 55k
P6 Chuck 51519 30k
[Figure: Compute Pair-wise Similarity (similarity graph over P1 to P6 with edge weights 0.3, 0.5, 0.9, 1.0) → Cluster Similar Records → Merge Clusters (C1, C2, C3)]
Clean Relation
ID name ZIP Income
C1 Green 51519 39k
C2 Peter 30528 40k
C3 Chuck 51519 30k
![Page 103: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/103.jpg)
103
Possible Repairs [Beskales et al, VLDB 2009]
p A possible repair is a clustering(partitioning) of the input tuples
Person
ID Name ZIP Income
P1 Green 51519 30k
P2 Green 51518 32k
P3 Peter 30528 40k
P4 Peter 30528 40k
P5 Gree 51519 55k
P6 Chuck 51519 30k
Uncertain Clustering: Possible Repairs
X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
X2: {P1,P2}, {P3,P4}, {P5}, {P6}
X3: {P1,P2,P5}, {P3,P4}, {P6}
![Page 104: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/104.jpg)
104
Generating Possible Repairs
[Figure: records r1, r2, r3, r4, r5 placed according to their Pair-wise Distance]
Distance Threshold (𝛕): {r1,r2},{r3,r4,r5}
Possible Thresholds [𝛕l, 𝛕u]:
{r1},{r2},{r3},{r4},{r5}
{r1},{r2},{r3},{r4,r5}
{r1,r2},{r3},{r4,r5}
{r1,r2},{r3,r4,r5}
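How a range of thresholds yields several clusterings can be sketched with records placed on a line. The coordinates below are made up so that the four clusterings on the slide appear, and single-linkage clustering is an assumed stand-in for the parameterized clustering algorithm in the cited work.

```python
def cluster(points, tau):
    """Single linkage on a line: split where the gap to the next record exceeds tau."""
    ordered = sorted(points, key=points.get)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if points[cur] - points[prev] <= tau:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters

def possible_repairs(points, tau_low, tau_high):
    """One clustering per sub-range [start, end) of [tau_low, tau_high)."""
    ordered = sorted(points.values())
    gaps = sorted({b - a for a, b in zip(ordered, ordered[1:])})
    # The clustering only changes when tau crosses one of the gaps.
    cuts = [tau_low] + [g for g in gaps if tau_low < g < tau_high] + [tau_high]
    return [((lo, hi), cluster(points, lo)) for lo, hi in zip(cuts, cuts[1:])]

# Hypothetical positions; only the gaps between neighbours matter.
points = {"r1": 0, "r2": 2, "r3": 6, "r4": 9, "r5": 10}
repairs = possible_repairs(points, 0, 4)
for tau_range, clustering in repairs:
    print(tau_range, clustering)
```

Each clustering is generated by a contiguous range of thresholds, which is what lets a repair be stored together with the parameter range that produces it.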
![Page 105: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/105.jpg)
105
Probabilities of Possible Repairs
p The probability of a repair is equal to the probability of the parameter range that generates such repair
[Figure: the Probability Distribution of 𝛕 over [𝛕l, 𝛕u] induces a Probability Distribution of repairs X1, X2, X3, X4]
![Page 106: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/106.jpg)
106
Storing Possible Repairs
p U-Clean Relations
n Each cluster is stored once
n We keep the “lineage” of each cluster
Clustering 1 (1 ≤ 𝛕 < 3): {P1,P2}, {P3,P4}, {P5}, {P6}
Clustering 2 (3 ≤ 𝛕 < 10): {P1,P2,P5}, {P3,P4}, {P6}
Clustering 3 (0 ≤ 𝛕 < 1): {P1}, {P2}, {P3,P4}, {P5}, {P6}
U-clean Relation PersonC
ID … Income C P
CP1 … 31k {P1,P2} [1,3)
CP2 … 40k {P3,P4} [0,10)
CP3 … 55k {P5} [0,3)
CP4 … 30k {P6} [0,10)
CP5 … 39k {P1,P2,P5} [3,10)
CP6 … 30k {P1} [0,1)
CP7 … 32k {P2} [0,1)
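Since each clustering corresponds to a range of 𝛕, its probability is the probability mass of that range. The sketch below assumes 𝛕 is uniformly distributed over [0, 10), which the slides do not state; the ranges are the three shown above, and `repair_probabilities` is a hypothetical name.

```python
def repair_probabilities(ranges, tau_low, tau_high):
    """Share of [tau_low, tau_high) that generates each repair (uniform tau assumed)."""
    width = tau_high - tau_low
    return {name: (hi - lo) / width for name, (lo, hi) in ranges.items()}

ranges = {
    "Clustering 1": (1, 3),
    "Clustering 2": (3, 10),
    "Clustering 3": (0, 1),
}
probs = repair_probabilities(ranges, 0, 10)
print(probs)
```

The same computation applies to a single cluster: under the uniform assumption, a cluster stored with range [0,3) would appear in a repair with probability 0.3.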
![Page 107: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/107.jpg)
107
Example: Projection Query
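SELECT DISTINCT Income FROM PersonC

PersonC

| ID | … | Income | C | P |
|---|---|---|---|---|
| CP1 | … | 31k | {P1,P2} | [1,3) |
| CP2 | … | 40k | {P3,P4} | [0,1) |
| CP3 | … | 55k | {P5} | [0,3) |
| CP4 | … | 30k | {P6} | [3,10) |
| CP5 | … | 40k | {P1,P2,P5} | [3,10) |
| CP6 | … | 30k | {P1} | [0,1) |
| CP7 | … | 32k | {P2} | [0,1) |

Result

| Income | C | P |
|---|---|---|
| 30k | {P1} ∨ {P6} | [0,1) ∨ [3,10) |
| 31k | {P1,P2} | [1,3) |
| 32k | {P2} | [0,1) |
| 40k | {P3,P4} ∨ {P1,P2,P5} | [0,1) ∨ [3,10) |
| 55k | {P5} | [0,3) |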
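The projection on this slide can be mimicked with a short Python sketch. It assumes, based only on the example, that rows with the same Income are collapsed into one result row whose lineage is the disjunction of the contributing clusters and whose parameter range is the union of their ranges. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict

# (ID, Income, cluster C, parameter range P) copied from PersonC on this slide.
PERSON_C = [
    ("CP1", "31k", ("P1", "P2"), (1, 3)),
    ("CP2", "40k", ("P3", "P4"), (0, 1)),
    ("CP3", "55k", ("P5",), (0, 3)),
    ("CP4", "30k", ("P6",), (3, 10)),
    ("CP5", "40k", ("P1", "P2", "P5"), (3, 10)),
    ("CP6", "30k", ("P1",), (0, 1)),
    ("CP7", "32k", ("P2",), (0, 1)),
]


def project_distinct_income(rows):
    """Group rows by Income; keep a disjunction of clusters and a list of ranges."""
    grouped = defaultdict(list)
    for _id, income, cluster, param_range in rows:
        grouped[income].append((param_range, cluster))
    result = {}
    for income, entries in grouped.items():
        entries.sort()  # order the alternatives by parameter range
        clusters = [cluster for _, cluster in entries]
        ranges = [param_range for param_range, _ in entries]
        result[income] = (clusters, ranges)
    return result


result = project_distinct_income(PERSON_C)
for income in sorted(result):
    clusters, ranges = result[income]
    print(income, " v ".join(str(c) for c in clusters), ranges)
```

Each value in a list stands for one alternative in the disjunction, so the 30k row carries two alternatives, {P1} for 𝛕 in [0,1) and {P6} for 𝛕 in [3,10).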
![Page 108: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/108.jpg)
Big Data Cleaning Challenges
p Volume
n Distributed Data Cleaning
n Sample Clean
p Velocity
n Incremental Data Cleaning
p Variety
n Graph/JSON/RDF
n Text
108
![Page 109: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/109.jpg)
Distributed Data Deduplication [Chu et al, VLDB 2016]
p Data deduplication in a data lake setting
n A shared-nothing environment
n Need to compare every tuple pair
p The goal is to minimize
n Largest communication cost
n Largest computation cost
109
[Figure: a triangular grid of cells numbered 1 to 21, in rows of 6, 5, 4, 3, 2, and 1 cells, with tuples ti and tj marked.]
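The slide does not spell out how tuple pairs are assigned to workers, so the following is only a sketch under stated assumptions. It arranges 21 hypothetical workers in a triangle like the figure, gives each tuple an "anchor" index (round-robin here, purely for reproducibility), sends a tuple to every worker whose row or column matches its anchor, and compares each pair at the single worker indexed by the two anchors. It then measures the two quantities the slide wants to minimize: the largest number of tuples any worker receives and the largest number of pairs any worker compares.

```python
from collections import defaultdict
from itertools import combinations


def triangle_workers(side):
    """Number the cells (p, q), p <= q, of a triangle row by row, starting at 1."""
    cells = [(p, q) for p in range(side) for q in range(p, side)]
    return {cell: number for number, cell in enumerate(cells, start=1)}


def distribute(num_tuples, side):
    """Assign tuples and tuple pairs to workers (assumed triangle-style scheme)."""
    workers = triangle_workers(side)
    anchor = {t: t % side for t in range(num_tuples)}  # assumed round-robin anchors
    received = defaultdict(set)    # worker number -> tuples sent to it
    for t, a in anchor.items():
        for (p, q), number in workers.items():
            if a in (p, q):
                received[number].add(t)
    compared = defaultdict(list)   # worker number -> pairs it compares
    for ti, tj in combinations(range(num_tuples), 2):
        cell = tuple(sorted((anchor[ti], anchor[tj])))
        compared[workers[cell]].append((ti, tj))
    return workers, received, compared


workers, received, compared = distribute(num_tuples=12, side=6)
max_communication = max(len(tuples) for tuples in received.values())
max_computation = max(len(pairs) for pairs in compared.values())
print(len(workers), max_communication, max_computation)
```

In this sketch every pair is compared exactly once, and both tuples of a pair are always present at the worker that compares them, so no worker needs data from another worker.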
![Page 110: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/110.jpg)
Conclusion and References
p Error Detection
n What (IC Languages and Discovery)
p P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In Proceedings of the 23rd International Conference on Data Engineering, pages 746–755, 2007.
p X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. Proceedings of the VLDB Endowment, 6(13):1498–1509, 2013.
p W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. Proceedings of the VLDB Endowment, 2(1):407–418, 2009.
p N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In Proceedings of the 25th International Conference on Data Engineering, pages 1275–1278, 2009.
p G. Fan, W. Fan, and F. Geerts. Detecting errors in numeric attributes. In Web-Age Information Management, pages 125–137. Springer, 2014.
p W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. Proceedings of the VLDB Endowment, 3(1-2):173–184, 2010.
p J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 457–468. ACM, 2014.
p M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In 31st IEEE International Conference on Data Engineering, 2015.
p Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. Computer Journal, 42(2):100–111, 1999.
p C. M. Wyss, C. Giannella, and E. L. Robertson. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In International Conference on Big Data Analytics and Knowledge Discovery, pages 101–110, 2001.
110
![Page 111: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/111.jpg)
Conclusion and References
p Error Detection
n How (Human involvement)
p X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In 29th IEEE International Conference on Data Engineering, pages 458–469, 2013.
p J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012.
n Where (Analytics Layer)
p A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 445–456, 2014.
p A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 505–516, 2011.
p X. Wang, X. Dong, and A. Meliou. Data X-Ray: A Diagnostic Tool for Data Errors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of data, pages 1231–1245, 2015.
p M. Bergman, T. Milo, S. Novgorodov, and W. Tan. QOCO: A Query Oriented Data Cleaning System with Oracles. Proceedings of the VLDB Endowment, 8(12):1900–1903, 2015.
111
![Page 112: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/112.jpg)
Conclusion and References
p Error Repairing
n What (Data or Data & Rule)
p P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 143–154. ACM, 2005.
p X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In 29th IEEE International Conference on Data Engineering, pages 458–469, 2013.
p G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In 29th IEEE International Conference on Data Engineering, pages 541–552, 2013.
p F. Chiang and R. J. Miller. A unified model for data and constraint repair. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, pages 446–457, 2011.
n How (Human Involvement)
p M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. Proceedings of the VLDB Endowment, 4(5):279–289, 2011.
p X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1247–1261, 2015.
112
![Page 113: Qualitative Data Cleaningxchu33/chu-papers/... · Many Technical Challenges p Record Linkage and Deduplication n Similarity measures n Machine learning for classifying pairs as duplicates](https://reader034.fdocuments.us/reader034/viewer/2022042308/5ed448edbda1e66ddf55260f/html5/thumbnails/113.jpg)
Conclusion and References
p Error Repairing
n Where (Model-based)
p G. Beskales, M. A. Soliman, I. F. Ilyas, and S. Ben-David. Modeling and querying possible repairs in duplicate detection. Proceedings of the VLDB Endowment, pages 598–609, 2009.
p G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. Proceedings of the VLDB Endowment, 3(1-2):197–207, 2010.
p Taxonomy
n I. F. Ilyas and X. Chu. Trends in Cleaning Relational Data: Consistency and Deduplication. In Foundations and Trends® in Databases, Volume 5, Issue 4, 2015.
113