A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing...

68
A Machine Learning Perspective on Managing Noisy Data Theodoros Rekatsinas | UW-Madison @thodrek

Transcript of A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing...

Page 1: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

A Machine Learning Perspective on Managing Noisy Data

Theodoros Rekatsinas | UW-Madison@thodrek

Page 2: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data-hungry applications are taking over

Page 3: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

• Noisy measurements

• Sensor failures

Data errors are everywhere

Page 4: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

• Uncertain extractions • Semantic ambiguity

Data errors are everywhere

Page 5: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

• Adversarial examples

Data errors are everywhere

Page 6: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

• Human errors • Machine failures • Code bugs

Data errors are everywhere

Page 7: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Achilles’ Heel of Modern Analytics

is low quality, erroneous data

Page 8: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Achilles’ Heel of Modern Analytics

is low quality, erroneous data

Cleaning and organizing the data comprises 60% of the time spent on an analytics or AI project.

Page 9: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Stanford’s Snorkel: A System for Fast Training Data CreationGoogle’s TFX: TensorFlow Data ValidationAmazon’s SageMakerAmazon’s Deequ: Data Quality Validation for ML PipelinesHoloClean: Weakly-supervised data cleaning

The Achilles’ Heel of Modern Analytics

is low quality, erroneous dataMany modern data management systems are

being developed to address aspects of this issue:

Page 10: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Question:What is an appropriate (formal) framework

for managing noisy data?

Things to consider: Simplicity and generality

Page 11: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Talk outline

• Managing Noisy Data (Background)

• The Probabilistic Unclean Databases (PUDs) Framework

• From Theory to Systems

Page 12: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Managing Noisy Data

Page 13: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

A simple example of noisy data

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Conflicts

ConflictDoes not obeydata distribution

Page 14: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

A simple example of noisy data

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Conflicts

ConflictDoes not obeydata distribution

Computational problems: Detect errors, repair errors, compute “consistent” query answers.

Page 15: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The case for inconsistent data

An example unclean database J

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

• Errors correspond to tuples/cells that introduce inconsistencies (violations of integrity constraints).

• Inconsistencies are typical in data integration, extract-load-transform workloads, etc.

• Data repairs: A theoretical framework for coping with inconsistent databases [Arenas et al. 1999]

Page 16: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

Slide by Phokion Kolaitis[SAT 2016]

Page 17: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

Slide by Phokion Kolaitis[SAT 2016]

Plethora of fundamental results on tractability of repair-checking and consistent query answering.

Page 18: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

Limited adoption in practice.

Slide by Phokion Kolaitis[SAT 2016]

Plethora of fundamental results on tractability of repair-checking and consistent query answering.

Page 19: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

Page 20: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Minimal subset repair: We remove t1

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

An example repaired database I

Page 21: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Minimal subset repair: We remove t1

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

Errors remain: (1) Cicago should clearly be Chicago(2) Non-obvious errors: 60609 is the wrong Zip

Page 22: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Minimal subset repair: We remove t1

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

Errors remain: (1) Cicago should clearly be Chicago(2) Non-obvious errors: 60609 is the wrong Zip

Several variations of minimal repairs. E.g., update the minimum

number of cells.

Page 23: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Minimal data repairs

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

Minimal subset repair: We remove t1

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

Errors remain: (1) Cicago should clearly be Chicago(2) Non-obvious errors: 60609 is the wrong Zip

Minimality can be used as an operational principle to prioritize repairs but these repairs are not necessarily correct with respect to the ground truth.

Several variations of minimal repairs. E.g., update the minimum

number of cells.

Page 24: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The case for most probable data [Gribkoff et al., 14]

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

p

0.9

0.4

0.4

0.8

Most probable world, conditioned on integrity constraint satisfaction

Page 25: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

p

0.9

0.4

0.4

0.8

Factor (f)

1 - 0.9

0.4

0.4

0.8

maxI ∏

t∈I

p(t)∏t∉I

(1 − p(t))

Optimization Objective

The case for most probable data [Gribkoff et al., 14]

Page 26: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName

c1: DBAName ! Zip

c2: Zip ! City, State

c3: City, State, Address ! Zip

p

0.9

0.4

0.4

0.8

Factor (f)

0.9

1 - 0.4

1 - 0.4

1 - 0.8

maxI ∏

t∈I

p(t)∏t∉I

(1 − p(t))

Optimization Objective

The case for most probable data [Gribkoff et al., 14]

Page 27: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Most probable repairs

Probabilities offer clear semantics than minimality. Fundamental question: How do we know p?

t2

t4

t1

t3

DBAName

John Veliotis Sr.

Johnnyo’s

John Veliotis Sr.

John Veliotis Sr.

Zip

60609

60608

60608

60609

3465 S Morgan ST ILJohnnyo’s Cicago

Johnnyo’s 3465 S Morgan ST ILChicago

Johnnyo’s ILChicago3465 S Morgan ST

Chicago3465 S

Morgan STJohnnyo’s IL

StateCityAddressAKAName p

0.9

0.4

0.4

0.8

Factor (f)

1 - 0.9

0.4

0.4

0.8

maxI ∏

t∈I

p(t)∏t∉I

(1 − p(t))Optimization Objective

Page 28: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Probabilistic Unclean DatabasesChristopher De Sa, Ihab Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas, ICDT 2019

Page 29: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The case of a noisy channel for data

Noisy Channel Model

1. We see an observation x in the noisy world

2. Find the correct world w

Applications: Speech, OCR, Spelling correction, Part of speech tagging, machine translations, etc…

w = arg maxw∈W

P(w |x)

Noisy Channel

Clean Source Data Observed Data with Errors

Page 30: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Noisy Channel

Clean Source Data Observed Data with Errors

Page 31: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Noisy Channel

Clean Intended Database I

Observed Data with Errors

Intension Probabilistic

Data Generator

Page 32: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Clean Intended Database I

Observed Unclean Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

Page 33: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Intension Probabilistic

Data Generator

A Probability Distribution

Component 1: Probability over tuple-values in I

Component 2: Logical constraints bias towards

consistency of tuples in I

Page 34: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database ModelA Conditional Probability Distribution

Example: Exponential

Family Realizer

Captures the conditional probability of data edits and

transformationsRealizer Probabilistic

Noise Generator (Noisy Channel)

RI(J) = Pr(J |I)

R[i, t](t0) =1

Z(t)exp

0

@X

g2G

wg · g(t, t0)

1

A

with t 2 I, t0 2 J and G is a set of features

where each g is an arbitrary function over (t, t0)

and each weight wg is a real number.

Probability of the i'th record of I changing from t to t'

Page 35: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Example PUD instances

PUD Example 1: Parfactor/Subset PUD

Models the generation of duplicate data

Prob. of extra tuples in J

Prob. of no-tuples

Page 36: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Example PUD instances

PUD Example 2: Parfactor/Update PUD

Models errors due to transformations (e.g., typos)

Prob. of edits present in J

Page 37: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Computational Problems in the PUD framework

Page 38: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Clean Intended Database I

Observed Unclean Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

Input: We only observe this

Page 39: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Clean Intended Database I

Observed Unclean Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

Input: We only observe this

Problem 1: If we knew the Intension

and the Realizer can we recover I?

Output: An estimate of the most probable I

Page 40: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Clean Intended Database I

Observed Unclean Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

Input: We only observe this

Problem 1: If we knew the Intension

and the Realizer can we recover I?

Output: An estimate of the most probable I

Problem 2: Given J can we answer a query

on I correctly? Output: Pr(a ∈ Q(I) |J)

Page 41: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database Model

Clean Intended Database I

Observed Unclean Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

Input: We only observe this

Problem 1: If we knew the Intension

and the Realizer can we recover I?

Output: An estimate of the most probable I

Problem 2: Given J can we answer a query

on I correctly? Output: Pr(a ∈ Q(I) |J)

Problem 3: Can we learn the Intension and the Realizer? Can we do that from J (i.e., without any training data)?

Output: An estimate for the Intension and the Realizer

Page 42: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data CleaningProblem Statement: Given the observed noisy database instance J, compute the Most Likely intended database instance I.

We show that PUDs generalize existing frameworks: - MLI in parfactor/subset PUDs generalizes cardinality repairs - MLI in parfactor/update PUDs generalizes min-update repairs

Question: How does data cleaning in PUDs compare to existing frameworks?

Page 43: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data CleaningProblem Statement: Given the observed noisy database instance J, compute the Most Likely intended database instance I.

Question: Is data cleaning in the PUD framework efficient?In general no. It is equivalent to probabilistic inference. However:- For parfactor/subset PUDs with key constraints (i.e., when errors are limited to

duplicates) MLI can be computed in polynomial time.- New result: Approximate inference algorithm with guarantees on expected Hamming Error w.r.t. I; uniform noise model [Heidari, Ilyas, Rekatsinas UAI 2019.]

Page 44: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Setup (with noise): • known graph G = (V,E)• unknown labeling X: V -> {1, 2, …, k}• given noisy parity of each edge• flipped with probability p

• given noisy observations for each node• altered with probability q

Goal: (approximately) recover X.Formally: want an algorithm A that finds a labeling that minimizes the worst-case expected Hamming error:

Approximate Inference in Structured Instances with Noisy Categorical Observations

maxX

{EL∼D(X)[error(X, X)]}

X

Page 45: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

New Algorithm: New approximate inference algorithm based on tree decompositions and correlation clustering.

Guarantees on worst-case expected Hamming error:

• For trees, the Hamming error is upper bounded by

• For low-treewidth graphs, the Hamming error is upper bounded by

Approximate Inference in Structured Instances with Noisy Categorical Observations

O(log(k) ⋅ p ⋅ n)

O(k ⋅ log(k) ⋅ p⌈ Δ(G)2 ⌉ ⋅ n)

Page 46: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

New Algorithm: New approximate inference algorithm based on tree decompositions and correlation clustering.

Guarantees on worst-case expected Hamming error:

• For trees, the Hamming error is upper bounded by

• For low-treewidth graphs, the Hamming error is upper bounded by

Approximate Inference in Structured Instances with Noisy Categorical Observations

O(log(k) ⋅ p ⋅ n)

O(k ⋅ log(k) ⋅ p⌈ Δ(G)2 ⌉ ⋅ n)

It should be for

the edge side information to be useful for statistical recovery.

p <1

k log k

Page 47: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

PUD learningProblem Statement: Assume a parametric representation of the Intention and the Realizer. We want to find the maximum likelihood estimates for the parameters of these representations.

Supervised variant: We are given examples of both unclean databases and their clean versions. Unsupervised variant: We are given only unclean databases.

Question: Can we learn a PUD? Can we do so without any training data?

- We show standard learnability results for supervised variant- More interesting result: We show that in the uniform noise model and under tuple independence we can learn a PUD without any training data when the noise is bounded. Single instance J decomposes to multiple training examples. Under bounded noise the log-likelihood is convex.

Page 48: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

From Theory to SystemsIs the PUDs framework useful in practice?

Page 49: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

HoloClean: Probabilistic Data Repairs

Reference: HoloClean: Holistic Data Repairs with Probabilistic InferenceRekatsinas, Chu, Ilyas, Ré, VLDB 2017

HoloClean is the first practical probabilistic data repairing engine and a state-of-the-art data repairing system

HoloClean’s factor-graph model is an instantiation of the PUDs Intention model.

HoloClean uses clean cells as training data to learn its PUD Intention model and uses the learned model to approximate MLI repairs.

Page 50: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Challenge: Inference under constraints is #P-complete

Applying probabilistic inference naively does not scale to data cleaning instances with millions of tuples

Idea 1: Prune domain of random variables.

Idea 2: Relax constraints over sets of random variables to features over independent random variables.

HoloClean: Probabilistic Data Repairs

Page 51: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

t1.City t1.Zip

t4.City t4.Zip

w1

w1

w2

w2

w3

“Address=3465 S

Morgan St”

t2

t4

t1

t3

Zip

60609

60608

60608

60609

3465 S Morgan ST ILCicago

3465 S Morgan ST ILChicago

ILChicago3465 S Morgan ST

Chicago3465 S

Morgan ST IL

StateCityAddress

“Zip -> City”“Address=

3465 S Morgan St”

Relaxing constraints

Page 52: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

t1.City

t4.City

w1

w1

w3’

“Address=3465 S

Morgan St”

t2

t4

t1

t3

Zip

60609

60608

60608

60609

3465 S Morgan ST ILCicago

3465 S Morgan ST ILChicago

ILChicago3465 S Morgan ST

Chicago3465 S

Morgan ST IL

StateCityAddress

“Assignment Chicago violates Zip -> City

due to t4”

w3’“Assignment Cicago violates Zip -> City

due to t1”

We have one relaxed factor for each value in the domain of the RV

Relaxing constraints

Page 53: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

t1.Zip

t4.Zip

w2

w2

w4’

“Address=3465 S

Morgan St”

t2

t4

t1

t3

Zip

60609

60608

60608

60609

3465 S Morgan ST ILCicago

3465 S Morgan ST ILChicago

ILChicago3465 S Morgan ST

Chicago3465 S

Morgan ST IL

StateCityAddress

“Assignment 60608 violates Zip -> City

due to t4”

w4’“Assignment 60609 violates Zip -> City

due to t1”

We have one relaxed factor for each value in the domain of the RV

Relaxing constraintsRelaxing constraints

Page 54: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Relaxing constraintsHoloClean in practice

HoloClean: our approach combining all signals and using inferenceHolistic[Chu,2013]: state-of-the-art for constraints & minimalityKATARA[Chu,2015]: state-of-the-art for external dataSCARE[Yakout,2013]: state-of-the-art ML & qualitative statistics

Competing methods do not scale or perform correct repairs.

Page 55: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Full Relaxed

F1-s

core

00.20.40.60.8

More domain pruning (lowers recall, increases precision)

F1-score for Full vs Relaxed Model

Full Relaxed

Run

time

(sec

)

0

1000

2000

More domain pruning (lowers recall, increases precision)

Runtime for Full vs Relaxed Model

Faster compilation, learning, and inference when we prune the RV domain

Relaxing constraints

Page 56: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Full Relaxed

F1-s

core

00.20.40.60.8

More domain pruning (lowers recall, increases precision)

F1-score for Full vs Relaxed Model

Increased robustness (more accurate repairs) when RV domain is ill-specified (no heavy pruning used)

Full Relaxed

Run

time

(sec

)

0

1000

2000

More domain pruning (lowers recall, increases precision)

Runtime for Full vs Relaxed Model

Relaxing constraints

Page 57: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data Augmentation for Error Detection

Error Detection with Data Augmentation

Transformation and Policy Learning

Data Augmentation using Policy Π

Augmentation Transformations Φ

and Policy Π

Data Augmentation Module

Cell Value Representation

Module

t1t2t3tN, City Chicago ILChicago IL

t1, City CicagoCicagoPortert1, Business ID EVP Cofee

Transformed Value

Observed Value

Cell

Augmented Training Dataset

Model Training and Classification

Module

IN: D, T, Σ Cell Representationand Labels

IN: D

IN: T

HoloDetect learns a PUD realizer and uses the learned realizer to generate synthetic training data to teach a deep neural network how to detect erroneous values.

Reference: HoloDetect: A Few-Shot Learning Framework for Error DetectionHeidari, McGrath, Ilyas, Rekatsinas, SIGMOD 2019

Page 58: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Error Detection: • Binary classification: for each cell decide

if it’s erroneous or correct.• Severe imbalance, high heterogeneity.• Assumption: Easy for human annotators

to provide examples of correct tuples.Challenge: How can we obtain labeled data while minimizing the input from human annotators?

Data Augmentation for Error Detection

Page 59: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data Augmentation for Error Detection

Approach: Analyze the input dataset and learn how errors are introduced (learn a noisy channel). Use the clean tuples as seeds and introduce

artificial erroneous examples that obey the distribution of the noisy channel.

Program Synthesis: Learn a program to introduce errods

Page 60: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data Augmentation for Error Detection

Page 61: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Data Augmentation for Error Detection

Approach: Train a classifier to identify errors in the input data set

Page 62: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

HoloDetect requires fewer training examples than competing approaches

Data Augmentation for Error Detection

Page 63: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

AutoFD: Functional Dependency Discovery via Structure Learning

FD discovery as a structure learning problem over a linear structured model

Lifted-variation of structure learning using sparse regression (L1-regularization).2x F1 improvement over state-of-the-art (included non-lifted structure learning methods).

Guarantees on FD discovery under a weak Realizer (bounded noise).

Graft

DBAName

Harry Caray’s

Pierrot

3435 W Washington

Chicago IL835 N Michigan

60608

60611

835 N Michigan Av

Mity Nice Bar

State

60612

Chicago

Address

60611

Foodlife IL

60612

835 N Michigan Av

City

IL

Zip Code

IL3493 Washington

IL

Chicago

Cicago

Input Noisy DatasetStructure Learning - Estimate the inverse covariance matrix of lifted model. - Fit a linear model by decomposing the estimated inverse covariance

Page 64: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

AutoFD provides insights for downstream data preparation tasks

Effective feature engineering

Page 65: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

Provides insights on the effectiveness of automated data cleaning.

AutoFD provides insights for downstream data preparation tasks

Ex: increased imputation accuracy for

attributes w. dependencies (in the

output of AutoFD)

Page 66: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database ModelClean Intended

Database IObserved Unclean

Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

A formal noisy channel model that leads to new insights for managing noisy data and has immediate

practical applications to data cleaning systems.

- HoloClean - AutoFD - HoloDetect

Page 67: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database ModelClean Intended

Database IObserved Unclean

Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

A formal noisy channel model that leads to new insights for managing noisy data and has immediate practical applications

to data cleaning systems and exciting connections to robust ML.

- HoloClean - AutoFD - HoloDetect

Page 68: A Machine Learning Perspective on Managing Noisy DataA Machine Learning Perspective on Managing Noisy Data ... • Human errors • Machine failures • Code bugs Data errors are everywhere.

The Probabilistic Unclean Database ModelClean Intended

Database IObserved Unclean

Database J

Intension Probabilistic

Data Generator

Realizer Probabilistic

Noise Generator (Noisy Channel)

- HoloClean - AutoFD - HoloDetect

Thank you! [email protected]

A formal noisy channel model that leads to new insights for managing noisy data and has immediate practical applications

to data cleaning systems and exciting connections to robust ML.