
A Machine Learning Perspective on Managing Noisy Data

Theodoros Rekatsinas | UW-Madison | @thodrek

Data-hungry applications are taking over

Data errors are everywhere:

• Noisy measurements
• Sensor failures
• Uncertain extractions
• Semantic ambiguity
• Adversarial examples
• Human errors
• Machine failures
• Code bugs

The Achilles’ Heel of Modern Analytics is low-quality, erroneous data.

Cleaning and organizing the data comprises 60% of the time spent on an analytics or AI project.

Many modern data management systems are being developed to address aspects of this issue:

• Stanford’s Snorkel: a system for fast training data creation
• Google’s TFX: TensorFlow Data Validation
• Amazon’s SageMaker
• Amazon’s Deequ: data quality validation for ML pipelines
• HoloClean: weakly supervised data cleaning

Question: What is an appropriate (formal) framework for managing noisy data?

Things to consider: simplicity and generality.

Talk outline

• Managing Noisy Data (Background)

• The Probabilistic Unclean Databases (PUDs) Framework

• From Theory to Systems

Managing Noisy Data

A simple example of noisy data

Integrity constraints (functional dependencies):
c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

|    | DBAName           | AKAName   | Address          | City    | State | Zip   |
|----|-------------------|-----------|------------------|---------|-------|-------|
| t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
| t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
| t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
| t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

Conflicts: t1 conflicts with t2 and t3 on c1 (same DBAName, different Zip) and with t4 on c2 (same Zip, different City); the value “Cicago” also does not obey the data distribution.


Computational problems: Detect errors, repair errors, compute “consistent” query answers.
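These violations are mechanical to check. The sketch below (our own toy code, not part of any system in the talk) encodes the table and constraints c1–c3 and lists the tuple pairs that jointly violate an FD:

```python
from itertools import combinations

# The example table: tid -> (DBAName, AKAName, Address, City, State, Zip)
rows = {
    "t1": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60608"),
    "t2": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t3": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t4": ("Johnnyo's", "Johnnyo's", "3465 S Morgan ST", "Cicago", "IL", "60608"),
}
ATTRS = ("DBAName", "AKAName", "Address", "City", "State", "Zip")

# Functional dependencies c1-c3 as (left-hand side, right-hand side) attribute lists.
FDS = {
    "c1": (("DBAName",), ("Zip",)),
    "c2": (("Zip",), ("City", "State")),
    "c3": (("City", "State", "Address"), ("Zip",)),
}

def project(row, attrs):
    return tuple(row[ATTRS.index(a)] for a in attrs)

def violations(rows, fds):
    """Return (constraint, tid, tid) triples of tuple pairs that violate an FD."""
    out = []
    for name, (lhs, rhs) in fds.items():
        for (i, r1), (j, r2) in combinations(rows.items(), 2):
            if project(r1, lhs) == project(r2, lhs) and project(r1, rhs) != project(r2, rhs):
                out.append((name, i, j))
    return out

print(violations(rows, FDS))
# Every violation involves t1: c1 with t2/t3, c2 with t4, c3 with t2/t3.
```

Every violation involves t1, which foreshadows the minimal repair discussed next.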

The case for inconsistent data

An example unclean database J: the table above, with constraints c1–c3.

• Errors correspond to tuples/cells that introduce inconsistencies (violations of integrity constraints).

• Inconsistencies are typical in data integration, extract-load-transform workloads, etc.

• Data repairs: A theoretical framework for coping with inconsistent databases [Arenas et al. 1999]

Minimal data repairs (slide by Phokion Kolaitis [SAT 2016])

There is a plethora of fundamental results on the tractability of repair checking and consistent query answering, yet limited adoption in practice.

Consider again the unclean database J and constraints c1–c3. A minimal subset repair removes t1; the result is an example repaired database I.

Errors remain after this repair: (1) “Cicago” should clearly be “Chicago”; (2) non-obvious errors: 60609 is the wrong Zip.

There are several variations of minimal repairs, e.g., updating the minimum number of cells instead of deleting tuples.

Minimality can be used as an operational principle to prioritize repairs, but these repairs are not necessarily correct with respect to the ground truth.
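To make this concrete, here is a brute-force sketch of a minimal subset repair (our own toy code, reusing the `rows`, `FDS`, and `violations` helpers from the earlier snippet; real repair algorithms avoid the exponential enumeration):

```python
from itertools import combinations

def is_consistent(subset_ids):
    """True if the restriction of the table to subset_ids satisfies all FDs."""
    return not violations({i: rows[i] for i in subset_ids}, FDS)

def minimal_subset_repair(rows):
    """Delete the fewest tuples needed to restore consistency (brute force)."""
    ids = list(rows)
    for keep in range(len(ids), -1, -1):        # prefer keeping as many tuples as possible
        for subset in combinations(ids, keep):
            if is_consistent(subset):
                return set(ids) - set(subset)   # the tuples to delete
    return set(ids)

print(minimal_subset_repair(rows))  # {'t1'}: deleting t1 resolves every conflict
```

Yet, as noted above, the repair that deletes t1 still leaves “Cicago” and the wrong Zip 60609 in place.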

The case for most probable data [Gribkoff et al., 14]

Attach a probability to every tuple of the example table: p(t1) = 0.4, p(t2) = 0.9, p(t3) = 0.8, p(t4) = 0.4. We now look for the most probable world, conditioned on integrity constraint satisfaction (c1–c3).

Optimization objective: max_I ∏_{t∈I} p(t) · ∏_{t∉I} (1 − p(t)), over worlds I that satisfy the constraints.

Each candidate world is scored by a product of factors f: for example, the world that drops only t2 contributes (1 − 0.9) · 0.4 · 0.8 · 0.4, while the world that keeps only t2 contributes 0.9 · (1 − 0.4) · (1 − 0.8) · (1 − 0.4).
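The objective is easy to evaluate exhaustively on the running example. The sketch below (our own, reusing `rows`, `FDS`, and `is_consistent` from the earlier snippets) enumerates all worlds and returns the most probable consistent one:

```python
from itertools import combinations

p = {"t1": 0.4, "t2": 0.9, "t3": 0.8, "t4": 0.4}   # tuple probabilities from the slide

def world_probability(world):
    """p(t) for kept tuples times 1 - p(t) for dropped ones."""
    prob = 1.0
    for t in rows:
        prob *= p[t] if t in world else 1.0 - p[t]
    return prob

def most_probable_world(rows):
    """Brute-force argmax over consistent worlds (2^n worlds; fine for 4 tuples)."""
    best, best_p = None, -1.0
    ids = list(rows)
    for k in range(len(ids) + 1):
        for world in combinations(ids, k):
            if is_consistent(world) and world_probability(world) > best_p:
                best, best_p = world, world_probability(world)
    return best, best_p

print(most_probable_world(rows))  # (('t2', 't3'), ≈0.2592)
```

Note how the semantics differ from minimality: the most probable consistent world drops not only t1 but also the low-probability tuple t4.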


Most probable repairs

Probabilities offer clearer semantics than minimality. Fundamental question: how do we know p?


Probabilistic Unclean Databases
Christopher De Sa, Ihab Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas. ICDT 2019.

The case of a noisy channel for data

Noisy Channel Model:
1. We see an observation x in the noisy world.
2. Find the correct world w: w* = arg max_{w∈W} P(w | x).

Applications: speech recognition, OCR, spelling correction, part-of-speech tagging, machine translation, etc.

Noisy Channel: Clean Source Data → Observed Data with Errors.
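A toy instance of the noisy channel (our own illustration; the vocabulary, prior, and channel parameters are invented): Bayes’ rule turns arg max_w P(w | x) into arg max_w P(x | w) · P(w), here for single-word spelling correction:

```python
# Toy noisy-channel spelling corrector: w* = argmax_w P(w|x) ∝ P(x|w) P(w).
PRIOR = {"chicago": 0.9, "cicero": 0.1}  # P(w): hypothetical source-word prior

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def channel(x, w, per_edit=0.01):
    """P(x|w): a crude channel that charges per_edit for every edit."""
    return per_edit ** edit_distance(x, w)

def correct(x):
    return max(PRIOR, key=lambda w: channel(x, w) * PRIOR[w])

print(correct("cicago"))  # 'chicago': one edit away from the high-prior word
```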

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator, i.e., the noisy channel) → Observed Unclean Database J

The Intension is a probability distribution with two components:

• Component 1: a probability over the tuple values in I.
• Component 2: logical constraints that bias towards consistency of the tuples in I.

The Realizer is a conditional probability distribution that captures the conditional probability of data edits and transformations: R_I(J) = Pr(J | I).

Example: an exponential-family Realizer. The probability of the i-th record of I changing from t to t′ is

R[i, t](t′) = (1 / Z(t)) · exp( Σ_{g∈G} w_g · g(t, t′) ),

with t ∈ I and t′ ∈ J, where G is a set of features, each g is an arbitrary function over (t, t′), and each weight w_g is a real number.
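A minimal sketch of such a Realizer (our own; the feature functions g and weights w_g below are invented for illustration): it exponentiates a weighted sum of features of the pair (t, t′) and normalizes by Z(t) over a set of candidate values:

```python
import math

# Hypothetical feature functions g(t, t_prime) for a single City attribute.
FEATURES = [
    lambda t, tp: 1.0 if t == tp else 0.0,    # identity: no edit happened
    lambda t, tp: -abs(len(t) - len(tp)),     # penalty for length-changing edits
]
WEIGHTS = [2.0, 1.0]  # w_g: invented weights; in a real PUD these are learned

def realizer(t, candidates):
    """R[i, t](t'): softmax over candidate corrupted values of the clean value t."""
    scores = {tp: math.exp(sum(w * g(t, tp) for w, g in zip(WEIGHTS, FEATURES)))
              for tp in candidates}
    z = sum(scores.values())                  # Z(t): the partition function
    return {tp: s / z for tp, s in scores.items()}

print(realizer("Chicago", ["Chicago", "Cicago", "Chicag"]))
# 'Chicago' keeps most of the mass; one-character corruptions share the rest
```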

Example PUD instances

PUD Example 1: the parfactor/subset PUD. It models the generation of duplicate data; its Realizer is parameterized by the probability of extra tuples appearing in J and the probability of tuples not appearing at all.

PUD Example 2: the parfactor/update PUD. It models errors due to transformations (e.g., typos); its Realizer is parameterized by the probability of the edits present in J.
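To make the two noise models concrete, here is a toy generative sketch (our own, with made-up parameter values): a subset-style Realizer that drops or duplicates tuples, and an update-style Realizer that injects crude typos:

```python
import random

def subset_realizer(I, p_dup=0.2, p_drop=0.1):
    """Parfactor/subset-style noise: each tuple may be duplicated or dropped."""
    J = []
    for t in I:
        if random.random() < p_drop:
            continue                       # tuple missing from J
        J.append(t)
        if random.random() < p_dup:
            J.append(t)                    # extra (duplicate) tuple in J
    return J

def update_realizer(I, p_typo=0.1):
    """Parfactor/update-style noise: each character may be deleted (a crude typo)."""
    def corrupt(v):
        return "".join(c for c in v if random.random() >= p_typo)
    return [tuple(corrupt(v) for v in t) for t in I]

random.seed(0)
I = [("Chicago", "IL", "60608"), ("Chicago", "IL", "60609")]
print(subset_realizer(I))
print(update_realizer(I))
```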

Computational Problems in the PUD framework

The Probabilistic Unclean Database Model: Intension → Clean Intended Database I → Realizer → Observed Unclean Database J.

Input: we only observe the unclean database J.

Problem 1: If we knew the Intension and the Realizer, can we recover I? Output: an estimate of the most probable I.

Problem 2: Given J, can we answer a query on I correctly? Output: Pr(a ∈ Q(I) | J).

Problem 3: Can we learn the Intension and the Realizer? Can we do that from J (i.e., without any training data)? Output: an estimate for the Intension and the Realizer.

Data Cleaning

Problem statement: Given the observed noisy database instance J, compute the most likely intended database instance I (the MLI).

Question: How does data cleaning in PUDs compare to existing frameworks? We show that PUDs generalize existing frameworks:
- The MLI in parfactor/subset PUDs generalizes cardinality repairs.
- The MLI in parfactor/update PUDs generalizes minimum-update repairs.

Question: Is data cleaning in the PUD framework efficient? In general, no: it is equivalent to probabilistic inference. However:
- For parfactor/subset PUDs with key constraints (i.e., when errors are limited to duplicates), the MLI can be computed in polynomial time.
- New result: an approximate inference algorithm with guarantees on the expected Hamming error w.r.t. I under a uniform noise model [Heidari, Ilyas, Rekatsinas, UAI 2019].

Approximate Inference in Structured Instances with Noisy Categorical Observations

Setup (with noise):
• a known graph G = (V, E)
• an unknown labeling X: V → {1, 2, …, k}
• a given noisy parity of each edge, flipped with probability p
• given noisy observations for each node, altered with probability q

Goal: (approximately) recover X. Formally, we want an algorithm that finds a labeling X̂ minimizing the worst-case expected Hamming error:

max_X E_{L∼D(X)} [ error(X̂, X) ],

where L denotes the noisy observations drawn from the noise model D(X).
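The observation model is straightforward to simulate. The sketch below (our own toy, on an arbitrary path graph) draws noisy edge parities and node observations and reports the Hamming error of the naive estimator that simply returns the node observations:

```python
import random

def noisy_observations(G_edges, X, k, p, q, rng):
    """Draw the noisy side information described in the setup above."""
    # Edge parities: 'same label?' bits, each flipped with probability p.
    parities = {(u, v): (X[u] == X[v]) ^ (rng.random() < p) for u, v in G_edges}
    # Node observations: true label, replaced by a uniform random label w.p. q.
    obs = {v: (x if rng.random() >= q else rng.randrange(k)) for v, x in X.items()}
    return parities, obs

def hamming_error(X_hat, X):
    return sum(X_hat[v] != X[v] for v in X)

rng = random.Random(1)
n, k, p, q = 1000, 3, 0.05, 0.2
X = {v: rng.randrange(k) for v in range(n)}
edges = [(v, v + 1) for v in range(n - 1)]          # a path: a tree, treewidth 1
parities, obs = noisy_observations(edges, X, k, p, q, rng)
print(hamming_error(obs, X) / n)  # ≈ q·(k-1)/k ≈ 0.13: the naive baseline to beat
```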

New algorithm: an approximate inference algorithm based on tree decompositions and correlation clustering, with guarantees on the worst-case expected Hamming error:

• For trees, the Hamming error is upper bounded by O(log(k) · p · n).
• For low-treewidth graphs, the Hamming error is upper bounded by O(k · log(k) · p^⌈Δ(G)/2⌉ · n).

It should be p < 1 / (k · log k) for the edge side information to be useful for statistical recovery.

PUD Learning

Problem statement: Assume a parametric representation of the Intension and the Realizer. We want to find maximum-likelihood estimates for the parameters of these representations.

Supervised variant: we are given examples of both unclean databases and their clean versions.
Unsupervised variant: we are given only unclean databases.

Question: Can we learn a PUD? Can we do so without any training data?

- We show standard learnability results for the supervised variant.
- More interesting result: under the uniform noise model and tuple independence, we can learn a PUD without any training data when the noise is bounded. The single instance J decomposes into multiple training examples, and under bounded noise the log-likelihood is convex.

From Theory to Systems

Is the PUD framework useful in practice?

HoloClean: Probabilistic Data Repairs

Reference: HoloClean: Holistic Data Repairs with Probabilistic Inference. Rekatsinas, Chu, Ilyas, Ré. VLDB 2017.

HoloClean is the first practical probabilistic data-repairing engine and a state-of-the-art data repairing system.

HoloClean’s factor-graph model is an instantiation of the PUD Intension model. It uses clean cells as training data to learn its Intension model and uses the learned model to approximate MLI repairs.

Challenge: Inference under constraints is #P-complete

Applying probabilistic inference naively does not scale to data cleaning instances with millions of tuples

Idea 1: Prune domain of random variables.

Idea 2: Relax constraints over sets of random variables to features over independent random variables.
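A rough illustration of Idea 1 (our own toy code; HoloClean’s actual pruning is driven by co-occurrence statistics, which this shortcut mimics on the running example’s `rows` and `ATTRS`):

```python
from collections import defaultdict

# Toy co-occurrence pruning: a cell's candidate repairs are limited to values
# that co-occur with one of the tuple's other attribute values in the dataset.
def candidate_domain(rows, target_attr, context_attr):
    cooccur = defaultdict(set)
    for t in rows.values():
        cooccur[t[ATTRS.index(context_attr)]].add(t[ATTRS.index(target_attr)])
    return cooccur

dom = candidate_domain(rows, target_attr="City", context_attr="Zip")
print(dom["60608"])  # {'Chicago', 'Cicago'}: the pruned domain for City cells with Zip 60608
```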

[Figure: a factor graph over the random variables t1.City, t1.Zip, t4.City, and t4.Zip from the example table. Factors with weights w1 and w2 encode signals such as “Address = 3465 S Morgan St”, and a factor with weight w3 encodes the constraint “Zip → City” jointly over the variables.]

Relaxing constraints

[Figure: after relaxation, t1.City and t4.City keep their w1 factors, but the joint constraint factor is replaced by independent factors with weight w3’: “assignment Chicago violates Zip → City due to t4” and “assignment Cicago violates Zip → City due to t1”.]

[Figure: similarly for t1.Zip and t4.Zip, with w2 factors and relaxed factors with weight w4’: “assignment 60608 violates Zip → City due to t4” and “assignment 60609 violates Zip → City due to t1”.]

We have one relaxed factor for each value in the domain of the RV.
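Schematically, a relaxed factor reduces to a per-value feature on a single random variable (our own sketch, reusing `rows` and `ATTRS`; the feature form is illustrative): it counts the violations of Zip → City that a candidate assignment would introduce:

```python
def relaxed_violation_feature(rows, tid, value):
    """Feature for 'tid.City = value': count Zip -> City violations it would cause."""
    zip_idx, city_idx = ATTRS.index("Zip"), ATTRS.index("City")
    tzip = rows[tid][zip_idx]
    return sum(1 for other, r in rows.items()
               if other != tid and r[zip_idx] == tzip and r[city_idx] != value)

# One independent feature per candidate value of t1.City (see the pruned domain above):
for v in ["Chicago", "Cicago"]:
    print(v, relaxed_violation_feature(rows, "t1", v))
# Chicago 1: violates Zip -> City due to t4
# Cicago 0: consistent with t4, so other signals must rule it out
```

Note that the violation feature alone would prefer “Cicago” for t1.City; in the full model it is combined with the other signals (e.g., the w1 factors) that penalize the misspelling.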

HoloClean in practice

• HoloClean: our approach, combining all signals and using inference
• Holistic [Chu, 2013]: state of the art for constraints & minimality
• KATARA [Chu, 2015]: state of the art for external data
• SCARE [Yakout, 2013]: state of the art for ML & qualitative statistics

Competing methods either do not scale or do not perform correct repairs.

[Charts: F1-score and runtime (in seconds) of the Full vs. Relaxed model as domain pruning increases (more pruning lowers recall and increases precision).]

Faster compilation, learning, and inference when we prune the RV domain.

Increased robustness (more accurate repairs) when the RV domain is ill-specified (no heavy pruning used).

Data Augmentation for Error Detection

[Figure: the HoloDetect architecture. A cell-value representation module and a data augmentation module (learned transformations Φ and policy Π) turn observed cell values into an augmented training dataset of (cell, observed value, transformed value) triples, which feeds a model training and classification module. Inputs: the dataset D, a small set of labeled examples T, and constraints Σ.]

HoloDetect learns a PUD realizer and uses the learned realizer to generate synthetic training data to teach a deep neural network how to detect erroneous values.

Reference: HoloDetect: A Few-Shot Learning Framework for Error Detection. Heidari, McGrath, Ilyas, Rekatsinas. SIGMOD 2019.

Error detection:
• Binary classification: for each cell, decide whether it is erroneous or correct.
• Severe class imbalance, high heterogeneity.
• Assumption: it is easy for human annotators to provide examples of correct tuples.

Challenge: How can we obtain labeled data while minimizing the input from human annotators?

Data Augmentation for Error Detection

Approach: Analyze the input dataset and learn how errors are introduced (i.e., learn a noisy channel). Then use the clean tuples as seeds and introduce artificial erroneous examples that obey the distribution of the noisy channel.

Program synthesis: learn a program to introduce errors.
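A toy version of this idea (entirely our own; HoloDetect learns its transformations Φ and policy Π from data, while this sketch hand-codes a single character-drop edit): learn an edit from one (clean, dirty) example pair and apply it to clean seeds to synthesize labeled errors:

```python
import random

def learn_typo_channel(pairs):
    """Recover simple character-drop edits from (clean, dirty) example pairs.
    A crude stand-in for HoloDetect's learned transformations and policy."""
    edits = []
    for clean, dirty in pairs:
        if len(dirty) == len(clean) - 1:           # exactly one dropped character
            for i, ch in enumerate(clean):
                if clean[:i] + clean[i + 1:] == dirty:
                    edits.append(("drop", ch))
    return edits

def augment(clean_values, edits, rng):
    """Apply a learned edit to clean seeds, producing labeled erroneous examples."""
    out = []
    for v in clean_values:
        kind, ch = rng.choice(edits)
        if kind == "drop" and ch in v:
            i = v.index(ch)
            out.append((v, v[:i] + v[i + 1:], "error"))
    return out

rng = random.Random(0)
edits = learn_typo_channel([("Chicago", "Cicago")])   # learns: 'h' gets dropped
print(augment(["Chicago", "Chicag", "Champaign"], edits, rng))
```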

Approach: Train a classifier on the augmented data to identify errors in the input dataset.

HoloDetect requires fewer training examples than competing approaches.

AutoFD: Functional Dependency Discovery via Structure Learning

• FD discovery is cast as a structure learning problem over a linear structured model.
• A lifted variation of structure learning using sparse regression (L1 regularization) gives a 2x F1 improvement over the state of the art (including non-lifted structure learning methods).
• Guarantees on FD discovery under a weak Realizer (bounded noise).

[Figure: an input noisy dataset of restaurants (DBAName, Address, City, State, Zip Code) with typos such as “Cicago”.]

Structure learning:
- Estimate the inverse covariance matrix of the lifted model.
- Fit a linear model by decomposing the estimated inverse covariance.
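A rough numeric sketch of that recipe (our own; AutoFD operates over a lifted representation of attribute-level correlations, which we shortcut here with integer-coded attributes): estimate the inverse covariance and read candidate dependencies off its large normalized off-diagonal entries:

```python
import numpy as np

# Toy data: Zip functionally determines City (column 1 determines column 0).
rng = np.random.default_rng(0)
zips = rng.integers(0, 3, size=500)            # 3 zip codes
city = zips.copy()                              # City is a function of Zip
state = np.zeros(500, dtype=int)                # a constant State column
noise = rng.integers(0, 5, size=500)            # an unrelated attribute
X = np.column_stack([city, zips, state, noise]).astype(float)
X += 1e-3 * rng.standard_normal(X.shape)        # jitter so the covariance is invertible

prec = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance (precision) matrix
# Large normalized off-diagonal precision entries indicate direct dependencies:
strength = np.abs(prec) / np.sqrt(np.outer(np.diag(prec), np.diag(prec)))
np.fill_diagonal(strength, 0)
print(np.round(strength, 2))                    # the (City, Zip) entry dominates
```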

AutoFD provides insights for downstream data preparation tasks:

• Effective feature engineering.
• Insights on the effectiveness of automated data cleaning.
• Example: increased imputation accuracy for attributes with dependencies (in the output of AutoFD).

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator, the noisy channel) → Observed Unclean Database J

A formal noisy channel model that leads to new insights for managing noisy data, has immediate practical applications to data cleaning systems (HoloClean, AutoFD, HoloDetect), and offers exciting connections to robust ML.

Thank you! thodrek@cs.wisc.edu