Tamr // Data Driven NYC // September 2014

Moving Data Curation from Theory to Practice, Ihab Ilyas, University of Waterloo

Tamr Co-Founder Ihab Ilyas presented at September 2014's edition of Data Driven NYC. Tamr is a data connection platform.

Transcript of Tamr // Data Driven NYC // September 2014

Page 1: Tamr // Data Driven NYC // September 2014

Moving Data Curation from Theory to Practice

Ihab Ilyas

University of Waterloo

Page 2: Tamr // Data Driven NYC // September 2014

Page 3: Tamr // Data Driven NYC // September 2014

Data Curation: Many Definitions and One Goal

Extract Value from Data

Page 4: Tamr // Data Driven NYC // September 2014

Many Technical Challenges

Automatic Schema Mapping

Example: a global equipment manufacturer with thousands of products across hundreds of databases from multiple suppliers, all describing the same part numbers
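
A minimal sketch of name-based attribute matching, one simple ingredient of automatic schema mapping. The column names, the `suggest_mappings` helper and the 0.6 threshold are hypothetical and chosen only for illustration; the slide does not describe how Tamr performs the mapping.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a column name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_mappings(source_columns, target_attribute, threshold=0.6):
    """Return (column, score) pairs that may map to the target attribute,
    best match first. The threshold is an arbitrary illustrative value."""
    scored = [(col, name_similarity(col, target_attribute)) for col in source_columns]
    kept = [(col, score) for col, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


# Hypothetical column names from different supplier databases.
columns = ["PART_NUM", "part-number", "Supplier Name", "PartNo", "unit_price"]
suggestions = suggest_mappings(columns, "Part Number")
for column, score in suggestions:
    print(f"{column}: {score:.2f}")
```

Name similarity alone is weak evidence; a practical matcher would usually also compare the values stored in each column.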

<Part Number>

Page 5: Tamr // Data Driven NYC // September 2014

Many Technical Challenges

Record Linkage and Deduplication

Example: Thomson Reuters spent 6 months on a single deduplication project covering a subset of its data sources

Figure: Record 1, Record 2, Record 3 and Record 4 -> Unified Record

Page 6: Tamr // Data Driven NYC // September 2014

Many Technical Challenges

Missing Values

Example: Most real data collected from sensors, surveys, and agents has a high percentage of N/A values, nulls, and special placeholder values (such as 99999)
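
A minimal sketch of how such placeholders might be normalized before measuring how much of a column is actually missing. The marker list and the sample readings are hypothetical; in practice a value like 99999 may be legitimate in some columns, so the list has to come from the people who own the data.

```python
# Hypothetical placeholder spellings that stand in for "unknown".
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "99999"}


def normalize_missing(value):
    """Return None for values that only stand in for 'unknown'."""
    if value is None:
        return None
    return None if str(value).strip().lower() in MISSING_MARKERS else value


def missing_rate(values):
    """Fraction of values that are missing after normalization."""
    if not values:
        return 0.0
    cleaned = [normalize_missing(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)


# Hypothetical sensor readings mixing real values and placeholders.
readings = [21.5, "N/A", None, 99999, 19.8, "null", 20.1, ""]
rate = missing_rate(readings)
print(f"missing rate: {rate:.1%}")
```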

Page 7: Tamr // Data Driven NYC // September 2014

Common Data Quality Issues

ID  Name   ZIP    City           State  Income
1   Green  60610  Chicago        IL     30k
2   Green  60611  Chicago        IL     32k
3   Peter         New Yrk        NY     40k
4   John   11507  New York       NY     40k
5   Gree   90057  Los Angeles    CA     55k
6   Chuck  90057  San Francisco  CA     30k

Issues highlighted: Duplicates, Syntactic Error, Integrity Constraint Violation, Missing Value

Corrected values shown on the slide: 1 Green 60610 Chicago IL 31k; 11507 New York; Los Angeles
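
A minimal sketch of how two of these issues could be detected on the table above. The slide does not name the integrity constraint, so the rule used here (a ZIP code determines its city) is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# Rows from the slide: (ID, Name, ZIP, City, State, Income); None marks the missing ZIP.
ROWS = [
    (1, "Green", "60610", "Chicago", "IL", "30k"),
    (2, "Green", "60611", "Chicago", "IL", "32k"),
    (3, "Peter", None, "New Yrk", "NY", "40k"),
    (4, "John", "11507", "New York", "NY", "40k"),
    (5, "Gree", "90057", "Los Angeles", "CA", "55k"),
    (6, "Chuck", "90057", "San Francisco", "CA", "30k"),
]


def zip_city_violations(rows):
    """Return ZIP codes that appear with more than one city.
    Assumes the (unstated) rule that a ZIP code determines its city."""
    cities_by_zip = defaultdict(set)
    for _id, _name, zip_code, city, _state, _income in rows:
        if zip_code is not None:
            cities_by_zip[zip_code].add(city)
    return {z: sorted(c) for z, c in cities_by_zip.items() if len(c) > 1}


def rows_missing_zip(rows):
    """Return the IDs of rows whose ZIP is missing."""
    return [row[0] for row in rows if row[2] is None]


violations = zip_city_violations(ROWS)
missing = rows_missing_zip(ROWS)
print(violations)
print(missing)
```

Detecting a violation does not say which row is wrong; choosing the repair is where the data owners come in.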

Page 8: Tamr // Data Driven NYC // September 2014

Are We Missing the Real Challenges?

"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group

Page 9: Tamr // Data Driven NYC // September 2014

Realities of Data Curation Efforts

Data is owned by people and is not an orphan

Result: Fully automated cleaning will probably never be adopted in an enterprise setting

Figure: Data Stewards, IT and Data Experts in Constant Interaction

Page 10: Tamr // Data Driven NYC // September 2014

Realities of Data Curation Efforts

Scale renders most solutions un-deployable

Result: Need to rethink all cleaning algorithms, including record linkage, to work at scale and avoid quadratic complexity. De-duping one million records naïvely can take weeks (even on a big machine)
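
To make the quadratic cost concrete: comparing every record with every other record takes n(n-1)/2 comparisons, which is about 5 x 10^11 for one million records. The sketch below counts this and contrasts it with plain blocking, where only records sharing a key are compared. This is the textbook blocking technique, not the fuzzy blocking Tamr describes, and the blocking key (first letter of the name) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def naive_comparisons(n: int) -> int:
    """Number of record pairs when every record is compared with every other."""
    return comb(n, 2)


def blocked_pairs(records, block_key):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


print(naive_comparisons(1_000_000))

# Hypothetical (name, ZIP) records, blocked on the first letter of the name.
people = [
    ("Green", "60610"), ("Green", "60611"), ("Peter", "11507"),
    ("John", "11507"), ("Gree", "90057"), ("Chuck", "90057"),
]
pairs = list(blocked_pairs(people, block_key=lambda r: r[0][0].upper()))
print(len(pairs), "pairs instead of", naive_comparisons(len(people)))
```

The trade-off is recall: a crude key can place true duplicates in different blocks, so they are never compared.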

Page 11: Tamr // Data Driven NYC // September 2014

Realities of Data Curation Efforts

Data Variety is even worse

Result: Curation requires its own stack including transformations and adaptors

Page 12: Tamr // Data Driven NYC // September 2014

Realities of Data Curation Efforts

Iterative by nature -- not by design

Result: Curation needs to be incremental and agile, with low startup overhead. A curation solution should be part of the data production line

Data stream

Page 13: Tamr // Data Driven NYC // September 2014

Pragmatic ≠ Unprincipled

Page 14: Tamr // Data Driven NYC // September 2014

Leverage All Data

• Data ownership

• Scale

• Variety

• Incremental cleaning

[CIDR 2013]

Page 15: Tamr // Data Driven NYC // September 2014

Tamr: Machine Learning with Human Insight

Identify sources, understand relationships and curate the massive variety of siloed data

Diagram labels: Structured and Semi-structured Data Sources; Collaborative Curation; Data Experts (Source owners); Data Stewards and Curators; Data Inventory; APIs; Systems; Tools; Data Scientists; Advanced Algorithms & Machine Learning; Expert Input; Integrated Data & Metadata; Expert Directory

Data Ownership

Non-programmatic Interfaces

Page 16: Tamr // Data Driven NYC // September 2014

Tamr: Machine Learning with Human Insight

Identify sources, understand relationships and curate the massive variety of siloed data

Diagram: same labels as the diagram on page 15

Scale & Variety

Page 17: Tamr // Data Driven NYC // September 2014

Tamr: Machine Learning with Human Insight

Identify sources, understand relationships and curate the massive variety of siloed data

Diagram: same labels as the diagram on page 15

Incremental

Page 18: Tamr // Data Driven NYC // September 2014

Example Tamr Functionality: Entity Resolution

Relation with duplicates

ID  name   ZIP    Income
P1  Green  51519  30k
P2  Green  51518  32k
P3  Peter  30528  40k
P4  Peter  30528  40k
P5  Gree   51519  55k
P6  Chuck  51519  30k

Steps: Compute Pair-wise Similarity -> Cluster Similar Records -> Merge Clusters

Figure: similarity graph over P1 to P6 (edge weights shown: 0.3, 0.5, 0.9, 1.0), grouped into clusters C1, C2 and C3

Clean Relation

ID  name   ZIP    Income
C1  Green  51519  39k
C2  Peter  30528  40k
C3  Chuck  51519  30k
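
The three steps on this slide (pair-wise similarity, clustering, merging) can be sketched as follows. This is a minimal illustration under stated assumptions, not Tamr's implementation: the similarity function (average of name and ZIP string similarity from `difflib`), the 0.8 threshold, and the merge rule (majority value for name and ZIP, mean for income) are all hypothetical choices.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Relation with duplicates from the slide: ID -> (name, ZIP, income in thousands).
RECORDS = {
    "P1": ("Green", "51519", 30),
    "P2": ("Green", "51518", 32),
    "P3": ("Peter", "30528", 40),
    "P4": ("Peter", "30528", 40),
    "P5": ("Gree", "51519", 55),
    "P6": ("Chuck", "51519", 30),
}


def similarity(a, b):
    """Assumed similarity: mean of name and ZIP string similarity."""
    name_sim = SequenceMatcher(None, a[0], b[0]).ratio()
    zip_sim = SequenceMatcher(None, a[1], b[1]).ratio()
    return (name_sim + zip_sim) / 2


def cluster(records, threshold=0.8):
    """Group records connected by a pair whose similarity meets the threshold."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())


def merge(records, members):
    """Assumed merge rule: majority name and ZIP, mean income."""
    name = Counter(records[m][0] for m in members).most_common(1)[0][0]
    zip_code = Counter(records[m][1] for m in members).most_common(1)[0][0]
    income = round(sum(records[m][2] for m in members) / len(members))
    return (name, zip_code, income)


clusters = cluster(RECORDS)
clean = [merge(RECORDS, members) for members in clusters]
print(clusters)
print(clean)
```

With these assumed choices the sketch groups P1, P2 and P5 together and yields the same three rows as the clean relation on the slide. Note that it still compares all pairs, so it has the quadratic cost discussed earlier.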

Page 19: Tamr // Data Driven NYC // September 2014

Tamr Solution

• Object linkage model
  – Treats schema mapping and record linkage as one process
  – Feature extraction and evidence accumulation

• Novel fuzzy blocking
  – A hierarchy of classifiers, from binning candidates in the same comparison clusters to aggressive linkage of duplicates

• Open channel with humans in different capacities
  – Experts to increase the training sets
  – Stewards for verification and application of updates
  – IT for modeling user enterprise rules
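
One common way to bring experts into a matching loop is to decide confident cases automatically and send only the uncertain pairs to a person, whose answers then become new training examples. The sketch below illustrates that idea only; the thresholds, record IDs and scores are hypothetical, and the slide does not say that Tamr routes work this way.

```python
def triage(scored_pairs, accept_at=0.9, reject_at=0.3):
    """Split scored candidate pairs into accepted, rejected and expert-review lists.
    Both thresholds are illustrative assumptions."""
    accepted, rejected, ask_expert = [], [], []
    for pair, score in scored_pairs:
        if score >= accept_at:
            accepted.append(pair)
        elif score <= reject_at:
            rejected.append(pair)
        else:
            ask_expert.append(pair)
    return accepted, rejected, ask_expert


# Hypothetical candidate pairs with hypothetical match scores.
scored = [
    (("r1", "r2"), 1.0),
    (("r3", "r4"), 0.9),
    (("r5", "r6"), 0.5),
    (("r7", "r8"), 0.3),
]
accepted, rejected, ask_expert = triage(scored)

# Expert answers on the uncertain pairs extend the training set.
training_set = []
expert_answers = {("r5", "r6"): True}  # hypothetical expert decision
training_set.extend((pair, expert_answers[pair]) for pair in ask_expert)
print(accepted, rejected, ask_expert, training_set)
```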

Page 20: Tamr // Data Driven NYC // September 2014

Tamr Architecture

Page 21: Tamr // Data Driven NYC // September 2014

Suggestions to Build a Unified Schema

Page 22: Tamr // Data Driven NYC // September 2014

ML to Match Records Across Sources

Page 23: Tamr // Data Driven NYC // September 2014

www.tamr.com

Thank you