Overcoming the Quality Curse


Page 1: Overcoming the Quality Curse

Overcoming the Quality Curse

Sharad Mehrotra, University of California, Irvine

Collaborators/Students (Current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang

Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari

Page 2: Overcoming the Quality Curse

Beyond DASFAA 2003 paper ..

[Diagram: starting from DASFAA 2003, the work extends along three directions]

• Improving Efficiency

• Improving Quality

• New Domains: video data, image data, speech data, sensor data; entity search, people search, location search

Page 3: Overcoming the Quality Curse

Data Cleaning – a vital component of Enterprise Data Processing Workflow

[Diagram: Data Sources (OLTP, point of sale, organizational customer data) → ETL, with Data Cleaning → Data → Analysis/Mining → Decisions]

Analysis/Mining:

• Historical data analyses

• Trends, patterns, rules, models, ..

Decisions:

• Long term strategies

• Business decisions

Quality(Data) → Quality(Decisions)

Quality of Data → Quality of Analysis → Quality of Decisions

Page 4: Overcoming the Quality Curse

Entity Resolution Problem

[Figure: entities in the Real World and the references to them in the Digital World]

Page 5: Overcoming the Quality Curse

Standard Approach to Entity Resolution

Deciding if two references u and v co-refer by analyzing their features:

s(u,v) = f(u,v)

Here s is the “similarity function” and f is the “feature-based similarity”. If s(u,v) > t, then u and v are declared to co-refer.

[Figure: references u and v compared feature by feature]

• Name: J. Smith vs. John Smith

• Feature 2 vs. Feature 2

• Feature 3 vs. Feature 3

• Email: [email protected] vs. [email protected]
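The thresholded, feature-based decision can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the per-feature string similarity (Python's `difflib.SequenceMatcher`), the weighted average used for f, the feature names, the sample e-mail addresses, and the threshold t = 0.8. The slides do not say which function f is used.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (illustrative choice, not from the slides)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(u: dict, v: dict, weights: dict) -> float:
    """s(u, v) = f(u, v): here a weighted average of per-feature similarities."""
    total = sum(weights.values())
    score = sum(w * string_sim(u[f], v[f]) for f, w in weights.items())
    return score / total


def co_refer(u: dict, v: dict, weights: dict, t: float) -> bool:
    """Declare u and v co-referent when s(u, v) exceeds the threshold t."""
    return similarity(u, v, weights) > t


# Hypothetical references; the e-mail addresses are made up for the example.
u = {"name": "J. Smith", "email": "js@example.org"}
v = {"name": "John Smith", "email": "js@example.org"}
weights = {"name": 1.0, "email": 1.0}

print(co_refer(u, v, weights, t=0.8))
```

The decision depends entirely on the features of the two references and on the chosen threshold, which is the limitation the later slides address.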

Page 6: Overcoming the Quality Curse

Measuring Quality of Entity Resolution

• Entity dispersion: for an entity, the number of clusters its representations are spread over; the ideal is 1.

• Cluster diversity: for a cluster, the number of distinct entities it contains; the ideal is 1.

• Measures: F-Measure, B-Cubed F-Measure, Variation of Information (VI), Generalized Merge Distance (GMD), …

Four clusterings of six references to entity 1 and six references to entity 2 (Div = diversity, Dis = dispersion, H = entropy):

Ideal Clustering: C1 = {1,1,1,1,1,1}, C2 = {2,2,2,2,2,2}

  Cluster  Div  H          Entity  Dis  H
  C1       1    0          E1      1    0
  C2       1    0          E2      1    0

One Misassigned (Example 1): C1 = {1,1,1,1,1,2}, C2 = {2,2,2,2,2,1}

  Cluster  Div  H          Entity  Dis  H
  C1       2    0.65       E1      2    0.65
  C2       2    0.65       E2      2    0.65

Half Misassigned: C1 = {1,1,1,2,2,2}, C2 = {1,1,1,2,2,2}

  Cluster  Div  H          Entity  Dis  H
  C1       2    1          E1      2    1
  C2       2    1          E2      2    1

One Misassigned (Example 2): C1 = {1,1,1,1,1,1,2}, C2 = {2,2,2,2,2}

  Cluster  Div  H          Entity  Dis  H
  C1       2    0.592      E1      1    0
  C2       1    0          E2      2    0.65

• Dis/Div cannot distinguish Example 1 from Half Misassigned: both give 2 for every cluster and every entity.

• Entropy can: since 0.65 < 1, the first clustering (Example 1) is better.

• In Example 2, average entropy decreases (improves) compared to Example 1.
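The diversity, dispersion, and entropy values of the examples can be recomputed with a short sketch. It assumes six references per entity and base-2 entropy over the label (or cluster) proportions; that setup reproduces the reported 0.65 and 0.592, but the helper names and the data layout are illustrative and not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Base-2 entropy of a distribution given as raw counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def cluster_stats(clusters):
    """Per cluster: diversity (distinct entities) and entropy of its entity labels."""
    div = {c: len(set(members)) for c, members in clusters.items()}
    h = {c: entropy(Counter(members).values()) for c, members in clusters.items()}
    return div, h


def entity_stats(clusters):
    """Per entity: dispersion (clusters it appears in) and entropy of its placement."""
    placement = {}
    for c, members in clusters.items():
        for entity in members:
            placement.setdefault(entity, []).append(c)
    dis = {e: len(set(cs)) for e, cs in placement.items()}
    h = {e: entropy(Counter(cs).values()) for e, cs in placement.items()}
    return dis, h


# Each cluster is listed by the true entity label of its members.
example_1 = {"C1": [1] * 5 + [2], "C2": [2] * 5 + [1]}
half = {"C1": [1] * 3 + [2] * 3, "C2": [1] * 3 + [2] * 3}
example_2 = {"C1": [1] * 6 + [2], "C2": [2] * 5}

for name, clustering in [("Example 1", example_1), ("Half", half), ("Example 2", example_2)]:
    div, h = cluster_stats(clustering)
    print(name, div, {c: round(x, 3) for c, x in h.items()})
```

Example 1 and the half-misassigned case share the same diversity values, while their entropies differ, which is the point the slide makes.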

Page 7: Overcoming the Quality Curse

The Quality Curse -- Why Standard “Feature-based” Approach leads to Poor Results

• Significant entity dispersion.

• Significant cluster diversity.

Example references:

• Photo Collection of Sharad Mehrotra from Beijing, China June 2007 SIGMOD Trip

• Sharad Mehrotra, research interests: data management, Professor, UC Irvine

• S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India

• S. Mehrotra, PhD from University of Illinois is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India.

Page 8: Overcoming the Quality Curse

Overcoming the Quality Curse (1)..

Look more carefully at data for additional evidence

Page 9: Overcoming the Quality Curse

Exploiting Relationships among Entities

Author table (clean):

A1, ‘Dave White’, ‘Intel’

A2, ‘Don White’, ‘CMU’

A3, ‘Susan Grey’, ‘MIT’

A4, ‘John Black’, ‘MIT’

A5, ‘Joe Brown’, unknown

A6, ‘Liz Pink’, unknown

Publication table (to be cleaned):

P1, ‘Databases . . . ’, ‘John Black’, ‘Don White’

P2, ‘Multimedia . . . ’, ‘Sue Grey’, ‘D. White’

P3, ‘Title3 . . .’, ‘Dave White’

P4, ‘Title5 . . .’, ‘Don White’, ‘Joe Brown’

P5, ‘Title6 . . .’, ‘Joe Brown’, ‘Liz Pink’

P6, ‘Title7 . . . ’, ‘Liz Pink’, ‘D. White’

[Figure: ER Graph with nodes for the publications P1 to P6, the authors Dave White, Don White, Susan Grey, John Black, Joe Brown, and Liz Pink, and the affiliations Intel, CMU, and MIT; choice nodes 1 and 2 mark the unresolved ‘D. White’ references, with unknown edge weights such as w1 = ? and w3 = ?]

Context Attraction Principle (CAP): Nodes that are more connected have a higher chance of co-referring to the same entity
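The idea behind CAP can be sketched on the author and publication tables from this slide. This is a deliberately simplified stand-in, not the formal model from the cited papers: it treats connection strength as shortest-path distance in the entity-relationship graph, and it assumes that ‘Sue Grey’ has already been matched to A3. The function names are hypothetical.

```python
from collections import deque

# Clean author table from the slide: id -> (name, affiliation or None if unknown).
authors = {
    "A1": ("Dave White", "Intel"),
    "A2": ("Don White", "CMU"),
    "A3": ("Susan Grey", "MIT"),
    "A4": ("John Black", "MIT"),
    "A5": ("Joe Brown", None),
    "A6": ("Liz Pink", None),
}

# Publications with their unambiguous author references only.
# The 'D. White' references in P2 and P6 are left out: they are what we resolve.
papers = {
    "P1": ["A4", "A2"],
    "P2": ["A3"],  # assumption: 'Sue Grey' is Susan Grey (A3)
    "P3": ["A1"],
    "P4": ["A2", "A5"],
    "P5": ["A5", "A6"],
    "P6": ["A6"],
}


def build_graph():
    """Undirected graph linking authors to affiliations and papers to authors."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for author_id, (_, org) in authors.items():
        graph.setdefault(author_id, set())
        if org is not None:
            link(author_id, org)
    for paper_id, author_ids in papers.items():
        for author_id in author_ids:
            link(paper_id, author_id)
    return graph


def distance(graph, src, dst):
    """Number of hops on a shortest path, or None if dst is unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def resolve(graph, paper_id, candidates):
    """Pick the candidate author most closely connected to the paper's context."""
    scores = {c: distance(graph, paper_id, c) for c in candidates}
    reachable = {c: d for c, d in scores.items() if d is not None}
    best = min(reachable, key=reachable.get) if reachable else None
    return best, scores


graph = build_graph()
print(resolve(graph, "P2", ["A1", "A2"]))
print(resolve(graph, "P6", ["A1", "A2"]))
```

In this toy graph, Don White (A2) is reachable from both P2 (via Susan Grey, MIT, John Black, and P1) and P6 (via Liz Pink, P5, Joe Brown, and P4), while Dave White (A1) is not connected to either context, so both ‘D. White’ references lean toward A2.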

Page 10: Overcoming the Quality Curse

Exploiting Relationships for ER (Ph.D. Thesis, Stella Chen)

• Formalizing the CAP principle [SDM 05, IQIS 05]

• Scaling to large graphs [TODS 06]

• Self-Tuning [DASFAA 07, JCDL 07, Journal IQ 11]

  – Not all relationships are equal

  – E.g., mutual interest in Bruce Lee movies is possibly not as important as being colleagues at a university for predicting co-authorship.

• Merging relationship evidence with other evidence [SIGMOD ‘09]

• Applying to People search on Web [ICDE ‘07, TDKE 08, ICDE 09 (demo)]

Page 11: Overcoming the Quality Curse

Effectiveness of Exploiting Relationships

• WEPS

• Multimedia

Page 12: Overcoming the Quality Curse

Smart Video Surveillance

• Camera array to track human activities

• Video collection: CS Building in UC Irvine

[Pipeline: Surveillance Video Database → Semantic Extraction → Event Database → Query/Analysis]

Page 13: Overcoming the Quality Curse

Event Model

[Pipeline: Surveillance Video Database → Semantic Extraction → Event Database → Query/Analysis]

Event model:

• Event attributes: who, what, where, when, other property

• Extraction steps that populate them: face recognition, activity recognition, localization, temporal placement, extraction

Query Examples:

• Who was the last visitor to Mike Carey’s office yesterday?

• Who spends more time in Labs – database students or embedded computing students?
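A minimal sketch of how such an event record and the first example query might look in code. The field types, the `last_visitor` helper, and the sample events are assumptions for illustration; the slides only name the attributes who, what, where, when, and other property.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class Event:
    """One extracted event; `who` is None when the person could not be identified."""
    who: Optional[str]
    what: str
    where: str
    when: datetime
    other: dict = field(default_factory=dict)


def last_visitor(events, place: str, day: date) -> Optional[str]:
    """Return the identified person of the latest event at `place` on `day`."""
    visits = [
        e for e in events
        if e.where == place and e.when.date() == day and e.who is not None
    ]
    if not visits:
        return None
    return max(visits, key=lambda e: e.when).who


# Hypothetical event database contents.
events = [
    Event("Alice", "enter", "Mike Carey's office", datetime(2013, 4, 21, 9, 30)),
    Event("Bob", "enter", "Mike Carey's office", datetime(2013, 4, 21, 16, 5)),
    Event(None, "enter", "Mike Carey's office", datetime(2013, 4, 21, 17, 40)),
    Event("Alice", "enter", "Lab", datetime(2013, 4, 21, 18, 0)),
]

print(last_visitor(events, "Mike Carey's office", date(2013, 4, 21)))
```

The third sample event has no identified person, so the answer to the query depends directly on how well the `who` attribute can be resolved.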

Page 14: Overcoming the Quality Curse

Person Identification Challenge

[Figure: the event model (who, what, where, when, other property) with its extraction steps; the “who” attribute is the open question: Bob, Alice, other, or ???]

Page 15: Overcoming the Quality Curse

Traditional Approach

[Pipeline: Face Detection → Face Recognition → ???]

• Detect 70 faces / 1000 images

• 2~3 images / person

• Poor performance

Page 16: Overcoming the Quality Curse

Rationale for Poor Performance

Poor Quality of Data:

• No faces

• Small faces

• Low resolution

• Low temporal resolution

  Resolution       Performance
  original         original performance
  1/2 original     drops to 70%
  1/3 original     drops to 30%

  Sampling rate    Performance
  1 frame/sec      original performance
  1/2 frame/sec    drops to 53%
  1/3 frame/sec    drops to 35%

Page 17: Overcoming the Quality Curse

Effectiveness of Exploiting Relationships

• WEPS

• Multimedia [IQ2S PERCOM 2011]

Page 18: Overcoming the Quality Curse

Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

Page 19: Overcoming the Quality Curse

Results

• High precision: 662 clusters, 31 real persons, 631 merges

• High precision: 203 clusters, 31 real persons, 172 merges

• 4 Times

Page 20: Overcoming the Quality Curse

Overcoming the Quality Curse (2)..

Look outside the box

Page 21: Overcoming the Quality Curse

Exploiting Search Engine Statistics

• Correlations amongst context entities provide an additional source of information to resolve entities

Example: Google search results of “Andrew McCallum”

Context entities:

• Context 1: Sebastian Thrun, Machine Learning, Text Retrieval

• Context 2: Tom Mitchell, CRF, UAI 2003

Search engine queries to learn correlations amongst contexts:

• Sebastian Thrun AND Tom Mitchell

• Andrew McCallum AND Sebastian Thrun AND Tom Mitchell

• (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)

• Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
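The four query strings on this slide follow a regular pattern, which the sketch below makes explicit: a person pair, the person pair with the queried name, a concept pair, and the concept pair with the queried name. The function and parameter names are hypothetical, and the grouping of the context entities into two contexts (one person and two concepts each) is read off the slide layout rather than stated on it.

```python
def or_group(terms):
    """Join alternative terms into a parenthesized OR expression."""
    return "(" + " OR ".join(terms) + ")"


def correlation_queries(name, person_a, person_b, concepts_a, concepts_b):
    """Build the four co-occurrence queries for two contexts of the queried name."""
    people = f"{person_a} AND {person_b}"
    concepts = f"{or_group(concepts_a)} AND {or_group(concepts_b)}"
    return [
        people,
        f"{name} AND {people}",
        concepts,
        f"{name} AND {concepts}",
    ]


queries = correlation_queries(
    "Andrew McCallum",
    "Sebastian Thrun",
    "Tom Mitchell",
    ["Machine Learning", "Text Retrieval"],
    ["CRF", "UAI 2003"],
)
for q in queries:
    print(q)
```

Comparing the hit counts of the queries with and without the name gives a rough signal of how strongly the two contexts are tied to the same person; how those counts are turned into a similarity score is not shown on the slide and is not attempted here.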

Page 22: Overcoming the Quality Curse

Exploiting Web Search Engine Statistics (Ph.D. Thesis, Rabia Nuray)

• Web Queries to Learn correlations [SIGIR 08]

• Application to Web People Search [WePS 09]

• Cluster refinement to overcome the singleton cluster problem [TODS 11-a]

• Making Web querying robust to server side fluctuations [tech. report]

• Scaling up the Web Query Technique [TODS 11-a]

Page 23: Overcoming the Quality Curse

Comparing with the State-of-the-art on WEPS-2 Dataset

Page 24: Overcoming the Quality Curse

Observation/Conclusion…

• Additional evidence can be exploited to improve data quality

• BUT … it is expensive!!

• Example: Web Queries Approach

  – Number of queries: 4K^2 (~40K for 100 results)

  – Very large to submit to a search engine & expect real-time results

  – ~6-8 minutes (network costs, search engine load)

• Solutions:

  – Local caching of the Web

  – Ask only important queries

  – Reduces to 1-2 min. without degrading quality much
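The query-count figure and the caching idea can be made concrete with a small sketch. The 4K^2 formula is from the slide; the `CachedCounter` class and the `fake_search` stand-in are hypothetical and only show why repeated queries cost nothing once cached. No real search engine API is called.

```python
def query_count(k: int) -> int:
    """Number of web queries for K search results, per the slide: 4 * K^2."""
    return 4 * k * k


class CachedCounter:
    """Wrap a hit-count lookup so each distinct query is fetched only once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: query string -> hit count
        self._cache = {}
        self.remote_calls = 0    # how many lookups actually went to `fetch`

    def hits(self, query: str) -> int:
        if query not in self._cache:
            self.remote_calls += 1
            self._cache[query] = self._fetch(query)
        return self._cache[query]


def fake_search(query: str) -> int:
    """Stand-in for a search engine; returns a made-up hit count."""
    return len(query)


print(query_count(100))  # 40000, i.e. the ~40K on the slide

counter = CachedCounter(fake_search)
for q in ["a AND b", "a AND b", "c AND d", "a AND b"]:
    counter.hits(q)
print(counter.remote_calls)
```

Four lookups lead to only two remote calls here, since the repeated query is served from the local cache.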

Page 25: Overcoming the Quality Curse

(Near) Future: Addressing the Efficiency Curse …

[Diagram: from DASFAA 2003 along three directions: Improving Efficiency, Improving Quality, New Domains]

Two complementary approaches:

– Pay-as-you-go data cleaning

  – Progressive algorithm to obtain the best quality given a budget constraint

– Query-driven data cleaning

  – Perform minimal cleaning to answer the query/analysis task. Avoid having to clean unnecessary data.
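A rough sketch of the pay-as-you-go idea, under stated assumptions: candidate pairs carry a cheap priority estimate, the expensive comparison is run in descending priority order, and processing stops when the budget of comparisons is used up. The priority values, the token-overlap similarity, and the threshold are all hypothetical; the slides do not describe the actual progressive algorithm.

```python
def jaccard(u: str, v: str) -> float:
    """Token-set overlap in [0, 1]; stands in for an expensive similarity function."""
    a, b = set(u.lower().split()), set(v.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def progressive_resolution(candidate_pairs, similarity, budget, threshold=0.8):
    """Compare the most promising pairs first until `budget` comparisons are spent.

    candidate_pairs: iterable of (priority, u, v); higher priority = compare sooner.
    Returns the matched pairs found so far and the number of comparisons used.
    """
    ordered = sorted(candidate_pairs, key=lambda p: p[0], reverse=True)
    matches = []
    spent = 0
    for _, u, v in ordered:
        if spent >= budget:
            break
        spent += 1
        if similarity(u, v) > threshold:
            matches.append((u, v))
    return matches, spent


pairs = [
    (0.9, "John Smith", "john smith"),
    (0.2, "Ann Lee", "Bob Ray"),
    (0.7, "J. Smith", "John Smith"),
]
print(progressive_resolution(pairs, jaccard, budget=2))
```

With a budget of two comparisons, the low-priority pair is never examined; raising the budget spends more work for potentially more matches, which is the trade-off the slide describes.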