Overcoming the Quality Curse
Sharad Mehrotra, University of California, Irvine
Collaborators/Students (current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang
Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari
Beyond DASFAA 2003 paper …
Improving Efficiency
Improving Quality
New Domains
Video data, Image data, Speech data, Sensor data
Entity Search, People Search, Location Search
DASFAA 2003
Data Cleaning – a vital component of the Enterprise Data Processing Workflow
Data Sources (OLTP, point of sale, organizational customer data) → Data Cleaning → ETL → Data → Analysis/Mining → Decisions

Decisions:
• Long-term strategies
• Business decisions
• Historical data analyses
• Trends, patterns, rules, models, …

Quality(Data) ⇒ Quality(Analysis) ⇒ Quality(Decisions)
Entity Resolution Problem
Real World
Digital World
Standard Approach to Entity Resolution

s(u, v) = f(u, v)

Example references: u = "J. Smith" and v = "John Smith", compared on features such as e-mail ([email protected] vs. [email protected]), Feature 2, Feature 3, …

"Similarity function" / "feature-based similarity": deciding if two references u and v co-refer by analyzing their features; if s(u, v) > t, then u and v are declared to co-refer.
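The decision rule above can be sketched in a few lines; the feature set, weights, string-similarity choice, and e-mail values below are illustrative assumptions, not details from the talk.

```python
# Sketch of the standard feature-based approach: compare references u and v
# feature by feature, combine the per-feature similarities into one score
# s(u, v) = f(u, v), and declare co-reference when the score exceeds a
# threshold t. Weights, features, and sample values are illustrative.
from difflib import SequenceMatcher

def feature_sim(a: str, b: str) -> float:
    """Similarity of one pair of feature values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def s(u: dict, v: dict, weights: dict) -> float:
    """Weighted combination f(u, v) over the shared features."""
    return sum(w * feature_sim(u[feat], v[feat])
               for feat, w in weights.items()) / sum(weights.values())

u = {"name": "J. Smith", "email": "jsmith@example.com"}        # hypothetical
v = {"name": "John Smith", "email": "john.smith@example.com"}  # hypothetical
weights = {"name": 0.7, "email": 0.3}

t = 0.5  # decision threshold
score = s(u, v, weights)
co_refer = score > t  # if s(u, v) > t, declare u and v to co-refer
print(f"s(u, v) = {score:.2f}, co-refer: {co_refer}")
```

The whole quality question of the next slides comes down to how often this simple threshold rule gets the decision wrong.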
Measuring Quality of Entity Resolution

• Entity dispersion: for an entity, into how many clusters its representatives are clustered; ideal is 1.
• Cluster diversity: for a cluster, how many distinct entities it contains; ideal is 1.
• Measures: F-measure, B-Cubed F-measure, Variation of Information (VI), Generalized Merge Distance (GMD), …
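Of the listed measures, B-Cubed is easy to state concretely; a hedged sketch with made-up labels (the formula is the standard one, the data is illustrative):

```python
# B-Cubed: for each reference, precision asks how pure its cluster is and
# recall asks how much of its entity ended up in that cluster; both are
# averaged over all references. Labels below are illustrative.
def b_cubed(clusters: list[int], truth: list[int]) -> tuple[float, float, float]:
    n = len(clusters)
    prec = rec = 0.0
    for i in range(n):
        same_cluster = {j for j in range(n) if clusters[j] == clusters[i]}
        same_entity = {j for j in range(n) if truth[j] == truth[i]}
        overlap = len(same_cluster & same_entity)
        prec += overlap / len(same_cluster)  # purity of i's cluster
        rec += overlap / len(same_entity)    # coverage of i's entity
    p, r = prec / n, rec / n
    return p, r, 2 * p * r / (p + r)

truth = [1, 1, 1, 2, 2, 2]  # two entities, three references each
ideal = [0, 0, 0, 1, 1, 1]  # one cluster per entity
p, r, f = b_cubed(ideal, truth)
print(p, r, f)  # 1.0 1.0 1.0
```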
[Figure: four example clusterings of references from two entities – Ideal Clustering, One Misassigned (Example 1), Half Misassigned, One Misassigned (Example 2) – each with per-cluster diversity (Div), per-entity dispersion (Dis), and the corresponding entropy H. Ideal: Div = Dis = 1, H = 0. One Misassigned: Div = Dis = 2, H ≈ 0.65. Half Misassigned: Div = Dis = 2, H = 1.]
• Dis/Div cannot distinguish the two cases.
• Entropy can: since 0.65 < 1, the first clustering is better.
• Average entropy decreases (improves) compared to Example 1.
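The entropy idea above can be sketched directly: the diversity entropy of a cluster is the entropy of the entity labels it contains (dispersion entropy is the symmetric notion over an entity's clusters). The label lists below are illustrative; the exact values on the slide depend on its reference counts.

```python
# Entropy of the entity labels inside a cluster: 0 for a pure cluster,
# 1 (for two entities) when the cluster is maximally mixed.
from collections import Counter
from math import log2

def entropy(labels: list[int]) -> float:
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

pure = [1, 1, 1, 1]    # one entity only: entropy 0, the ideal case
skewed = [1, 1, 1, 2]  # one misassigned reference
mixed = [1, 1, 2, 2]   # half of each entity: maximal entropy

print(entropy(mixed))             # 1.0
print(round(entropy(skewed), 2))  # 0.81
```

Unlike raw Dis/Div counts, the entropy separates "one misassigned" from "half misassigned" even though both touch two entities per cluster.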
The Quality Curse – Why the Standard "Feature-based" Approach Leads to Poor Results

• Significant entity dispersion.
• Significant cluster diversity.
Photo Collection of Sharad Mehrotra from Beijing, China June 2007 SIGMOD Trip

Sharad Mehrotra, research interests: data management, Professor, UC Irvine

S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India

S. Mehrotra, PhD from University of Illinois is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India.
Overcoming the Quality Curse (1)…

Look more carefully at the data for additional evidence.
Exploiting Relationships among Entities
Author table (clean):
A1, ‘Dave White’, ‘Intel’
A2, ‘Don White’, ‘CMU’
A3, ‘Susan Grey’, ‘MIT’
A4, ‘John Black’, ‘MIT’
A5, ‘Joe Brown’, unknown
A6, ‘Liz Pink’, unknown

Publication table (to be cleaned):
P1, ‘Databases . . . ’, ‘John Black’, ‘Don White’
P2, ‘Multimedia . . . ’, ‘Sue Grey’, ‘D. White’
P3, ‘Title3 . . .’, ‘Dave White’
P4, ‘Title5 . . .’, ‘Don White’, ‘Joe Brown’
P5, ‘Title6 . . .’, ‘Joe Brown’, ‘Liz Pink’
P6, ‘Title7 . . . ’, ‘Liz Pink’, ‘D. White’
[ER Graph: papers P1–P6 linked to the authors above; the ambiguous references ‘D. White’ in P2 and P6 connect to both Dave White (Intel) and Don White (CMU) through edges with unknown weights w1, …, w3.]
Context Attraction Principle (CAP): nodes that are more connected have a higher chance of co-referring to the same entity.
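The CAP becomes operational once "more connected" is given a concrete measure. A toy sketch over the author/paper example, using bounded-length simple-path counting as a stand-in for the more refined connection-strength models of the actual work:

```python
# Toy sketch of the CAP: to resolve the ambiguous 'D. White' in P6, count
# bounded-length simple paths from P6 to each candidate in the ER graph
# and prefer the better-connected one. Path counting here is a stand-in
# for the thesis's connection-strength models.
from collections import defaultdict

def count_paths(graph, src, dst, max_len):
    """Count simple paths from src to dst using at most max_len edges."""
    def dfs(node, visited, depth):
        if node == dst:
            return 1
        if depth == max_len:
            return 0
        return sum(dfs(nxt, visited | {nxt}, depth + 1)
                   for nxt in graph[node] if nxt not in visited)
    return dfs(src, {src}, 0)

graph = defaultdict(set)
edges = [("P6", "Liz Pink"), ("P5", "Liz Pink"), ("P5", "Joe Brown"),
         ("P4", "Joe Brown"), ("P4", "Don White"), ("P3", "Dave White")]
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

print(count_paths(graph, "P6", "Don White", max_len=6))   # 1
print(count_paths(graph, "P6", "Dave White", max_len=6))  # 0 (not connected)
```

Don White is reachable from P6 via Liz Pink, P5, Joe Brown, and P4, while Dave White is not connected at all, so CAP resolves the ‘D. White’ in P6 to Don White.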
Exploiting Relationships for ER – Ph.D. Thesis, Stella Chen

• Formalizing the CAP principle [SDM 05, IQIS 05]
• Scaling to large graphs [TODS 06]
• Self-tuning [DASFAA 07, JCDL 07, Journal IQ 11]
  – Not all relationships are equal – e.g., a mutual interest in Bruce Lee movies is possibly not as important as being colleagues at a university for predicting co-authorship.
• Merging relationship evidence with other evidence [SIGMOD '09]
• Applying to people search on the Web [ICDE '07, TKDE 08, ICDE 09 (demo)]
Effectiveness of Exploiting Relationships

• WEPS
• Multimedia
Smart Video Surveillance

• Camera array to track human activities
• CS Building at UC Irvine
• Video collection
Surveillance Video Database → Semantic Extraction → Event Database → Query/Analysis

Event model: event(who, what, when, where, other properties)
– who: face recognition
– what: activity recognition
– when: temporal placement
– where: localization
– other properties: extraction
Query examples:
• Who was the last visitor to Mike Carey's office yesterday?
• Who spends more time in labs – database students or embedded computing students?
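Once events are stored as structured who/what/when/where records, such queries become simple selections over the event database. A toy version of the first example; all records, names, and timestamps below are made up for illustration.

```python
# Answer "Who was the last visitor to Mike Carey's office?" by filtering
# event records on 'where'/'what' and taking the latest 'when'.
from datetime import datetime

events = [
    {"who": "Alice", "what": "enter-office", "where": "Mike Carey's office",
     "when": datetime(2013, 5, 6, 10, 15)},
    {"who": "Bob", "what": "enter-office", "where": "Mike Carey's office",
     "when": datetime(2013, 5, 6, 16, 40)},
    {"who": "Alice", "what": "enter-lab", "where": "Database Lab",
     "when": datetime(2013, 5, 6, 11, 5)},
]

visits = [e for e in events
          if e["where"] == "Mike Carey's office" and e["what"] == "enter-office"]
last_visitor = max(visits, key=lambda e: e["when"])["who"]
print(last_visitor)  # Bob
```

The hard part, as the next slides show, is filling the "who" slot reliably in the first place.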
Person Identification Challenge

[Figure: the same event model as above, but with the "who" slot unresolved – Alice, Bob, or other?]
Traditional Approach

Face Detection → Face Recognition → ???
• Detects 70 faces per 1,000 images
• 2–3 images per person
• Poor performance
Rationale for Poor Performance

Poor quality of data: no faces, small faces, low resolution, low temporal resolution.
• Resolution: relative to the original, performance drops to 70% at 1/2 resolution and to 30% at 1/3 resolution.
• Sampling rate: relative to 1 frame/sec, performance drops to 53% at 1/2 frame/sec and to 35% at 1/3 frame/sec.
Effectiveness of Exploiting Relationships

• WEPS
• Multimedia [IQ2S PERCOM 2011]
Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

• Baseline: high precision, 662 clusters for 31 real persons (631 merges needed).
• With relationships: high precision, 203 clusters for 31 real persons (172 merges needed) – roughly 4 times fewer merges.
Overcoming the Quality Curse (2)…

Look outside the box.
Exploiting Search Engine Statistics

Google search results for "Andrew McCallum"

• Correlations amongst context entities provide an additional source of information to resolve entities.

Context entities: Sebastian Thrun, Machine Learning, Text Retrieval; Tom Mitchell, CRF, UAI 2003

Search engine queries to learn correlations amongst contexts:
• Sebastian Thrun AND Tom Mitchell
• Andrew McCallum AND Sebastian Thrun AND Tom Mitchell
• (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
• Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
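One plausible way to turn hit counts for such queries into a correlation score is pointwise mutual information; this sketch is an assumption about the general idea, not the talk's actual scoring function, and the web size N and all hit counts are made-up numbers.

```python
# Turn (hypothetical) search-engine hit counts into a correlation score
# between two context entities via pointwise mutual information (PMI).
from math import log2

N = 10_000_000_000  # assumed number of indexed pages (illustrative)

def pmi(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """PMI of two terms from individual and joint (AND-query) hit counts."""
    p_a, p_b, p_ab = hits_a / N, hits_b / N, hits_ab / N
    return log2(p_ab / (p_a * p_b))

# e.g. "Sebastian Thrun", "Tom Mitchell", "Sebastian Thrun AND Tom Mitchell"
score = pmi(hits_a=400_000, hits_b=900_000, hits_ab=60_000)
print(round(score, 2))  # strongly positive: the two names co-occur
```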
Exploiting Web Search Engine Statistics – Ph.D. Thesis, Rabia Nuray
• Web queries to learn correlations [SIGIR 08]
• Application to Web people search [WePS 09]
• Cluster refinement to overcome the singleton cluster problem [TODS 11-a]
• Making Web querying robust to server-side fluctuations [tech. report]
• Scaling up the Web query technique [TODS 11-a]
Comparing with the State-of-the-Art on the WEPS-2 Dataset
Observation/Conclusion…

• Additional evidence can be exploited to improve data quality.
• BUT … it is expensive!
• Example: the Web queries approach
  – Number of queries: 4K² (~40K for 100 results)
  – Too many to submit to a search engine and expect real-time results
  – ~6–8 minutes (network costs, search engine load)
• Solutions:
  – Local caching of the Web
  – Ask only the important queries
  – Reduces to 1–2 min. without degrading quality much
(Near) Future: Addressing the Efficiency Curse …
Improving Efficiency
Improving Quality
New Domains
DASFAA 2003
Two complementary approaches

• Pay-as-you-go data cleaning – a progressive algorithm to obtain the best quality under a given budget constraint.
• Query-driven data cleaning – perform the minimal cleaning needed to answer the query/analysis task, preventing having to clean unnecessary data.
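The pay-as-you-go idea can be sketched as a budgeted loop that spends expensive resolve operations on the most promising candidate pairs first; the scoring, pairs, and stand-in resolver below are illustrative assumptions, not the actual algorithm.

```python
# Pay-as-you-go sketch: with a budget of expensive resolve operations,
# process candidate pairs in order of promise so that the best achievable
# quality is obtained for whatever budget is available.
def progressive_clean(pairs, resolve, budget):
    """pairs: [(score, u, v)]; resolve: an expensive pairwise comparison."""
    merged = []
    for score, u, v in sorted(pairs, reverse=True):  # most promising first
        if budget == 0:
            break
        budget -= 1
        if resolve(u, v):
            merged.append((u, v))
    return merged

pairs = [(0.9, "J. Smith", "John Smith"),
         (0.2, "J. Smith", "Jane Smythe"),
         (0.8, "D. White", "Don White")]
cheap_resolve = lambda u, v: u.split()[-1] == v.split()[-1]  # toy stand-in

# With budget 2, only the two most promising pairs are ever resolved.
print(progressive_clean(pairs, cheap_resolve, budget=2))
```

Query-driven cleaning would instead restrict `pairs` to the candidates that could actually affect the answer to the query at hand.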