Data Repairing
![Page 1: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/1.jpg)
Data Repairing
Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg
![Page 2: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/2.jpg)
Slide 2
Structure
• Part I: problem statement and proposed solution (D2.2)
  ◦ Sketch (also presented in the previous review)
• Part II: complexity analysis and performance evaluation (D2.2)
  ◦ Shows scalability and performance properties
  ◦ Improved, compared to D2.2
• Part III: application of repairing in a real setting (D4.4)
  ◦ Result of collaboration between partners/WPs
  ◦ Shows applicability, experimentation in real-world data and setting
![Page 3: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/3.jpg)
Slide 3
PART I: Problem Statement and Proposed Solution (D2.2)
![Page 4: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/4.jpg)
Slide 4
Validity as a Quality Indicator
• Validity is an important quality indicator
  ◦ Encodes context- or application-specific requirements
  ◦ Applications may be useless over invalid data
  ◦ Binary concept (valid/invalid)
• Two steps to guarantee validity:
  1. Identifying invalid ontologies (diagnosis)
     – Detecting invalidities in an automated manner
     – Subtask of Quality Assessment
  2. Removing invalidities (repair)
     – Repairing invalidities in an automated manner
     – Subtask of Quality Enhancement
![Page 5: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/5.jpg)
Slide 5
Main Idea
• Expressing validity using validity rules over an adequate relational schema, e.g.:
  ◦ Properties must have a unique domain
    – $\forall p\; Prop(p) \rightarrow \exists a\; Dom(p,a)$
    – $\forall p,a,b\; Dom(p,a) \wedge Dom(p,b) \rightarrow (a=b)$
  ◦ Correct classification in property instances
    – $\forall x,y,p,a\; P\_Inst(x,y,p) \wedge Dom(p,a) \rightarrow C\_Inst(x,a)$
    – $\forall x,y,p,a\; P\_Inst(x,y,p) \wedge Rng(p,a) \rightarrow C\_Inst(y,a)$
• Syntactical manipulations on rules allow:
  ◦ Diagnosis (reduced to relational queries)
  ◦ Repair (identify repairing options per violation)
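The reduction of diagnosis to relational queries can be sketched as follows. This is only an illustration, assuming each relation is held as a Python set of tuples; the function names and the storage layout are hypothetical and not taken from the slides. The sample facts are those of ontology O0 in the backup slides.

```python
def diagnose_domain_classification(p_inst, dom, c_inst):
    """Return every (x, y, p, a) that violates
    P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)."""
    return sorted(
        (x, y, p, a)
        for (x, y, p) in p_inst
        for (q, a) in dom
        if q == p and (x, a) not in c_inst
    )


def diagnose_unique_domain(dom):
    """Return every (p, a, b), a < b, that violates
    Dom(p,a) AND Dom(p,b) -> a = b."""
    return sorted(
        (p, a, b)
        for (p, a) in dom
        for (q, b) in dom
        if p == q and a < b
    )


# Facts of ontology O0 (from the backup slides).
P_INST = {("Item1", "ST1", "geo:location")}
DOM = {("geo:location", "Sensor")}
C_INST = {("Item1", "Observation"), ("ST1", "SpatialThing")}

violations = diagnose_domain_classification(P_INST, DOM, C_INST)
```

Each diagnosis function is a join over the relations named in the rule body, filtered by the absence of the rule head, which is the shape a relational query for the same rule would take.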
![Page 6: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/6.jpg)
Slide 6
Preferences for Repair
• Which repairing option is best?
  ◦ Ontology engineer determines that via preferences
• Preferences
  ◦ Specified by ontology engineer beforehand
  ◦ High-level “specifications” for the ideal repair
  ◦ Serve as “instructions” to determine the preferred (optimal) solution
![Page 7: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/7.jpg)
Slide 7
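Preferences (On Ontologies)
[Figure: ontology O0 and the repaired ontologies O1, O2 and O3, each resulting ontology annotated with a score; the scores shown are 3, 4 and 6.]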
![Page 8: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/8.jpg)
Slide 8
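Preferences (On Deltas)
[Figure: ontology O0 and the repaired ontologies O1, O2 and O3, each delta annotated with a score; the scores shown are 2, 4 and 5. The deltas are:]
  ◦ -P_Inst(Item1, ST1, geo:location)
  ◦ +C_Inst(Item1, Sensor)
  ◦ -Dom(geo:location, Sensor)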
![Page 9: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/9.jpg)
Slide 9
Preferences
• Preferences on ontologies are result-oriented
  ◦ Consider the quality of the repair result
  ◦ Ignore the impact of repair
  ◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure
• Preferences on deltas are impact-oriented
  ◦ Consider the impact of repair
  ◦ Ignore the quality of the repair result
  ◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
• Properties of preferences
  ◦ Preferences on ontologies/deltas are equivalent
  ◦ Quality metrics can be used for stating preferences
  ◦ Metadata on the data can be used (e.g., provenance)
  ◦ Can be qualitative or quantitative
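A quantitative preference on deltas could look like the sketch below, which combines two of the popular options listed above: minimize schema changes first, then minimize delta size. The split of relations into schema and data level, and the lexicographic scoring, are assumptions made for illustration; the three deltas are the ones shown on the "Preferences (On Deltas)" slide.

```python
# Assumed set of schema-level relations; everything else counts as data.
SCHEMA_RELATIONS = {"Class", "Prop", "Dom", "Rng", "C_Sub"}


def score(delta):
    """Lower is better: (number of schema changes, total delta size)."""
    schema_changes = sum(
        1 for (_sign, relation, _args) in delta if relation in SCHEMA_RELATIONS
    )
    return (schema_changes, len(delta))


def preferred(deltas):
    """Return all deltas that share the best (lowest) score."""
    best = min(score(d) for d in deltas)
    return [d for d in deltas if score(d) == best]


DELTAS = [
    [("-", "P_Inst", ("Item1", "ST1", "geo:location"))],
    [("+", "C_Inst", ("Item1", "Sensor"))],
    [("-", "Dom", ("geo:location", "Sensor"))],
]

best_deltas = preferred(DELTAS)
```

Under this hypothetical preference the two data-level deltas tie, and the delta that removes the domain declaration is rejected because it touches the schema.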
![Page 10: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/10.jpg)
Slide 10
Generalizing the Approach
• For one violated rule
  1. Diagnose invalidity
  2. Determine minimal ways to resolve it
  3. Determine and return preferred (optimal) resolution
• For many violated rules
  ◦ Problem becomes more complicated
  ◦ More than one resolution step is required
• Issues:
  1. Resolution order
  2. When and how to filter non-optimal solutions?
  3. Rule (and resolution) interdependencies
![Page 11: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/11.jpg)
Slide 11
Rule Interdependencies
• A given resolution may:
  ◦ Cause other violations (bad)
  ◦ Resolve other violations (good)
• Optimal resolution unknown a priori
  ◦ Cannot predict a resolution’s ramifications
  ◦ Exhaustive, recursive search required (resolution tree)
• Two ways to create the resolution tree
  ◦ Globally-optimal (GO) / locally-optimal (LO)
  ◦ When and how to filter non-optimal solutions?
![Page 12: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/12.jpg)
Slide 12
Resolution Tree Creation (GO)
– Find all minimal resolutions for all the violated rules, then find the optimal ones
– Globally-optimal (GO)
  ◦ Find all minimal resolutions for one violation
  ◦ Explore them all
  ◦ Repeat recursively until valid
  ◦ Return the optimal leaves
[Figure: resolution tree; the optimal leaves are the optimal repairs (returned)]
![Page 13: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/13.jpg)
Slide 13
Resolution Tree Creation (LO)
– Find the minimal and optimal resolutions for one violated rule, then repeat for the next
– Locally-optimal (LO)
  ◦ Find all minimal resolutions for one violation
  ◦ Explore the optimal one(s)
  ◦ Repeat recursively until valid
  ◦ Return all remaining leaves
[Figure: resolution tree; the remaining leaf is the optimal repair (returned)]
![Page 14: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/14.jpg)
Slide 14
Comparison (GO versus LO)
• Characteristics of GO
  ◦ Exhaustive
  ◦ Less efficient: large resolution trees
  ◦ Always returns optimal repairs
  ◦ Insensitive to rule syntax
  ◦ Does not depend on resolution order
• Characteristics of LO
  ◦ Greedy
  ◦ More efficient: small resolution trees
  ◦ Does not always return optimal repairs
  ◦ Sensitive to rule syntax
  ◦ Depends on resolution order
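The difference between the two strategies can be sketched on the simplest rule used later in the deck, "all properties must be functional". The code is a toy illustration, not the authors' implementation: the triple layout `(subject, property, value, source)`, the `WEIGHT` table, the `cost` function and the `CityA` data are all hypothetical. GO explores every resolution and filters at the leaves; LO prunes to the locally cheapest option(s) at each step.

```python
from itertools import combinations

# Hypothetical weights: removing a triple from a trusted source costs more.
WEIGHT = {"PT": 3, "EN": 2, "FR": 1}


def cost(removed):
    """Preference on deltas: total weight of the removed triples."""
    return sum(WEIGHT[source] for (_s, _p, _v, source) in removed)


def find_violations(triples):
    """Diagnosis: pairs giving one subject two values for one property."""
    return [
        (t, u)
        for t, u in combinations(sorted(triples), 2)
        if t[:2] == u[:2] and t[2] != u[2]
    ]


def repair(triples, locally_optimal=False, removed=frozenset()):
    """Build the resolution tree; return its leaves as (removed, repaired)."""
    violations = find_violations(triples)
    if not violations:
        return [(removed, triples)]  # valid ontology: this node is a leaf
    options = list(violations[0])  # minimal resolutions: delete either triple
    if locally_optimal:
        # LO: keep only the cheapest resolution(s) of this violation.
        best = min(cost(removed | {o}) for o in options)
        options = [o for o in options if cost(removed | {o}) == best]
    leaves = []
    for o in options:
        leaves += repair(triples - {o}, locally_optimal, removed | {o})
    return leaves


def optimal_repairs(leaves):
    """Final filter (needed by GO): keep the repairs with the cheapest delta."""
    best = min(cost(removed) for removed, _ in leaves)
    return {repaired for removed, repaired in leaves if cost(removed) == best}


TRIPLES = frozenset({
    ("CityA", "populationTotal", 100, "PT"),
    ("CityA", "populationTotal", 110, "EN"),
    ("CityA", "populationTotal", 120, "FR"),
})

go_leaves = repair(TRIPLES)                        # exhaustive tree
lo_leaves = repair(TRIPLES, locally_optimal=True)  # greedy tree
```

On this input the GO tree has four leaves and the LO tree has one, and both end up keeping the PT value. That agreement is a property of this small example only; as the comparison above states, LO does not always return optimal repairs.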
![Page 15: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/15.jpg)
Slide 15
PART II: Complexity Analysis and Performance Evaluation (D2.2)
![Page 16: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/16.jpg)
Slide 16
Complexity Analysis
• Detailed complexity analysis for GO/LO and various different types of rules and preferences
• Inherently difficult problem
  ◦ Exponential complexity (in general)
  ◦ Exception: LO is polynomial (in special cases)
• Theoretical complexity is misleading as to the actual performance of the algorithms
![Page 17: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/17.jpg)
Slide 17
Performance in Practice
• Performance in practice
  ◦ Linear with respect to ontology size
  ◦ Linear with respect to tree size
    – Types of violated rules (tree width)
    – Number of violations (tree height): causes the exponential blowup
    – Rule interdependencies (tree height)
    – Preference (for LO): affects pruning (tree width)
• Further performance improvement
  ◦ Use optimizations
  ◦ Use LO with restrictive preference
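To see why tree height drives the blowup, consider an idealized resolution tree. The assumptions here are illustrative and not from the slides: every violation has the same number $w > 1$ of minimal resolutions, and no resolution causes or resolves another violation, so the height $h$ equals the number of violations. The number of nodes $N$ is then a geometric sum:

$$N = \sum_{i=0}^{h} w^{i} = \frac{w^{h+1} - 1}{w - 1}$$

For instance, $w = 2$ and $h = 26$ give $N = 2^{27} - 1 = 134{,}217{,}727$ nodes. Width enters as the base and height as the exponent, and a restrictive LO preference that keeps a single option per step reduces the tree to a path of $h + 1$ nodes.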
![Page 18: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/18.jpg)
Slide 18
Effect of Ontology Size
[Figure: execution time in seconds (log scale) against ontology size in triples (x1000, log scale). Series: Diagnosis; GO Repair, 16 violations; GO Repair, 26 violations; LO Repair, 16 violations; LO Repair, 26 violations.]
![Page 19: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/19.jpg)
Slide 19
Effect of Tree Size (GO)
[Figure: GO execution time (sec) against number of nodes (x10^6). Series: 1M, 5M, 10M, 15M, 20M.]
![Page 20: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/20.jpg)
Slide 20
Effect of Tree Size (LO)
[Figure: LO execution time (sec) against number of nodes. Series: 1M, 5M, 10M, 15M, 20M.]
![Page 21: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/21.jpg)
Slide 21
Effect of Violations (GO)
[Figure: GO execution time (sec) against number of violations (0 to 28). Series: 1M, 5M, 10M, 15M, 20M.]
![Page 22: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/22.jpg)
Slide 22
Effect of Violations (LO)
[Figure: LO execution time (sec) against number of violations (0 to 28). Series: 1M, 5M, 10M, 15M, 20M.]
![Page 23: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/23.jpg)
Slide 23
Effect of Preference (LO)
[Figure: execution time in seconds (log scale) against number of violations (0 to 28). Series: LO with P0, LO with P1, LO with P2, LO with P3, GO.]
![Page 24: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/24.jpg)
Slide 24
Quality of LO Repairs
[Figure: CCD; two panels labelled Max( ) and Min( ), each plotting # Pref. Rep. Deltas against # Violations (0 to 21). Series: GO∩LO, GO\LO, LO\GO.]
![Page 25: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/25.jpg)
Slide 25
PART III: Application of Repairing in a Real Setting (D4.4)
![Page 26: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/26.jpg)
Slide 26
Objectives and Main Idea
• Repair real datasets using preferences based on metadata
• Purpose:
  ◦ WP2: Evaluate repairing in a real LOD setting
  ◦ WP3: Evaluate the usefulness of provenance, recency, etc. as preferences for repair
  ◦ WP4: Validate the utility of WP4 resources for a data quality benchmark
![Page 27: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/27.jpg)
Slide 27
Motivating Scenario
• User seeks information on Brazilian cities
  ◦ Fuses Wikipedia dumps from various languages
• Guarantees maximal coverage, but may lead to conflicts
  ◦ E.g., cities with two different population counts
• Use repair to eliminate such conflicts
  ◦ Using our repairing method
  ◦ Using adequate preferences based on metadata
[Figure: the fused Wikipedia language editions, labelled EN, PT, ES, FR, GE]
![Page 28: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/28.jpg)
Slide 28
Experimental Setting
• Input
  ◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
  ◦ Distilled information about three properties of Brazilian cities: populationTotal, areaTotal, foundingDate
• Repair parameters
  ◦ Validity rules: all properties must be functional
  ◦ Preferences: 5 preferences based on metadata
• Evaluation
  ◦ Quality of result along 5 dimensions: consistency, validity, conciseness, completeness, accuracy
![Page 29: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/29.jpg)
Slide 29
Preferences (1/2)
1. PREFER_PT: resolve conflicts based on source (PT>EN>SP>GE>FR)
2. PREFER_RECENT: resolve conflicts based on recency (most recent data is preferred)
3. PLAUSIBLE_PT: drop “irrational” data (population<500, area<300 km², founding date<1500 AD); resolve remaining conflicts based on source
![Page 30: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/30.jpg)
Slide 30
Preferences (2/2)
4. WEIGHTED_RECENT: resolve conflicts based on recency, but if the conflicting records are almost equally recent (less than 3 months apart), then resolve based on source
5. CONDITIONAL_PT: resolve conflicts based on source but change the order depending on the data (prefer PT for small cities with population<500,000, prefer EN for the rest)
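Three of these preferences can be sketched as selection functions over the conflicting records of one property. The record layout, the sample values and dates, and the reading of "3 months" as 90 days measured from the most recent record are all assumptions for illustration; only the source order comes from the slides.

```python
from datetime import date, timedelta

SOURCE_ORDER = ["PT", "EN", "SP", "GE", "FR"]  # order stated by PREFER_PT


def prefer_pt(records):
    """PREFER_PT: keep the record from the highest-ranked source."""
    return min(records, key=lambda r: SOURCE_ORDER.index(r["source"]))


def prefer_recent(records):
    """PREFER_RECENT: keep the most recently modified record."""
    return max(records, key=lambda r: r["modified"])


def weighted_recent(records, window=timedelta(days=90)):
    """WEIGHTED_RECENT: among records almost as recent as the newest one
    (assumed: less than `window` older), fall back to the source order."""
    newest = prefer_recent(records)["modified"]
    close = [r for r in records if newest - r["modified"] < window]
    return prefer_pt(close)


# Hypothetical conflicting population records for one city.
RECORDS = [
    {"value": 1000, "source": "PT", "modified": date(2012, 1, 10)},
    {"value": 1100, "source": "EN", "modified": date(2012, 3, 1)},
    {"value": 1200, "source": "FR", "modified": date(2011, 6, 1)},
]
```

With these records, PREFER_PT and WEIGHTED_RECENT keep the PT value because the PT and EN records are 51 days apart, while PREFER_RECENT keeps the EN value.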
![Page 31: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/31.jpg)
Slide 31
Consistency, Validity
• Consistency
  ◦ Lack of conflicting triples
  ◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
• Validity
  ◦ Lack of rule violations
  ◦ Coincides with consistency for this example
  ◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
![Page 32: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/32.jpg)
Slide 32
Conciseness, Completeness
• Conciseness
  ◦ No duplicates in the final result
  ◦ Guaranteed to be perfect (by the fuse process), regardless of preference
• Completeness
  ◦ Coverage of information
  ◦ Improved by fusion
  ◦ Unaffected by the repairing algorithm
  ◦ Input completeness = output completeness, regardless of preference
  ◦ Measured to be at 77.02%
![Page 33: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/33.jpg)
Slide 33
Accuracy
• Most important metric for this experiment
• Accuracy
  ◦ Closeness to the “actual state of affairs”
  ◦ Affected by the repairing choices
• Compared repair with the Gold Standard
  ◦ Taken from an official and independent data source (IBGE)
![Page 34: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/34.jpg)
Slide 34
Accuracy Examples
• City of Aracati
  ◦ Population: 69159/69616 (conflicting)
  ◦ Record in Gold Standard: 69159
  ◦ Good choice: 69159
  ◦ Bad choice: 69616
• City of Oiapoque
  ◦ Population: 20226/20426 (conflicting)
  ◦ Record in Gold Standard: 20509
  ◦ Optimal approximation choice: 20426
  ◦ Sub-optimal approximation choice: 20226
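The four labels used in these examples can be reproduced with a small classifier. This is one plausible reading of the slide, assuming "closeness" means absolute numeric distance to the Gold Standard value; the function name is hypothetical.

```python
def classify_choice(choice, candidates, gold):
    """Label a repair choice against the Gold Standard value."""
    if gold in candidates:
        # The correct value was available: the choice is either right or wrong.
        return "good" if choice == gold else "bad"
    # Otherwise judge by distance to the Gold Standard (assumed measure).
    closest = min(candidates, key=lambda v: abs(v - gold))
    if choice == closest:
        return "optimal approximation"
    return "sub-optimal approximation"


ARACATI = ([69159, 69616], 69159)   # (conflicting values, Gold Standard)
OIAPOQUE = ([20226, 20426], 20509)
```

For Oiapoque neither candidate matches the Gold Standard, so 20426 (83 away) is the optimal approximation and 20226 (283 away) the sub-optimal one.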
![Page 35: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/35.jpg)
Slide 35
Accuracy Results
![Page 36: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/36.jpg)
Slide 36
Accuracy of Input and Output
![Page 37: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/37.jpg)
Slide 37
Publications
• Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
• Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
• Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review in TODS Journal.
![Page 38: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/38.jpg)
Slide 38
BACKUP SLIDES
![Page 39: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/39.jpg)
Slide 39
Repair
• Removing invalidities by changing the ontology in an adequate manner
• General concerns:
  1. Return a valid ontology
     – Strict requirement
  2. Minimize the impact of repair upon the data
     – Make minor, targeted modifications that repair the ontology without changing it too much
  3. Return a “good” repair
     – Emulate the changes that the ontology engineer would do for repairing the ontology
![Page 40: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/40.jpg)
Slide 40
Inference
• Inference expressed using validity rules
• Example:
  ◦ Transitivity of class subsumption
  ◦ $\forall a,b,c\; C\_Sub(a,b) \wedge C\_Sub(b,c) \rightarrow C\_Sub(a,c)$
• In practice we use labeling algorithms
  ◦ Avoid explicitly storing the inferred knowledge
  ◦ Improve efficiency of reasoning
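What the transitivity rule entails can be shown with a naive fixed-point computation over C_Sub pairs. This is only a sketch of the rule's semantics with hypothetical class names; it materializes every inferred pair, which is precisely what the labeling algorithms mentioned above are said to avoid.

```python
def transitive_closure(c_sub):
    """Apply C_Sub(a,b) AND C_Sub(b,c) -> C_Sub(a,c) until nothing is added."""
    closure = set(c_sub)
    while True:
        inferred = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2
        } - closure
        if not inferred:
            return closure
        closure |= inferred


# Hypothetical subsumption chain A < B < C < D.
C_SUB = {("A", "B"), ("B", "C"), ("C", "D")}
closed = transitive_closure(C_SUB)
```

The three explicit pairs grow to six once the inferred ones are stored, which illustrates the storage cost that motivates avoiding explicit materialization.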
![Page 41: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/41.jpg)
Slide 41
Example (Diagnosis/Repair)
• Ontology O0
  ◦ Class(Sensor), Class(SpatialThing), Class(Observation)
  ◦ Prop(geo:location)
  ◦ Dom(geo:location,Sensor)
  ◦ Rng(geo:location,SpatialThing)
  ◦ Inst(Item1), Inst(ST1)
  ◦ P_Inst(Item1,ST1,geo:location)
  ◦ C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
• Rule: correct classification in property instances
  ◦ $\forall x,y,p,a\; P\_Inst(x,y,p) \wedge Dom(p,a) \rightarrow C\_Inst(x,a)$
[Figure: schema level with classes Sensor, SpatialThing and Observation and property geo:location; data level with Item1 linked to ST1 by geo:location]
• Diagnosis
  ◦ Item1 geo:location ST1: $P\_Inst(Item1,ST1,geo{:}location) \in O_0$
  ◦ Sensor is the domain of geo:location: $Dom(geo{:}location,Sensor) \in O_0$
  ◦ Item1 is not a Sensor: $C\_Inst(Item1,Sensor) \notin O_0$
• Repair options
  ◦ Remove P_Inst(Item1,ST1,geo:location)
  ◦ Add C_Inst(Item1,Sensor)
  ◦ Remove Dom(geo:location,Sensor)
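The repair options of this example follow a simple pattern for a rule whose body is a conjunction and whose head is a single atom: delete any one of the matched body facts, or add the missing head fact. The sketch below encodes that pattern; the function name and the `(relation, arguments)` fact encoding are hypothetical, and rules with other shapes (such as existential or equality heads) are not covered.

```python
def resolution_options(body_facts, head_fact):
    """Minimal resolutions of one violation of `body -> head`:
    remove one matched body fact, or add the missing head fact."""
    return [("-", fact) for fact in body_facts] + [("+", head_fact)]


# The violation diagnosed in ontology O0.
BODY = [
    ("P_Inst", ("Item1", "ST1", "geo:location")),
    ("Dom", ("geo:location", "Sensor")),
]
HEAD = ("C_Inst", ("Item1", "Sensor"))

options = resolution_options(BODY, HEAD)
```

The three entries of `options` correspond to the three repair options listed for O0.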
![Page 42: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/42.jpg)
Slide 42
Quality Assessment
• Quality = “fitness for use”
  ◦ Multi-dimensional, multi-faceted, context-dependent
• Methodology for quality assessment
  ◦ Dimensions
    – Aspects of quality
    – Accuracy, completeness, timeliness, …
  ◦ Indicators
    – Metadata values for measuring dimensions
    – Last modification date (related to timeliness)
  ◦ Scoring Functions
    – Functions to quantify quality indicators
    – Days since last modification date
  ◦ Metrics
    – Measures of dimensions (result of scoring function)
    – Can be combined
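The chain from indicator to scoring function to metric can be illustrated for timeliness. The scoring function below is the one named on the slide (days since last modification); the normalization into a 0..1 metric and its 365-day horizon are assumptions added for the example.

```python
from datetime import date


def days_since_modification(last_modified, today):
    """Scoring function: quantify the 'last modification date' indicator."""
    return (today - last_modified).days


def timeliness(last_modified, today, horizon_days=365):
    """Hypothetical metric: 1.0 for data modified today, falling linearly
    to 0.0 once the data is `horizon_days` old or older."""
    age = days_since_modification(last_modified, today)
    return max(0.0, 1.0 - age / horizon_days)


REVIEW_DAY = date(2012, 12, 11)
```

A metric normalized this way can be combined with metrics for other dimensions, for example through a weighted sum.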
![Page 43: Data Repairing](https://reader036.fdocuments.us/reader036/viewer/2022062315/56816553550346895dd7cf2a/html5/thumbnails/43.jpg)
Slide 43
Accuracy Evaluation
[Figure: en.dbpedia, pt.dbpedia, fr.dbpedia, … are combined (Fuse/Repair) into the integrated data, which is then compared (Compare) with the Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE) to obtain Accuracy. Properties on both sides: dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate.]