Workshop on Global Scientific Data Infrastructures:
The Big Data Challenges
Capri, 12-13 May 2011
Carlo Batini, University of Milano Bicocca
Data Quality
State of the art in 2006
State of the art in Data Quality for business/administrative data managed in relational databases:
• Dimensions
• Metrics
• Models
• Techniques
• Methodologies for assessment and improvement
About 65 papers considered, grouped according to three evolutionary trends
DB&IS data
Web data
Scientific data
2006-2011
1980-2011
2000-2011
Main issues in the literature around DQ
Type of data representation
DQ Dimension
Formal Technique/Formal model
Data source
Application domain
ICT technology considered
Type of assessment/Improvement method
Lot of them!
Lot of them!
• Generic
• Administrative
• Business
• Aggregate DW
• Scientific
• Web
• UGC, newspapers and TVs

• Structured data
• Semistructured data
• Unstructured data
• Specific sd. (e.g. laws)
• Maps
• Images
• Hybrid

• (Relational) DBMSs
• Sensor networks
• Receptors (RFID)
• Cloud/grid

• Biosciences & Genetics
• Chemistry
• Earth & Environment
• Health sciences
• Meteorology
• Neurobiology
• Zoology
• Oceanography
• Physics
• Universe & Astronomy
• Decision Support Systems
• Operational processes
The three worlds: DB&IS/Web/Scientific data together
Bridge-authors
DB & IS authors
SD authors
WEB authors
Missier et al.
Geerts
Srivastava
Widom
Elmagarmid
Bertino
Gray
Borgida & Mylopoulos
Batini Cappiello
Weikum
Areas of contributions for DB&IS, Web and Sc data
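| Issue | DB & IS | Web | ScD |
|---|---|---|---|
| Model | 26% | 22% | 22% |
| Technique/Algorithm | 20% | 38% | 28% |
| Methodology | 17% | | 7% |
| Guidelines | | 14% | 30% |
| Framework | 10% | | 9% |
| Survey | 17% | | |
| Extension to new techn. | 5% | | |
| Analysis | 5% | 26% | 4% |
| Conjecture | 5% | | |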
Classifications: DB&IS vs SD DQ classifications
DB & IS data - 1
1. Intrinsic
2. Contextual
3. Representational
4. Accessibility

DB & IS data - 2
1. Product – Conforms to specs
2. Product – Meets consumer expectations
3. Service – Conforms to specs
4. Service – Meets consumer expectations

Scientific data
1. Experimental reading – timed
   a. Time dep. datum
   b. Time dep. datum relationship
2. Experimental reading – untimed
   a. Datum
   b. Datum relationship
   c. Instrument dependent datum
3. Experimental conditions – timed
   a. Time dep. instrument
   b. Time dep. instr. relationship
4. Experimental conditions – untimed
   a. Instrument
   b. Instrument relationship
   c. Datum-dependent datum
Dimensions in the three areas
| Dim. in DB & IS data | # of occur. | Dim. in scientific data | # of occur. | Dim. in Web data | # of occur. |
|---|---|---|---|---|---|
| Accuracy | 7 | Metadata quality | several | Coherence | 4 |
| Completeness | 4 | Consistency | 7 | Completeness | 4 |
| Currency | 3 | Accuracy | 6 | Trustworthiness | 3 |
| Value | 3 | Not specified | 4 | Accuracy | 3 |
| Extensional joinability on tuples | 2 | Completeness | 4 | Timeliness | 3 |
| Relevance | 2 | Currency | 4 | Credibility | 3 |
| Reliability | 2 | Languages for expressing quality dimens. | 4 | Currency | 2 |
| Timeliness | 2 | Sensor calibration | 3 | Accessibility | 1 |
| Consistency | 2 | Relevance | 3 | Believability | 1 |
| Accessibility | 1 | Availability | 3 | Blurriness | 1 |
| Believability | 1 | Timeliness | 3 | Clarity | 1 |
| Completeness of tuples | 1 | Integrity | 2 | Comprehensiveness | 1 |
| Completeness of values | 1 | Readability | 2 | Conciseness | 1 |
| Conditional functional dependencies | 1 | Reputation | 2 | Consistency | 1 |
| Confidence | 1 | Validity | 2 | Convenience | 1 |
| Damage to the decision | 1 | Accessibility - ease of access | 1 | Descriptive power of the content | 1 |
| Extensional joinability on groups of tuples | 1 | Accessibility - efficiency | 1 | Discriminative power of the content | 1 |
| Identifiability | 1 | Easy interchange | 1 | Functionality | 1 |
| Integrability | 1 | Easy integration | 1 | Maintenance | 1 |
| Intensional joinability | 1 | Interpretability | 1 | Objectivity | 1 |
| Interpretability | 1 | Confidence | 1 | Reliability | 1 |
| Precision | 1 | Reliability | 1 | Sharpness | 1 |
| Rectifiability | 1 | Preservation | 1 | Speed | 1 |
| Relative information completeness | 1 | Interoperability | 1 | Usability | 1 |
| Trust | 1 | Self description | 1 | Usefulness | 1 |
| | | Credibility | 1 | | |
| | | Information value | 1 | | |
| | | Classification correctness | 1 | | |
| | | Completeness of the metadata | 1 | | |
| | | Precision | 1 | | |
| | | Outliers identification | 1 | | |
| | | Link to publications | 1 | | |
| | | of ontologies/annotations (?) | 1 | | |
| | | Measurement error | 1 | | |

Legend (colors on the original slide): in blue, new dimensions in the DB&IS area and new DQ dimensions; in red, Gray's dimensions.
Novelties in DB&IS
DB&IS-2 Batini & Elmagarmid: Data Quality grounded in the more general framework of information economics
[Figure: information economics framework. Labels: D&I quality, D&I structure, D&I diffusion, information capacity, information utility, information value, data structure, ICT infrastructure, business process architecture, costs, accuracy, set of DBs, DI technologies, set of processes, ∆ new queries, ∆ new utility, ROI]
DB&IS-2 Batini et al.: Common thread (fil rouge) among DQ dimensions across the different representations and levels of information
Structured data instances
Unstructured data
Maps/Images
Logical/Conceptual Schemas
Ontologies
Batini – Tutorial ER Barcelona 2008
C. Batini, Tutorial ER Barcelona 2008: map at a glance of dimensions/types of information

Information types (columns): Structured data, Geographic Maps, Images, Unstructured Texts, Laws and legal frameworks

Quality dimension cluster: Correctness/Accuracy/Precision
• Schema: w.r.t. requirements, w.r.t. the model
• Instance: Syntactic, Semantic
• Domain dependent (e.g. Last Names, etc.)
• Instance: Spatial accuracy
  – Relative/Absolute/Relative inter-layer
  – Locally increased r.a./External/Internal
  – Neighbourhood a.
  – Vertical/Horizontal/Height
• Attribute accuracy
• Domain dependent accuracy (e.g. traffic at critical inters., etc.)
• Acc. of raster representation
• Accuracy: Syntactic, Semantic, "Reduced" semantic
• Genuineness, Fidelity, Naturalness
• Accuracy: Syntactic, Semantic
• Structural similarity
• Accuracy, Precision, Objectivity, Integrity, Correctness, Reference accuracy

Quality dimension cluster: Completeness/Pertinence
• Schema: Completeness, Pertinence
• Instance: Value C., Tuple C., Column C., Relation C., Database C.
• Completeness (between different datasets)
• Pertinence
• Completeness
• Completeness
• Objectivity, Completeness

Quality dimension cluster: Temporal
• Currency, Timeliness, Volatility
• Recency/Temporal accuracy/Temporal resolution

Quality dimension cluster: Minimality/Redundancy/Compactness/Cost
• Schema: Minimality, Redundancy
• Redundancy
• Minimality
• For a law: Conciseness
• For a legal framework: Minimality, Redundancy

Quality dimension cluster: Consistency/Coherence/Interoperability
• Instance: Intrarelational consistency, Interrelational consistency, Interoperability
• Consistency: Object consistency/Geometric consist., Topological consist.
• Interoperability
• Interoperability
• Cohesion
• Referential, Temporal, Locational, Causal, Structural
• Coherence
• Lexical/Nonlexical
• Coherence
• Consistency among laws
• Consistency among legal frameworks

Quality dimension cluster: Readability/Comprehensibility/Usability/Usefulness/Interpretability
• Schema: Diagrammatic readability, Compactness, Normalization
• Instance: Readability/Legibility, Clarity, Aesthetics
• Readability, Usefulness
• Readability
• Comprehensibility
• Clarity
• Simplicity

Quality dimension cluster: Accessibility…
• Instance: Technological, Channel, Physical (W3C)
• Instance: Privacy
• Physical Accessibility (W3C)
• Cultural Accessibility
• Accessibility of the consolidated Act on a given domain

Quality dimension cluster: Others
• Lineage
• Effectiveness
• Lineage
• Adaptation
• Effectiveness, Transparency
• Usefulness
• Applicability, Accountability
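The instance-level completeness entries in the map (Value C., Tuple C., Column C., Relation C.) can be illustrated with a short sketch. The definitions below are one common reading, assumed here for illustration: a missing value is represented by `None`, and each measure is the fraction of non-missing values at the given granularity. The function names and the sample relation are hypothetical and do not come from the tutorial.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def value_completeness(value: Optional[object]) -> int:
    """1 if the value is present, 0 if it is missing (None)."""
    return 0 if value is None else 1


def tuple_completeness(row: Row) -> float:
    """Fraction of non-missing values in one tuple."""
    return sum(value_completeness(v) for v in row) / len(row)


def column_completeness(rows: Sequence[Row], col: int) -> float:
    """Fraction of non-missing values in one column."""
    return sum(value_completeness(r[col]) for r in rows) / len(rows)


def relation_completeness(rows: Sequence[Row]) -> float:
    """Fraction of non-missing values over the whole relation."""
    total = sum(len(r) for r in rows)
    present = sum(value_completeness(v) for r in rows for v in r)
    return present / total


# Hypothetical relation: (first name, last name, e-mail).
people = [
    ("Ada", "Lovelace", "ada@example.org"),
    ("Alan", None, None),
    ("Grace", "Hopper", None),
]

print(tuple_completeness(people[1]))
print(column_completeness(people, 2))
print(relation_completeness(people))
```

The same relation can look quite different depending on the granularity chosen: the first tuple is fully complete, while the e-mail column is mostly empty.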
DB&IS-3 Batini et al.: Common thread (fil rouge) among DQ dimensions across the different representations and levels of information
Structured data instances
Unstructured data
Maps/Images
Logical/Conceptual Schemas
Ontologies
Batini – Tutorial CAiSE Hammamet 2010
DB&IS-4 New models/techniques for DQ
1. Geerts et al. – From OWA vs CWA to hybrid OWA/CWA (master data): relative information completeness
2. Srivastava et al. – From record linkage to temporal RL and group RL
3. Elmagarmid et al. – Learning in data repair from user feedback
4. Borgida & Mylopoulos – Formal framework based on the theory of linguistic signs
5. Geerts et al. – Integrity constraints strike again… for repair, for object identification
DQ & Scientific data: an unconscious marriage
Main technologies in SD
• Networks of sensors
• Cloud
• SOA
• Multicore computers
• Workflow languages
Data sources in scientific data
• Observation and measurement of phenomena through sensors (raw data, images, etc.)
• Databases
• Web
Types of environmental sensor networks
Types of data in DB&IS and SD
Types of data in a typical scientific data workflow
• Device monitoring data, for diagnostic and proactive maintenance purposes
• RD: Raw data, produced by instrumentation
• CD: Calibrated data, result of removing instrumental and environmental effects
• DD: Derived data, result of fusion of CDs and/or other DDs
• AD: Assimilated data, result of gridding, resampling, or changing the frame of reference of DDs
• Model data, produced by applying mathematical, statistical, or stochastic models to DDs or ADs
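The chain from raw to derived data can be sketched in a few lines. This is a minimal illustration under stated assumptions: calibration is modeled as removing a fixed instrument offset and applying a gain, and fusion is a plain average. The `Reading` class, the function names, and the sample values are hypothetical; real calibration and fusion procedures are instrument and domain specific.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float
    level: str  # "RD" (raw), "CD" (calibrated) or "DD" (derived)


def calibrate(raw: Reading, offset: float, gain: float) -> Reading:
    """RD -> CD: remove an assumed instrument offset, then apply a gain."""
    if raw.level != "RD":
        raise ValueError("calibrate() expects raw data")
    return Reading(raw.sensor_id, (raw.value - offset) * gain, "CD")


def derive(inputs: list[Reading]) -> Reading:
    """CDs and/or DDs -> DD: fuse the inputs, here by a plain average."""
    if any(r.level not in ("CD", "DD") for r in inputs):
        raise ValueError("derive() expects calibrated or derived data")
    return Reading("fused", mean(r.value for r in inputs), "DD")


raw = [Reading("t1", 21.5, "RD"), Reading("t2", 20.5, "RD")]
calibrated = [calibrate(r, offset=0.5, gain=1.0) for r in raw]
derived = derive(calibrated)
print(derived)
```

Keeping the level on each record makes it possible to reject a step applied to the wrong kind of input, which is one way the quality of a product depends on the integrity of the production process.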
Types of DB&IS data in the MIT Total Data Quality Management methodology
• Raw data
• Component data
• Information products
Types of data in…
Business and administrative data
• Raw/Component/Information Products
• Operational data
• Control data
• Decisional data

Scientific data
• Raw data
• Calibrated, curated/normalized data
• Derived/Assimilated/Model/Aggregated data
• Peer reviewed papers
Nine main characteristics of DQ in SD
1. Multiple, nested, interdependent combinations of RD, CD, DD, and ADs. [Similar in DB&IS data]
2. Quality is strongly dependent upon integrity of algorithms and production process. [Similar in DB&IS data]
3. Quality depends not only on the state of the observing system, but also on the state of the observed system. [Different in DB&IS data]
4. Continual refinement (for example, data cleansing) is often not performed, because the data product typically represents an observation at a given point in time. [Different in DB&IS data]
Main characteristics of DQ in SD (continued)
5. Data cleaning has limited applicability because there is often not a way to know the “correct” values. [Different in DB&IS data]
6. Repairing databases for duplicates or multiple data representations is typically not a concern. [Different in DB&IS data]
7. Data do not tend to become incorrect over time: the purpose of collecting the science data is to capture the state of the observed system at that time. [Different in DB&IS data]
8. DQ improves with time as new insights are gained regarding what constitutes quality in the product itself or within the production process. [Similar in DB&IS data]
9. Requires metadata so consumers can make judgments about validity and applicability of results. [Similar in DB&IS data]
The quality life cycle in DB&IS/SD

Computer Science perspective
• Assess
• Fix new quality targets
• Choose improvement activities
  – Data driven (e.g. record linkage)
  – Process driven (e.g. BPR)
• Evaluate costs
• Improve
• Monitor

SD perspective for digital libraries of SD
• Equipment selection
• Equipment calibration
• Ground truthing
• Thresholding

SD perspective for RFID data streams cleaning
• Point
• Smooth
• Merge
• Arbitrate
• Virtualize

Statistical perspective
– Preliminary screening for DQ
– Exploratory analysis
– Data inspection & anomaly detection
– DQ improvement
  – Standardize
  – Remove duplicates
  – Edit
The contribution of Missier et al.
Basic assumption - The meta workflow in eScience has a standard structure, while specific activities may vary significantly
Typical data processing pipeline in biology
Challenges and Solutions
Challenges
• Scientists do not need “one-size-fits-all” data quality dimensions; rather, they need domain-dependent dimensions.
• The quality workflow should be generated automatically from higher-level specs.
• DQ services should be reusable across scientific communities and adaptable.

Solutions
• A Quality View is defined that embodies the user scientists’ personal criteria for data acceptability.
• An ontology-based representation is proposed for the definition, making it possible to automatically compile Quality Views into executable quality processes that can be integrated with the users’ data processing workflow.
• Provenance metadata to enforce reusability and shareability.
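The idea of compiling a declarative Quality View into an executable acceptability test can be sketched as follows. This is a toy stand-in, not the ontology-based mechanism described above: a view is assumed to be a plain mapping from a quality evidence name to an accepted numeric range, and "compiling" it simply returns a predicate. All names, evidence types, and thresholds are hypothetical.

```python
from typing import Callable, Mapping

Evidence = Mapping[str, float]
QualityView = Mapping[str, tuple[float, float]]


def compile_quality_view(view: QualityView) -> Callable[[Evidence], bool]:
    """Turn {evidence_name: (low, high)} into an executable acceptability test."""

    def accept(evidence: Evidence) -> bool:
        # A data item is acceptable only if every required evidence value
        # is present and falls inside the range the scientist asked for.
        return all(
            name in evidence and low <= evidence[name] <= high
            for name, (low, high) in view.items()
        )

    return accept


# Hypothetical view expressing one scientist's personal criteria.
strict_view = {"hit_ratio": (0.8, 1.0), "mass_coverage": (0.3, 1.0)}
accept = compile_quality_view(strict_view)

# Hypothetical quality evidence attached to three data items.
hits = {
    "P1": {"hit_ratio": 0.90, "mass_coverage": 0.5},
    "P2": {"hit_ratio": 0.60, "mass_coverage": 0.7},
    "P3": {"hit_ratio": 0.95},
}
accepted = [item for item, evidence in hits.items() if accept(evidence)]
print(accepted)
```

Because the view is data rather than code, a different scientist can supply different criteria and obtain a different filter without touching the base workflow.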
Base workflow and quality view
Other relevant contributions in SD – Receptors and Sensor Networks
• Various cleaning strategies for data from RFID/other receptors and sensor networks (usually 30% of tag readings are dropped)
  – Improved strategies to fix the optimal window size in smoothing-filter-based cleaning techniques
  – Parametrization of the Point/Smooth/Merge/Arbitrate life cycle to other receptors & network stages (e.g. dropped messages)
  – Extension to outlier discovery
  – Early assessment of DQ in the network and propagation for user feedback
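A smoothing filter for dropped tag readings can be sketched as a fixed sliding window. This is a minimal illustration, assuming one Boolean reading per epoch for a single tag and a rule that reports the tag as present if it was read at least once in the last `window` epochs. The function name and the sample stream are hypothetical, and the window here is fixed by hand rather than chosen optimally or adaptively.

```python
def smooth(reads: list[bool], window: int) -> list[bool]:
    """Report the tag as present at epoch t if it was read at least once
    in the `window` most recent epochs (t included)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        any(reads[max(0, t - window + 1) : t + 1])
        for t in range(len(reads))
    ]


# Hypothetical stream: the tag is in range for epochs 0-3 (one reading
# dropped at epoch 1), leaves for epochs 4-6, and returns at epoch 7.
raw = [True, False, True, True, False, False, False, True]

print(smooth(raw, 1))  # no smoothing
print(smooth(raw, 2))  # fills the dropped reading, detects the departure late
print(smooth(raw, 4))  # hides the departure completely
```

The three calls show the trade-off behind window sizing: a small window leaves dropped readings in the output, while a large one masks real movement of the tag.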
A lot of contributions on
• Guidelines
• Lessons learned
• Frameworks
• Shared experiences
• Analyses
• And, finally, the policy of assessing scientific data with the same method as scientific papers: peer review (e.g. Molecules)
Contributions on DQ and Web data
WD1 - Weikum et al.: work on blurriness in Web archiving
1. Quality-conscious scheduling strategies. In Web archiving, crawlers should gather sharp captures of entire Web sites, but in practice crawls are performed while Web sites undergo changes. A stochastically optimal crawl algorithm is proposed.
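The intuition behind quality-conscious scheduling can be shown with a toy model. The model is an assumption made for illustration and is not the algorithm from the cited work: each page changes at a constant average rate, one page is downloaded per time slot, and the blur of a capture is the expected number of changes between each download and a reference time in the middle of the crawl. Under that model, blur is lowest when the fastest-changing pages are fetched closest to the reference time.

```python
def blur(rates: list[float], slots: list[int], t_ref: float) -> float:
    """Expected number of changes between each download and t_ref."""
    return sum(rate * abs(slot - t_ref) for rate, slot in zip(rates, slots))


def schedule(rates: list[float]) -> list[int]:
    """Assign one download slot per page: the higher the change rate,
    the closer the slot is to the middle of the crawl."""
    n = len(rates)
    t_ref = (n - 1) / 2
    slots_by_distance = sorted(range(n), key=lambda slot: abs(slot - t_ref))
    pages_by_rate = sorted(range(n), key=lambda page: rates[page], reverse=True)
    slot_of = [0] * n
    for page, slot in zip(pages_by_rate, slots_by_distance):
        slot_of[page] = slot
    return slot_of


# Hypothetical change rates (expected changes per slot) for five pages.
rates = [0.1, 2.0, 0.5, 1.0, 0.2]
t_ref = (len(rates) - 1) / 2

naive = list(range(len(rates)))  # download in listing order
planned = schedule(rates)

print(blur(rates, naive, t_ref))
print(blur(rates, planned, t_ref))
```

The planned order produces a sharper capture than the listing order for the same pages and the same crawl duration.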
WD2 – other relevant contributions
2. Trustworthiness (TR) and Credibility (CR) as first-class citizens
• TR: assessment techniques where data TR and source TR are considered together (e.g. Bertino: data path among criteria)
• CR: Flanagin & Hilligoss present a unifying framework for credibility assessment
  – across a variety of media and resources, and
  – for volunteered geographic information provision
3. Separate the wheat from the chaff
• Srivastava et al.: the role of copying/dependencies among sources in the Web
The future
• More investigation on Trustworthiness & Credibility
• DQ in linked data
• A parametric methodology tailored to different data representations, dimensions, assessment and improvement activities
• Cost and benefits of Data Quality
• Assessment and improvement on the fly by Context & Provenance
• Data and Schema quality together
References
• J. Almeida et al., On the Quality of Information for Web 2.0 Services, IEEE Internet Computing, 2010.
• I. Askira Gelman, Setting priorities for data accuracy improvements in satisficing decision-making scenarios: a guiding theory, Decision Support Systems, 2009.
• I. Askira Gelman, GIGO or not GIGO: the accuracy of multicriteria satisficing decisions, JDIQ, 2011.
• A. Avenali, C. Batini, Brokering infrastructure for minimum cost data procurement based on quality-quantity models, Decision Support Systems, 2008.
• C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies, Techniques, Springer Verlag, 2006.
• C. Batini, Quality of Data, Textual Information and Images: a comparative survey, Tutorial at ER Conference, Barcelona, 2008.
• C. Batini, D. Barone, F. Cabitza, G. Ciocca, F. Marini, G. Pasi, R. Schettini, Toward a Unified Model for Information Quality, QDB/MUD 2008: 113-122.
• C. Batini et al., Methodologies for data quality assessment and improvement, Computing Surveys, 2009.
• C. Batini et al., A capacity and value based model for data architectures adopting integration technologies, AMCIS 2011.
• C. Batini et al., A Data Quality Methodology for Heterogeneous Data, International Journal of Database Management Systems (IJDMS), Volume 3, Number 1, February 2011.
• L. Berti, D. Srivastava et al., Sailing the Information Ocean with awareness of currents: discovery and application of source dependence, CIDR 2009.
• C. Borgman et al., Drowning in Data: Digital Library Architecture to Support Scientific Use of Embedded Sensor Networks, JCDL 2007.
• H. Bui et al., Experience with BX Grid: a data repository and computing grid for biometrics research, Cluster Comput., 2009.
• C. Cappiello et al., Do you Trust in DQ, Dagstuhl Seminar, 2003.
• C. Cappiello, B. Pernici et al., HIQM: a methodology for IQ Monitoring, Measurement and Improvement, ER Workshops, 2006.
• C. Cappiello et al., Information Quality in Mashups, IEEE Internet Computing, 2010.
• H. Chen et al., Leveraging Spatio-Temporal Redundancy for RFID Data Cleansing, SIGMOD 2009.
• F. Chiang & R. Miller, Discovering Data Quality Rules, PVLDB 2008.
• M. Comerio et al., Service Oriented Quality Engineering and Data Publishing in the Cloud, SECO 2009.
• A. Corso-Radu et al., DQ Monitoring Framework for the ATLAS Experiment at the LHC, 2008.
• F. Daniel et al., Managing DQ in Business Intelligence applications, Workshop at VLDB 2006.
• C. Dai, E. Bertino et al., An approach to evaluate Data Trustworthiness based on Data Provenance, SDM 2008.
• F. Daniel, F. Casati et al., Managing Data Quality in BI applications, VLDB Workshops, 2008.
• D. Denev et al., SHARC: Framework for Quality-Conscious Web Archiving, VLDB 09, Lyon, France.
• M. Diepenbroek et al., PANGAEA: an information system for environmental sciences, Computers & Geosciences, 2002.
• W. Fan, F. Geerts, Capturing missing tuples and missing values, PODS 2010.
• W. Fan, F. Geerts, Relative information completeness, ACM Transactions on Database Systems, 2010.
• W. Fan, F. Geerts, A revival of integrity constraints for data cleaning, Tutorial at VLDB 2008.
• W. Fan, F. Geerts, Determining the currency of data, PODS 2011.
• J. Einbinder et al., Towards a framework for DQ in Health Care, ICIQ 2005.
• A. Flanagin et al., The credibility of volunteered geographic information, GeoJournal, 2008.
• I. Gallegos, A. Q. Gates, C. Tweedie, Toward Improving Environmental Sensor Data Quality: A Preliminary Property Categorization, 2010, unpublished.
• I. Gallegos, A. Q. Gates, C. Tweedie, DataPros: A Data Property Specification Tool to Capture Scientific Sensor Data Properties, ER Workshops, 2010.
• J. Gray et al., Scientific Data Management in the Coming Decade, SIGMOD Record, 2005.
• S. Jeffery, J. Widom et al., A Pipelined Framework for Online Cleaning of Sensor Data Streams, 2005.
• A. Jeffery et al., Adaptive Cleaning for RFID Data Streams, VLDB 06.
• L. Jiang, A. Borgida, and J. Mylopoulos, Towards a Compositional Semantic Account of DQ Attributes, ER 2008.
• J. Hart et al., Environmental Sensor Networks: a revolution in the earth system science?, Earth Science Reviews, 2006.
• B. Hilligoss et al., Developing a unifying framework of credibility assessment: Construct, heuristics, and interaction in context, Inf. Processing and Management, 2007.
• A. Karr et al., Data quality: A statistical perspective, Statistical Methodology, 2006.
• V. Kashyap, Trust, but Verify! Emergence, Trust, and Quality in Intelligent Systems, IEEE Intelligent Systems, 2006.
• A. Klein et al., Representing DQ in Sensor Data Streaming Events, ACM JDIQ, 2009.
• A. Klein et al., How to Optimize the Quality of Sensor Data Streams, Fourth International Multi-Conference on Computing in the Global Information Technology, 2009.
• A. Klein, G. Weikum et al., Representing Data Quality in Sensor Data Streaming Environments, JDIQ, 2009.
• S. Knight et al., Developing a Framework for Assessing IQ on the Web, Informing Science Journal, 2005.
• P. Li, D. Srivastava et al., Linking Temporal Records, VLDB 2011.
• D. MacDougall et al., Guidelines for Data Acquisition and Data Quality Evaluation in Environmental Chemistry, American Chemical Society, 1980.
• S. Madnick et al., Overview and Framework for Data and Information Quality Research, JDIQ 2009.
• P. Missier, The information quality problem in eSciences, PhD Thesis, 2007.
• P. Missier et al., Exploiting Provenance to Make Sense of Automated Decisions in Scientific Workflows, IPAW 2008.
• P. Missier et al., Quality Views: Capturing and Exploiting the User Perspective on DQ, VLDB 06.
• P. Missier et al., Managing IQ in eScience: the Qurator workbench, SIGMOD 2007.
• P. Missier et al., An Ontology-based Approach to Handling IQ in eScience, Report, 2007.
• A. Preece, P. Missier et al., Managing IQ in eScience using Semantic Web Technology, Report, 2008.
• R. Miller, Efficient Management of Inconsistent and Uncertain Data, Informal Proceedings of the Second International Workshop on Business Intelligence for the Real-Time Enterprise, BIRTE 2008.
• B. On, D. Srivastava, Group Linkage, 2007.
• J. Rao et al., A Deferred Cleansing Method for RFID Data Analytics, VLDB ‘06.
• N. Radziwill, Foundations for Quality Management of Scientific Data Products, Quality Management Journal, 2006.
• N. Radziwill, Quality Management of Astronomical Software and Data Systems, ADASS XVI, 2006.
• C. Rodriguez, F. Casati et al., Toward Uncertain Business Intelligence, IEEE Internet Computing, 2010.
• A. Risk et al., Review of Internet Health Information Quality Initiatives, Journal of Medical Internet Research, 2001.
• Y. Simmhan et al., Towards a Quality Model for Effective Data Selection in Collaboratories, 2006.
• M. Spaniol, G. Weikum et al., Data Quality in Web Archiving, WICOW 09.
• C. Sun et al., Study on DQ management and DQ control: a case study of the earth system science data sharing projects, 2009.
• H. Truong, S. Dustdar, On Analyzing and Specifying Concerns for Data as a Service, APSCC, 2009.
• H. Xueqin et al., Research on DQ of Chinese Medicine Scientific Data, Modernization of Trad. Chinese Medicine, 2009.
• M. Yakout, A. Elmagarmid et al., Guided Data Repair, VLDB 2011.
• B. Riley, Nature – Systems: trusting data’s quality, 2006.
• E. Wallis et al., Know Thy Sensor: Trust, Data Quality, and Data Integrity in Scientific Digital Libraries, Lecture Notes in Computer Science, 2007.
• S. Zahedi et al., A Computational Framework for QoI Analysis for Detection-Oriented Sensor Networks, MILCOM 2008.
• S. Zahedi et al., Information Quality Aware Sensor Network Services, Asilomar Conf., 2008.
• Y. Zhou et al., A SOA based DQ Assessment Framework in a Medical Science Center,