Workshop on Global Scientific Data Infrastructures: The...

1 Workshop on Global Scientific Data Infrastructures: The Big Data Challenges Capri 12-13 May 2011 Carlo Batini University of Milano Bicocca [email protected] Data Quality


Page 1: Workshop on Global Scientific Data Infrastructures: The ...datachallenges.isti.cnr.it/2011/files/Batini.pdf · Considered about 65 papers according to three evolutive trends. DB&IS


Workshop on Global Scientific Data Infrastructures:

The Big Data Challenges
Capri, 12-13 May 2011

Carlo Batini
University of Milano Bicocca

[email protected]

Data Quality


State of the art in 2006

State of the art in Data Quality for business/administrative data managed in relational data bases:

• Dimensions
• Metrics
• Models
• Techniques
• Methodologies for assessment and improvement

Considered about 65 papers according to three evolutionary trends

DB&IS data

Web data

Scientific data

2006-2011

1980-2011

2000-2011

Main issues in the literature around DQ

Type of data representation

DQ Dimension

Formal Technique/Formal model

Data source

Application domain

ICT Technology considered

Type of assessment/Improvement method

A lot of them!

A lot of them!

• Generic
• Administrative
• Business
• Aggregate DW
• Scientific
• Web
• UGC, Newspapers and TVs

• Structured data
• Semistructured d.
• Unstructured d.
• Specific sd. (e.g. laws)
• Maps
• Images
• Hybrid

• (Relational) DBMS’s
• Sensor Networks
• Receptors (RFID)
• Cloud/grid

• Biosciences & Genetics
• Chemistry
• Earth & Environment
• Health sciences
• Meteorology
• Neurobiology
• Zoology
• Oceanography
• Physics
• Universe & Astronomy
• Decision Support Systems
• Operational processes

The three worlds: DB&IS/Web/Scientific data together


Bridge-authors


DB & IS authors

SD authors

WEB authors

Missier et al.

Geerts

Srivastava

Widom

Elmagarmid

Bertino

Gray

Borgida & Mylopoulos

Batini Cappiello

Weikum


Areas of contributions for DB&IS, Web and Sc data

Issue: DB & IS, Web, ScD
Model: 26%, 22%, 22%
Technique/Algorithm: 20%, 38%, 28%
Methodology: 17%, 7%
Guidelines: 14%, 30%
Framework: 10%, 9%
Survey: 17%
Extension to new techn.: 5%
Analysis: 5%, 26%, 4%
Conjecture: 5%

Classifications: DB&IS vs SD DQ classifications

DB & IS data - 1
1. Intrinsic
2. Contextual
3. Representational
4. Accessibility

DB & IS data - 2
1. Product – Conforms to specs
2. Product – Meets consumer expectations
3. Service – Conforms to specs
4. Service – Meets consumer expectations

Scientific data
1. Experimental reading – timed
   a. Time dep. datum
   b. Time dep. datum relationship
2. Experimental reading – untimed
   a. Datum
   b. Datum relationship
   c. Instrument dependent datum
3. Experimental conditions – timed
   a. Time dep. instrument
   b. Time dep. instr. relationship
4. Experimental conditions – untimed
   a. Instrument
   b. Instrument relationship
   c. Datum-dependent datum

Dimensions in the three areas

DB & IS data (dimension, # of occurrences)
• Accuracy 7
• Completeness 4
• Currency 3
• Value 3
• Extensional joinability on tuples 2
• Relevance 2
• Reliability 2
• Timeliness 2
• Consistency 2
• Accessibility 1
• Believability 1
• Completeness of tuples 1
• Completeness of values 1
• Conditional functional dependencies 1
• Confidence 1
• Damage to the decision 1
• Extensional joinability on groups of tuples 1
• Identifiability 1
• Integrability 1
• Intensional joinability 1
• Interpretability 1
• Precision 1
• Rectifiability 1
• Relative information completeness 1
• Trust 1

Scientific data (dimension, # of occurrences)
• Metadata quality several
• Consistency 7
• Accuracy 6
• Not specified 4
• Completeness 4
• Currency 4
• Languages for expressing quality dimens. 4
• Sensor calibration 3
• Relevance 3
• Availability 3
• Timeliness 3
• Integrity 2
• Readability 2
• Reputation 2
• Validity 2
• Accessibility - ease of access 1
• Accessibility – efficiency 1
• Easy interchange 1
• Easy integration 1
• Interpretability 1
• Confidence 1
• Reliability 1
• Preservation 1
• Interoperability 1
• Self description 1
• Credibility 1
• Information value 1
• Classification correctness 1
• Completeness of the metadata 1
• Precision 1
• Outliers identification 1
• Link to publications 1
• of ontologies/annotations (?) 1
• Measurement error 1

Web data (dimension, # of occurrences)
• Coherence 4
• Completeness 4
• Trustworthiness 3
• Accuracy 3
• Timeliness 3
• Credibility 3
• Currency 2
• Accessibility 1
• Believability 1
• Blurriness 1
• Clarity 1
• Comprehensiveness 1
• Conciseness 1
• Consistency 1
• Convenience 1
• Descriptive power of the content 1
• Discriminative power of the content 1
• Functionality 1
• Maintenance 1
• Objectivity 1
• Reliability 1
• Sharpness 1
• Speed 1
• Usability 1
• Usefulness 1

Legend: in blue, new dimensions in the DB&IS area; in blue, new DQ dimensions; in red, Gray's dimensions


Novelties in DB&IS

DB&IS-2 Batini & Elmagarmid: Data Quality grounded in the more general framework of information economics

D&I quality

Information capacity

Information utility

D&I structure

D&I diffusion

Information value

Data structure

ICT Infrastructure

Business Process architecture

Costs

accuracy

Set of DBs

DI technologies

set of processes

∆ new queries

ROI

∆ new utility


DB&IS - 2 Batini et al. Fil rouge among DQ dimensions in the different representations and levels of information

Structured data instances


Unstructured data

Maps/Images

Logical/Conceptual Schemas

Ontologies

Batini – Tutorial ER Barcelona 2008

C. Batini, Tutorial ER Barcelona 2008
Map at a glance of dimensions/types of info

Quality Dimension Cluster | Structured data | Geographic Maps | Images | Unstructured Texts | Laws and legal frameworks

Correctness/Accuracy/Precision
• Schema: w.r.t. requirements; w.r.t. the model
• Instance: Syntactic; Semantic; Domain dependent (ex. Last Names, etc.)
• Instance: Spatial accuracy
  - Relative/Absolute/Relative Inter layer
  - Locally increased r.a./External/Internal
  - Neighbourhood a.
  - Vertical/Horizontal/Height
• Attribute accuracy; Domain dependent accuracy (ex. Traffic at critical inters., etc.); Acc. of raster representation
• Accuracy: Syntactic; Semantic; “Reduced” semantic
• Genuineness; Fidelity; Naturalness
• Accuracy: Syntactic; Semantic
• Structural similarity
• Accuracy; Precision; Objectivity; Integrity; Correctness; Reference accuracy

Completeness/Pertinence
• Schema: Completeness; Pertinence
• Instance: Value C., Tuple C., Column C., Relation C., Database C.
• Completeness (btw different datasets)
• Pertinence
• Completeness
• Completeness
• Objectivity; Completeness

Temporal
• Currency – Timeliness – Volatility
• Recency/Temporal accuracy/Temporal resolution

Minimality/Redundancy/Compactness/Cost
• Schema: Minimality; Redundancy
• Redundancy
• Minimality
• For a law: Conciseness
• For a legal framework: Minimality, Redundancy

Consistency/Coherence/Interoperability
• Instance: Intrarelational Consistency; Interrelational Consistency; Interoperability
• Consistency: Object consistency/Geometric consist.; Topological consist.
• Interoperability
• Interoperability
• Cohesion; Referential; Temporal, Locational, Causal, Structural
• Coherence; Lexical/Nonlexical
• Coherence; Consistency among laws; Consistency among legal frameworks

Readability/Comprehensibility/Usability/Usefulness/Interpretability
• Schema: Diagrammatic Readability; Compactness; Normalization
• Instance: Readability/Legibility; Clarity; Aesthetics
• Readability, Usefulness
• Readability; Comprehensibility
• Clarity; Simplicity

Accessibility…
• Instance: Technological; Channel; Physical (W3C)
• Instance: Privacy
• Physical Accessibility (W3C)
• Cultural Accessibility
• Accessibility of the consolidated Act on a given domain

Others
• Lineage
• Effectiveness; Lineage; Adaptation
• Effectiveness, Transparency; Usefulness; Applicability, Accountability
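The instance-level completeness entries in the map (Value C., Tuple C., Column C., Relation C.) can be illustrated with a minimal sketch. The formulas below are an assumption, not taken from the slides: a missing value is represented by `None`, and each measure is simply the fraction of non-missing values at the corresponding granularity. The function names and the sample rows are hypothetical.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def tuple_completeness(row: Row) -> float:
    """Fraction of non-missing values in one tuple."""
    return sum(v is not None for v in row) / len(row)


def column_completeness(rows: Sequence[Row], col: int) -> float:
    """Fraction of non-missing values in one column."""
    return sum(r[col] is not None for r in rows) / len(rows)


def relation_completeness(rows: Sequence[Row]) -> float:
    """Fraction of non-missing values over the whole relation."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r)
    return present / total


# Hypothetical relation: (last name, city, e-mail); None marks a missing value.
rows = [
    ("Rossi", "Milano", None),
    ("Bianchi", None, None),
    ("Verdi", "Roma", "verdi@example.org"),
]

first_tuple = tuple_completeness(rows[0])
city_column = column_completeness(rows, 1)
whole_relation = relation_completeness(rows)
```

Value completeness is the degenerate case of a single cell: it is either present or missing.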


DB&IS-3 Batini et al. Fil rouge among DQ dimensions in the different representations and levels of information

Structured data instances


Unstructured data

Maps/Images

Logical/Conceptual Schemas

Ontologies

Batini – Tutorial CAISE Hammamet 2010

DB&IS-4 New models/techniques for DQ

1. Geerts et al. - From OWA vs CWA to hybrid OWA/CWA (Master data): relative information completeness
2. Srivastava et al. – From record linkage to temporal RL and group RL
3. Elmagarmid et al. – Learning in data repair from user feedback
4. Borgida & Mylopoulos - Formal framework based on the theory of linguistic signs
5. Geerts et al. - Integrity constraints strike again… for repair, for object identification
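As a point of reference for item 2, plain pairwise record linkage (the baseline that temporal and group linkage extend) can be sketched as follows. This is an illustrative assumption, not the method of the cited authors: similarity is the average of per-field string similarities computed with the standard library's `difflib.SequenceMatcher`, and the threshold, field names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, fields, threshold=0.85):
    """Return (left index, right index, score) for pairs judged to match."""
    links = []
    for i, l_rec in enumerate(left):
        for j, r_rec in enumerate(right):
            # Average the per-field similarities into one match score.
            score = sum(similarity(l_rec[f], r_rec[f]) for f in fields) / len(fields)
            if score >= threshold:
                links.append((i, j, score))
    return links


# Hypothetical records from two sources describing people.
source_a = [{"name": "Carlo Batini", "city": "Milano"}]
source_b = [
    {"name": "Carlo Battini", "city": "Milano"},
    {"name": "Jim Gray", "city": "San Francisco"},
]

links = link_records(source_a, source_b, fields=("name", "city"))
```

Comparing every pair is quadratic in the number of records; practical systems usually add a blocking step first, which is omitted here for brevity.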

DQ & Scientific data: an unconscious marriage

Main technologies in SD

• Networks of sensors
• Cloud
• SOA
• Multicore computers
• Workflow languages


Data sources in scientific data

• Observation and measurement of phenomena through sensors (raw data, images, etc.)

• Databases
• Web


Types of environmental sensor networks


Types of data in DB&IS and SD

Types of data in a typical scientific data workflow

• Device monitoring data, for diagnostic and proactive maintenance purposes
• RD: Raw data, produced by instrumentation
• CD: Calibrated data, the result of removing instrumental and environmental effects
• DD: Derived data, the result of fusing CDs and/or other DDs
• AD: Assimilated data, the result of gridding, resampling, or changing the frame of reference of DDs
• Model data, produced by applying mathematical, statistical, or stochastic models to DDs or ADs

Types of DB&IS data in the MIT Total Data Quality Management Methodology

• Raw data
• Component data
• Information products
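The RD, CD and DD stages of the scientific workflow can be made concrete with a small sketch. Everything below is an assumption for illustration: calibration is modelled as removing a per-sensor offset and applying a gain, and derivation as averaging calibrated readings; the class, function and sensor names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float
    stage: str  # "RD" (raw), "CD" (calibrated) or "DD" (derived)


def calibrate(raw: Reading, offset: float, gain: float = 1.0) -> Reading:
    """RD -> CD: remove an instrumental offset and apply a gain."""
    return Reading(raw.sensor, (raw.value - offset) * gain, "CD")


def derive(calibrated: list) -> Reading:
    """CD -> DD: fuse several calibrated readings into one derived value."""
    return Reading("fused", mean(r.value for r in calibrated), "DD")


# Hypothetical raw temperature readings and per-sensor offsets.
raw_data = [Reading("t1", 21.0, "RD"), Reading("t2", 23.0, "RD")]
offsets = {"t1": 1.0, "t2": 3.0}

calibrated_data = [calibrate(r, offsets[r.sensor]) for r in raw_data]
derived = derive(calibrated_data)
```

Carrying the stage label with each value keeps the lineage visible, so a quality assessment of a DD can be traced back to the CDs and RDs it depends on.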

Types of data in…

Business and administrative data
• Raw/Component/Information Products
• Operational data
• Control data
• Decisional data

Scientific data
• Raw data
• Calibrated, curated/normalized data
• Derived/Assimilated, Model/Aggregated data
• Peer reviewed papers


Nine main characteristics of DQ in SD

1. Multiple, nested, interdependent combinations of RD, CD, DD, and ADs SIMILAR IN DB&IS-D

2. Quality is strongly dependent upon integrity of algorithms and production process. SIMILAR IN DB&IS-D

3. Quality depends not only on the state of the observing system, but also on the state of the observed system ≠ IN DB&IS-D

4. Continual refinement (for example, data cleansing) is often not performed, because the data product typically represents an observation at a given point in time ≠ IN DB&IS-D


Main characteristics wrt DQ in SD

5. Data cleaning has limited applicability because there is often no way to know the “correct” values. ≠ IN DB&IS-D
6. Repairing databases for duplicates or multiple data representations is typically not a concern. ≠ IN DB&IS-D
7. Data do not tend to become incorrect over time: the purpose of collecting the science data is to capture the state of the observed system at that time. ≠ IN DB&IS-D
8. DQ improves with time as new insights are gained regarding what constitutes quality in the product itself or within the production process. SIMILAR IN DB&IS-D
9. Requires metadata so consumers can make judgments about the validity and applicability of results. SIMILAR IN DB&IS-D

The quality life cycle in DB&IS/SD

Computer Science perspective
• Assess
• Fix new quality targets
• Choose improvement activities
  – Data driven (e.g. record linkage)
  – Process driven (e.g. BPR)
• Evaluate costs
• Improve
• Monitor

SD perspective for digital libraries of SD
• Equipment selection
• Equipment calibration
• Ground truthing
• Thresholding

SD perspective for RFID data stream cleaning
• Point
• Smooth
• Merge
• Arbitrate
• Virtualize

Statistical perspective
– Preliminary screening for DQ
– Exploratory analysis
– Data inspection & anomaly detection
– DQ improvement
  – Standardize
  – Remove duplicates
  – Edit


The contribution of Missier et al.

Basic assumption - The meta workflow in eScience has a standard structure, while specific activities may vary significantly

Typical data processing pipeline in biology

Challenges and Solutions

• Scientists don’t need “one-size-fits-all” data quality dimensions; rather, they need domain-dependent dimensions
• The quality workflow should be generated automatically from higher level specs
• DQ services should be reusable across scientific communities and adaptable
• A Quality View is defined that embodies the user scientists’ personal criteria for data acceptability.
• An ontology-based representation is proposed for the definition, making it possible to automatically compile Quality Views into executable quality processes that can be integrated with the users’ data processing workflow.
• Provenance metadata to enforce reusability and shareability.
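The idea of a Quality View as a user-defined acceptability criterion can be sketched as follows. This is a loose illustration and not the Qurator implementation or its ontology-based language: the `QualityView` class, the evidence keys `hit_ratio` and `coverage`, the scoring function and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Evidence = Dict[str, float]   # quality evidence attached to one data item
Scores = Dict[str, float]     # named quality assertions computed from evidence


@dataclass
class QualityView:
    name: str
    assertions: Dict[str, Callable[[Evidence], float]]
    accept: Callable[[Scores], bool]

    def apply(self, items: Dict[str, Evidence]) -> List[str]:
        """Return the ids of the items that this view deems acceptable."""
        accepted = []
        for item_id, evidence in items.items():
            scores = {n: f(evidence) for n, f in self.assertions.items()}
            if self.accept(scores):
                accepted.append(item_id)
        return accepted


# One scientist's personal criterion: average two evidence values, keep >= 0.6.
strict_view = QualityView(
    name="strict",
    assertions={"score": lambda e: 0.5 * e["hit_ratio"] + 0.5 * e["coverage"]},
    accept=lambda s: s["score"] >= 0.6,
)

# Hypothetical data items annotated with quality evidence.
items = {
    "P1": {"hit_ratio": 0.9, "coverage": 0.7},
    "P2": {"hit_ratio": 0.3, "coverage": 0.4},
}

accepted_ids = strict_view.apply(items)
```

Because the view is data rather than hard-coded logic, a second scientist can apply a different threshold or scoring function to the same evidence without touching the base workflow.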


Base workflow and quality view

Other relevant contributions in SD – Receptors and Sensor Networks

• Various cleaning strategies for data from RFID and other receptor/sensor networks (usually 30% of tag readings are dropped)
  – Improved strategies to fix the optimal window size in smoothing-filter-based cleaning techniques
  – Parametrization of the Point/Smooth/Merge/Arbitrate life cycle to other receptors & network stages (e.g. dropped messages)
  – Extension to outlier discovery
  – Early assessment of DQ in the network and propagation for user feedback
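Why the window size matters in smoothing-filter cleaning can be seen in a minimal sketch. It assumes a simple fixed-window filter, not the adaptive strategies of the cited work: a tag is reported present in an epoch if it was read at least once within the last `window` epochs. The function name and the sample stream are hypothetical.

```python
def smooth(readings, window):
    """Fixed-window smoothing of a 0/1 read stream for a single tag.

    readings[t] is 1 if the tag was read in epoch t, else 0.
    The output is 1 when any read occurred in the last `window` epochs.
    """
    smoothed = []
    for t in range(len(readings)):
        start = max(0, t - window + 1)
        smoothed.append(1 if any(readings[start:t + 1]) else 0)
    return smoothed


# Hypothetical stream: the tag is present throughout epochs 0-3 (with dropped
# reads), leaves, and is read again in epoch 6.
raw = [1, 0, 1, 0, 0, 0, 1]

narrow = smooth(raw, window=2)  # [1, 1, 1, 1, 0, 0, 1]
wide = smooth(raw, window=3)    # [1, 1, 1, 1, 1, 0, 1]
```

A wider window fills in more dropped readings but also keeps reporting a tag after it has left, which is the trade-off that motivates choosing the window size carefully.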

A lot of contributions on

• Guidelines
• Lessons learned
• Frameworks
• Shared Experiences
• Analyses
• And, finally,
• The policy of assessing scientific data with the same method as scientific papers, Peer Review (ex. Molecules)


Contributions on DQ and Web data


WD1 - Weikum et al. works on blurriness in Web archiving

1. Quality conscious scheduling strategies
In web archiving, crawlers should gather sharp captures of entire Web sites, but in practice crawls are performed while Web sites undergo changes. A stochastically optimal crawl algorithm is proposed.

WD2 – other relevant contributions

2. Trustworthiness (TR) and Credibility (CR) as first-class citizens
• TR: Assessment techniques where data TR and source TR are considered together (e.g. Bertino: data path among criteria)
• CR: Flanagin & Hilligoss present a unifying framework for credibility assessment
  – Across a variety of media and resources, and
  – For volunteered geographic information provision

3. Separate the wheat from the chaff
• Srivastava et al.: the role of copying/dependencies among sources in the Web
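The effect of copying among sources on a naive majority vote can be shown with a toy sketch. It is not the algorithm of the cited authors, who discover dependence between sources; here the copy relationships are assumed to be already known, and copiers are simply excluded from the vote. The function name, source names and claimed values are hypothetical.

```python
from collections import Counter


def vote(claims, copies_from=None):
    """Majority vote over claimed values, ignoring known copiers.

    claims maps a source name to the value it asserts.
    copies_from maps a copier to the source it copies (assumed known).
    """
    copies_from = copies_from or {}
    independent = {s: v for s, v in claims.items() if s not in copies_from}
    return Counter(independent.values()).most_common(1)[0][0]


# Hypothetical claims about one fact; C and D merely replicate B.
claims = {"A": "Milan", "E": "Milan", "B": "Rome", "C": "Rome", "D": "Rome"}

naive_winner = vote(claims)
discounted_winner = vote(claims, copies_from={"C": "B", "D": "B"})
```

With copiers counted, the replicated value wins three to two; once they are discounted, the two independent sources outvote the single original.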

The future

• More investigation on Trustworthiness & Credibility
• DQ in linked data
• A parametric methodology tailored to different data representations, dimensions, assessment and improvement activities
• Cost and benefits of Data Quality
• Assessment and improvement on the fly by Context & Provenance
• Data and Schema quality together

References

• J. Almeida et al., On the quality of Information for Web 2.0 Services, IEEE Internet Computing, 2010.
• I. Askira Gelman, Setting priorities for data accuracy improvements in satisficing decision-making scenarios: a guiding theory, Decision Support Systems, 2009.
• I. Askira Gelman, GIGO or not GIGO: the accuracy of multicriteria Satisficing Decisions, JDIQ, 2011.
• A. Avenali, C. Batini, Brokering infrastructure for minimum cost data procurement based on quality-quantity models, Decision Support Systems, 2008.
• C. Batini, M. Scannapieco, Data Quality: Concepts, methodologies, techniques, Springer Verlag, 2006.
• C. Batini, Quality of Data, Textual Information and Images: a comparative survey, Tutorial at ER Conference, Barcelona, 2008.
• C. Batini, D. Barone, F. Cabitza, G. Ciocca, F. Marini, G. Pasi, R. Schettini, Toward a Unified Model for Information Quality, QDB/MUD 2008: 113-122.
• C. Batini et al., Methodologies for data quality assessment and Improvement, Computing Surveys, 2009.
• C. Batini et al., A capacity and value based model for data architectures adopting integration technologies, AMCIS 2011.
• C. Batini et al., A Data Quality Methodology for Heterogeneous data, International Journal of Database Management System (IJDMS), Volume 3, Number 1, February 2011.
• L. Berti, D. Srivastava et al., Sailing the Information Ocean with awareness of currents: discovery and application of source Dependence, CIDR 2009.
• C. Borgman et al., Drowning in Data: Digital Library Architecture to support Scientific use of Embedded Sensor Networks, JCDL 2007.

• H. Bui et al., Experience with BX Grid: a data repository and computing grid for biometrics research, Cluster Comput, 2009.
• C. Cappiello et al., Do you Trust in DQ, Dagstuhl Seminar, 2003.
• C. Cappiello, B. Pernici et al., HIQM: a methodology for IQ Monitoring, Measurement and Improvement, ER Workshops, 2006.
• C. Cappiello et al., Information Quality in Mashups, IEEE Internet Computing, 2010.
• H. Chen et al., Leveraging Spatio-Temporal Redundancy for RFID Data cleansing, SIGMOD 2009.
• F. Chiang, R. Miller, Discovering Data Quality Rules, PVLDB 2008.
• M. Comerio et al., Service Oriented Quality Engineering and Data Publishing in the Cloud, SECO 2009.
• A. Corso-Radu et al., DQ Monitoring Framework for the ATLAS Experiment at the LHC, 2008.
• F. Daniel et al., Managing DQ in Business Intelligence application, Workshop at VLDB 2006.
• C. Dai, E. Bertino et al., An approach to evaluate Data Trustworthiness based on Data Provenance, SDM 2008.
• F. Daniel, F. Casati et al., Managing Data Quality in BI applications, VLDB workshops, 2008.
• D. Denev et al., SHARC: Framework for quality Conscious Web Archiving, VLDB 09, Lyon, France.
• M. Diepenbrock et al., PANGAEA: an information system for environmental sciences, Computers & Geosciences, 2002.
• W. Fan, F. Geerts, Capturing missing tuples and missing values, PODS 2010.
• W. Fan, F. Geerts, Relative information Completeness, ACM Transactions on Database Systems, 2010.

• W. Fan, F. Geerts, A revival of integrity constraints for data cleaning, Tutorial at VLDB 2008.
• W. Fan, F. Geerts, Determining the currency of Data, PODS 2011.
• J. Einbinder et al., Towards a framework for DQ in Health Care, ICIQ 2005.
• A. Flanagin et al., The credibility of volunteered geographic information, GeoJournal, 2008.
• I. Gallegos, A. Q. Gates, C. Tweedie, Toward Improving Environmental Sensor Data Quality: A Preliminary Property Categorization, 2010, unpublished.
• I. Gallegos, A. Q. Gates, C. Tweedie, DataPros: A data property Specification Tool to Capture Scientific Sensor Data Properties, ER Workshops, 2010.
• J. Gray et al., Scientific Data Management in the coming Decade, SIGMOD Record, 2005.
• S. Jeffery, J. Widom et al., A Pipelined framework for Online cleaning of Sensor Data Streams, 2005.
• S. Jeffery et al., Adaptive cleaning for RFID Data Streams, VLDB 06.
• L. Jiang, A. Borgida, J. Mylopoulos, Towards a compositional Semantic Account of DQ Attributes, ER 2008.
• J. Hart et al., Environmental Sensor Networks: a revolution in the earth system science?, Earth Science Reviews, 2006.
• B. Hilligoss et al., Developing a unifying framework of credibility assessment: Construct, heuristics, and interaction in context, Inf. Processing and Management, 2007.

• A. Karr et al., Data quality: A statistical perspective, Statistical Methodology, 2006.
• V. Kashyap, Trust, but Verify! Emergence, Trust, and Quality in Intelligent Systems, IEEE Intelligent Systems, 2006.
• A. Klein et al., Representing DQ in Sensor Data Streaming Events, ACM JDIQ, 2009.
• A. Klein et al., How to optimize the Quality of Sensor Data Streams, Fourth International Multi-Conference on Computing in the Global Information Technology, 2009.
• A. Klein, G. Weikum et al., Representing Data Quality in Sensor data streaming environments, JDIQ, 2009.
• S. Knight et al., Developing a framework for Assessing IQ on the Web, Informing Science Journal, 2005.
• P. Li, D. Srivastava et al., Linking Temporal Records, VLDB 2011.
• D. MacDougall et al., Guidelines for Data Acquisition and Data Quality Evaluation in environmental chemistry, American Chemical Society, 1980.
• S. Madnick et al., Overview and Framework for Data and Information Quality Research, JDIQ, 2009.
• P. Missier, The information quality problem in eSciences, PhD Thesis, 2007.

• P. Missier et al., Exploiting provenance to make sense of Automated Decisions in Scientific Workflows, IPAW 2008.
• P. Missier et al., Quality Views, Capturing and Exploiting the User Perspective on DQ, VLDB 06.
• P. Missier et al., Managing IQ in eScience: the Qurator workbench, SIGMOD 2007.
• P. Missier et al., An Ontology based Approach to Handling IQ in eScience, Report, 2007.
• A. Preece, P. Missier et al., Managing IQ in eScience using Semantic Web Technology, Report, 2008.
• R. Miller, Efficient Management of Inconsistent and uncertain Data, Informal Proceedings of the Second International Workshop on Business Intelligence for the Real-Time Enterprise, BIRTE 2008.
• B. On, D. Srivastava, Group Linkage, 2007.
• J. Rao et al., A Deferred Cleansing Method for RFID Data Analytics, VLDB ‘06.
• N. Radziwill, Foundations for Quality Management of Scientific Data product, Quality Management Journal, 2006.
• N. Radziwill, Quality Management of Astronomical Software and Data Systems, ADASS XVI, 2006.
• C. Rodriguez, F. Casati et al., Toward Uncertain business intelligence, IEEE Internet Computing, 2010.

• A. Risk et al., Review of Internet Health Information Quality Initiatives, Journal of Medical Internet Research, 2001.
• Y. Simmhan et al., Towards a quality model for Effective Data Selection in Collaboratories, 2006.
• M. Spaniol, G. Weikum et al., Data Quality in Web Archiving, WICOW 09.
• C. Sun et al., Study on DQ management and DQ control: a case study of the earth system science data sharing projects, 2009.
• H. Truong, S. Dustdar, On Analyzing and specifying Concerns for Data as a Service, APSCC, 2009.
• H. Xueqin et al., Research on DQ of Chinese Medicine Scientific Data, Modernization of Trad. Chinese Medicine, 2009.
• M. Yakout, A. Elmagarmid et al., Guided Data Repair, VLDB 2011.
• B. Riley, Nature – Systems: trusting data’s quality, 2006.
• E. Wallis et al., Know Thy sensor: trust, data quality, and Data Integrity in Scientific Digital libraries, Lecture Notes in Computer Science, 2007.
• S. Zahedi et al., A computational framework for QoI Analysis for Detection-Oriented Sensor Networks, MILCOM 2008.
• S. Zahedi et al., Information quality aware Sensor Network Services, Asilomar conf, 2008.
• Y. Zhou et al., A SOA based DQ Assessment Framework in a Medical Science Center,