Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann.

Post on 17-Jan-2018


Towards a Process Oriented View on Statistical Data Quality

Michaela Denk, Wilfried Grossmann

Contents
Approaches Towards Data Quality
Example Data Integration
A Generic Statistical Workflow Model
Quality Assessment
Conclusions

2Grossmann, Denk

Approaches Towards Data Quality
The usual approach towards data quality is the Reporting View
Define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions
• Some frequently used dimensions: Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...


Approaches Towards Data Quality
These dimensions are often broken down into sub-dimensions
• Example Accuracy: Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ...
Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom


Approaches Towards Data Quality
If there are a number of different options for data production, such an approach is probably not the best one
Compare the ideas of Total Quality Management (TQM) in industrial production: systematic treatment of the influence of different production steps on the quality of the final product
We need a Processing View on data quality: How is data quality influenced by production?


Approaches Towards Data Quality
How can we arrive at a Processing View on data quality?
We need a statistical workflow model
We have to organize the processing information necessary for quality assessment in an appropriate way
• Compare (old) ideas of B. Sundgren about capture of metadata


Approaches Towards Data Quality
We have to know functions for assessing quality
Output_Quality = F(Input_Quality, Processing_Quality)
Such functions have to be applied according to
• The object we are interested in, e.g. a variable or a population or a classification
• The quality aspect we are interested in
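The relation Output_Quality = F(Input_Quality, Processing_Quality) can be illustrated with a minimal sketch. The slides do not specify F; the multiplicative form below, the `Quality` class and the function names are assumptions made only for illustration, for the quality aspect "accuracy" of a single variable.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Quality:
    """Accuracy of one object (e.g. a variable): share of correct values in [0, 1]."""
    accuracy: float


def output_quality(input_quality: Quality, processing_quality: Quality) -> Quality:
    # Hypothetical choice of F: a value is correct after the step only if it was
    # correct before the step and the step did not damage it (independence assumed).
    return Quality(input_quality.accuracy * processing_quality.accuracy)


def chain(input_quality: Quality, steps: Iterable[Quality]) -> Quality:
    # Apply F once per production step, feeding each output into the next step.
    result = input_quality
    for step in steps:
        result = output_quality(result, step)
    return result


# Hypothetical numbers: a source with accuracy 0.9 passing through two steps.
final = chain(Quality(0.9), [Quality(0.5), Quality(1.0)])
```

Other objects (a population, a classification) or other quality aspects would probably need a different F, which is the point of the two bullets above.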


Example Data Integration
Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources
It uses a number of operations usually understood as data pre-processing
Basic goal: Combine information from two or more already existing data sets


Example Data Integration
Example for a Data Integration Dataflow: Input → Integration → Post-alignment
[Figure: Source Data D1 (V1 (Gender), Matching Key) and Source Data D2 (Matching Key, V1 (Gender), V2 (Status)) → Data after Integration (V1 (Gender), Matching Key, V1 (Gender), V2 (Status)) → Data after Post-alignment (V1 (Gender), Matching Key, V2 (Status))]

Example Data Integration
Top level task description
Match the datasets according to the matching key
Align V1 (gender)
Align V2 (status)
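The three tasks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the flag names, the inner join on the key and the toy data are all invented here and are not taken from the presentation.

```python
def integrate(d1, d2):
    """Match records of D1 and D2 on the matching key (inner join, an assumption).

    d1 maps key -> {"gender": ...}; d2 maps key -> {"gender": ..., "status": ...}.
    The integrated record keeps both gender values side by side.
    """
    return {
        key: {
            "gender_d1": d1[key].get("gender"),
            "gender_d2": d2[key].get("gender"),
            "status": d2[key].get("status"),
        }
        for key in sorted(d1.keys() & d2.keys())
    }


def post_align(integrated):
    """Align V1 (gender) and flag V2 (status) values that still need imputation."""
    aligned = {}
    for key, rec in integrated.items():
        g1, g2 = rec["gender_d1"], rec["gender_d2"]
        if g1 is not None and g2 is not None and g1 != g2:
            gender, gender_flag = None, "conflict"  # ambiguity: needs a decision rule
        elif g1 is None and g2 is None:
            gender, gender_flag = None, "missing"
        else:
            gender = g1 if g1 is not None else g2
            gender_flag = "agree" if g1 == g2 else "single_source"
        aligned[key] = {
            "gender": gender,
            "gender_flag": gender_flag,
            "status": rec["status"],
            "status_flag": "observed" if rec["status"] is not None else "to_impute",
        }
    return aligned


# Hypothetical toy data, invented for illustration only.
d1 = {1: {"gender": "F"}, 2: {"gender": "M"}, 3: {"gender": "F"}}
d2 = {
    1: {"gender": "F", "status": "employed"},
    2: {"gender": "F", "status": None},
    4: {"gender": "M", "status": "retired"},
}
result = post_align(integrate(d1, d2))
```

The flags carry the processing information forward: a later quality assessment can count conflicts and values to impute instead of looking only at the final table.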


Example Data Integration
Details, Decisions to be made
Are datasets appropriate?
• Quality of matching keys
• Quality of data sources
Method for identification of matches?
Method for handling ambiguities in V1 (Gender)?
Method for imputation of V2 (Status)?
How is quality measured?
• At level of a summary measure?
• At level of a specific variable?
• At level of individual records?


Example Data Integration
There are no generally accepted standard tools and methods for answering such questions
Probably we have to compare a number of alternative approaches
Apply the generic format for different datasets
Try different statistical methods and models
Use different methods for quality assessment
• Traditional formulas
• Simulation based evaluation
• Assessment by using strategic surveys


Example Data Integration
Conclusion
Different statistical methods may be an essential part of data production and quality assessment
There is no longer such a clear distinction between “objective” data collection and statistical analysis
Statistics generates added value beyond (administrative) accounting and IT


A Generic Statistical Workflow Model
Statistical Workflow: A mixture of Business Workflow (Process oriented) and Scientific Workflow (Data oriented)
Quality evaluation is the main control element of the process
We have to consider the workflow at two levels: Meta-level (Control of the process) and Data-level (Production of data)


A Generic Statistical Workflow Model
Building blocks of the workflow model
Transformations (Basic data operations)
Process components (Tasks) defined by:
• Task definition
• Pre-Alignment
• Feasibility Check
• Main Transformation
• Post-Alignment
• Quality Evaluation
Workflow (Sequence of Process components)
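One way to picture these building blocks is as a small data structure in which each process component bundles the six parts listed above and a workflow is a sequence of such components. The class, field and function names below are hypothetical and the example task is invented; the sketch shows one possible reading of the model, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Data = Any
Meta = Dict[str, Any]


@dataclass
class ProcessComponent:
    task_definition: str
    pre_alignment: Callable[[Data], Data]
    feasibility_check: Callable[[Data], bool]
    main_transformation: Callable[[Data], Data]
    post_alignment: Callable[[Data], Data]
    quality_evaluation: Callable[[Data], Meta]

    def run(self, data: Data) -> Tuple[Data, Meta]:
        # Data-level: transform the data; meta-level: decide whether to proceed
        # and record quality information about the result.
        data = self.pre_alignment(data)
        if not self.feasibility_check(data):
            raise ValueError(f"task not feasible: {self.task_definition}")
        data = self.post_alignment(self.main_transformation(data))
        return data, self.quality_evaluation(data)


def run_workflow(components: List[ProcessComponent], data: Data):
    """Run the components in sequence and collect their quality reports."""
    meta_log = []
    for component in components:
        data, meta = component.run(data)
        meta_log.append((component.task_definition, meta))
    return data, meta_log


# Hypothetical component: clean a list of matching keys and remove duplicates.
dedupe = ProcessComponent(
    task_definition="remove duplicate keys",
    pre_alignment=lambda keys: [k.strip().upper() for k in keys],
    feasibility_check=lambda keys: len(keys) > 0,
    main_transformation=lambda keys: set(keys),
    post_alignment=lambda keys: sorted(keys),
    quality_evaluation=lambda keys: {"n_records": len(keys)},
)
data, log = run_workflow([dedupe], [" a1", "A1 ", "b2"])
```

Keeping the quality report separate from the data mirrors the two levels of the workflow: the report belongs to the meta-level, the transformed list to the data-level.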


A Generic Statistical Workflow Model
Example for Data Integration Component Workflow


A Generic Statistical Workflow Model
In order to understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment


Quality Assessment
For quality assessment we need a detailed description of the changes in meta-information during the dataflow


Quality Assessment
Example for meta-information flow in data integration
Details for register based census in the presentation of Fiedler/Lenk in Session 26 (Thursday)


Quality Assessment
Example: Assessment of accuracy of variables V1 (Gender) and V2 (Status) in the example


Quality Assessment
V1 (Gender)
Input
• Coincidence of matching keys in both datasets
• Matching of the variable Gender in both datasets
• Beliefs about quality of the variable in both sources
Accuracy Assessment
• It seems that models developed in decision analysis (calculus from belief networks) are appropriate
• Alternatively we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments
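As a hedged illustration of the belief-based idea, the sketch below combines prior beliefs about the reliability of the two sources with Bayes' rule, which is the elementary calculation behind a very small belief network. The two-valued coding "F"/"M", the independence of the sources, the reliabilities and the function name are assumptions made for this example only.

```python
def posterior_female(report1, report2, r1, r2, prior_f=0.5):
    """Posterior probability that the true value is "F" given two source reports.

    r1 and r2 are the believed probabilities that source 1 and source 2 record
    the true value; errors in the two sources are assumed to be independent.
    """
    def likelihood(true_value):
        p = 1.0
        for report, reliability in ((report1, r1), (report2, r2)):
            p *= reliability if report == true_value else 1.0 - reliability
        return p

    numerator = prior_f * likelihood("F")
    denominator = numerator + (1.0 - prior_f) * likelihood("M")
    return numerator / denominator


# Hypothetical beliefs: source 1 is right 90% of the time, source 2 80%.
agree = posterior_female("F", "F", r1=0.9, r2=0.8)      # both sources say "F"
conflict = posterior_female("F", "M", r1=0.9, r2=0.8)   # sources disagree
```

A decision rule could then keep the value with the larger posterior and report that posterior as a record-level accuracy belief; a strategic sample, as mentioned above, would serve to check whether the assumed reliabilities are realistic.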


Quality Assessment
V2 (Status)
Input
• Coincidence of matching keys in both datasets
• Reliability of the model used for imputation
• Measurement technique for quality of imputation
Accuracy Assessment
• In this case we can apply traditional statistical techniques like false classification rate, ROC-curve, simulation
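The two traditional measures named above can be computed with a few lines of plain Python once true and imputed values are available, for example on records where the status is known and was masked for a simulation. The function names, labels and scores below are hypothetical and only serve as an illustration.

```python
def false_classification_rate(true_labels, imputed_labels):
    """Share of records whose imputed status differs from the true status."""
    if not true_labels or len(true_labels) != len(imputed_labels):
        raise ValueError("need two non-empty label lists of equal length")
    wrong = sum(t != p for t, p in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


def roc_points(true_labels, scores, positive):
    """(false positive rate, true positive rate) for each score threshold.

    scores are model scores for the class `positive`; both classes must occur.
    """
    n_pos = sum(t == positive for t in true_labels)
    n_neg = len(true_labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == positive for t, s in zip(true_labels, scores))
        fp = sum(s >= threshold and t != positive for t, s in zip(true_labels, scores))
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical evaluation data, invented for illustration only.
truth = ["employed", "employed", "unemployed", "unemployed"]
imputed = ["employed", "unemployed", "unemployed", "unemployed"]
scores = [0.9, 0.4, 0.6, 0.1]  # model score for "employed"

rate = false_classification_rate(truth, imputed)
curve = roc_points(truth, scores, positive="employed")
```

Plotting the points in `curve` gives the ROC-curve of the imputation model; repeating the calculation over many simulated masks would give a simulation based estimate of the false classification rate.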


Conclusions
We have presented a model which allows tighter coupling of quality assessment to the data production process
Such a model seems useful if data production has more degrees of freedom
What data should be used?
What techniques should be used?
The approach allows identification of the different factors influencing quality


Conclusions
It allows formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality
Hopefully it helps to better understand the added value generated by statistics


Thank you for your attention
