Towards a Process Oriented View on Statistical Data Quality Michaela Denk, Wilfried Grossmann.




Contents

• Approaches Towards Data Quality
• Example Data Integration
• A Generic Statistical Workflow Model
• Quality Assessment
• Conclusions


Approaches Towards Data Quality

The usual approach towards data quality is the Reporting View: define a number of so-called quality dimensions and evaluate the final product according to criteria for these dimensions.

• Some frequently used dimensions: Accuracy, Relevance, Accessibility, Timeliness, Coherence, Comparability, ...


Approaches Towards Data Quality

These dimensions are often broken down into sub-dimensions.

• Example Accuracy: Sampling Effects, Representativity, Over-Coverage, Under-Coverage, Missing Values, Imputation Error, ...

Such an approach is fine as long as the production of data follows a predefined scheme with limited degrees of freedom.


Approaches Towards Data Quality

If there are a number of different options for data production, such an approach is probably not the best one.

Compare the ideas of Total Quality Management (TQM) in industrial production: systematic treatment of the influence of the different production steps on the quality of the final product.

We need a Processing View on data quality: How is data quality influenced by production?


Approaches Towards Data Quality

How can we arrive at a Processing View on data quality?

We need a statistical workflow model.

We have to organize the processing information necessary for quality assessment in an appropriate way.

• Compare (old) ideas of B. Sundgren about the capture of metadata


Approaches Towards Data Quality

We have to know functions for assessing quality:

Output_Quality = F(Input_Quality, Processing_Quality)

Such functions have to be applied according to

• The object we are interested in, e.g. a variable, a population or a classification
• The quality aspect we are interested in
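The relation Output_Quality = F(Input_Quality, Processing_Quality) can be illustrated with a toy sketch. The slides do not specify F; the multiplicative form, the [0, 1] quality scale and all names below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quality:
    """Quality of one object for one quality aspect (hypothetical [0, 1] scale)."""
    obj: str      # the object of interest, e.g. a variable or a population
    aspect: str   # the quality aspect, e.g. "accuracy"
    value: float  # 1.0 = perfect, 0.0 = useless


def propagate(input_quality: Quality, processing_quality: float) -> Quality:
    """Assumed form of F: a processing step preserves a fraction of the input quality."""
    if not 0.0 <= processing_quality <= 1.0:
        raise ValueError("processing_quality must lie in [0, 1]")
    return Quality(
        input_quality.obj,
        input_quality.aspect,
        input_quality.value * processing_quality,
    )


# Example: a variable with assumed input accuracy 0.98 passes through
# a processing step that is assumed to be 95% reliable.
gender_in = Quality("V1 (Gender)", "accuracy", 0.98)
gender_out = propagate(gender_in, 0.95)
print(gender_out)
```

Keeping the object and the aspect in the record mirrors the two points above: a different object or a different quality aspect would in general call for a different function F.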


Example Data Integration

Data integration occurs frequently in statistical data production, in particular when data are produced from administrative sources.

It uses a number of operations usually understood as data pre-processing.

Basic goal: Combine information from two or more already existing data sets.


Example Data Integration

Example for a data integration dataflow: Input → Integration → Post-alignment

[Figure: Source Data D1 contains V1 (Gender) and the matching key; Source Data D2 contains the matching key, V1 (Gender) and V2 (Status). The data after integration contain the matching key, V1 (Gender) from both sources and V2 (Status). The data after post-alignment contain V1 (Gender), the matching key and V2 (Status).]


Example Data Integration

Top level task description:

• Match the datasets according to the matching key
• Align V1 (Gender)
• Align V2 (Status)
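The three top-level tasks can be sketched in a few lines of Python. The records, the rule for resolving Gender and the fallback used in place of a real imputation model are all hypothetical; the slides leave these choices open.

```python
from typing import Optional

# Hypothetical source data sets, keyed by the matching key.
D1 = {101: {"gender": "F"}, 102: {"gender": "M"}, 103: {"gender": "F"}}
D2 = {
    101: {"gender": "F", "status": "employed"},
    102: {"gender": "F", "status": "retired"},
    104: {"gender": "M", "status": "student"},
}


def integrate(d1: dict, d2: dict) -> dict:
    """Match records on the key (outer join) and keep Gender from both sources."""
    merged = {}
    for key in sorted(d1.keys() | d2.keys()):
        merged[key] = {
            "gender_d1": d1.get(key, {}).get("gender"),
            "gender_d2": d2.get(key, {}).get("gender"),
            "status": d2.get(key, {}).get("status"),
        }
    return merged


def align_gender(g1: Optional[str], g2: Optional[str]) -> Optional[str]:
    """Assumed rule: accept agreeing or single values, flag conflicts as None."""
    if g1 is None:
        return g2
    if g2 is None or g1 == g2:
        return g1
    return None  # ambiguous, to be resolved by a separate method


def post_align(merged: dict, fallback_status: str = "unknown") -> dict:
    """Reduce to one Gender and one Status per record."""
    aligned = {}
    for key, rec in merged.items():
        aligned[key] = {
            "gender": align_gender(rec["gender_d1"], rec["gender_d2"]),
            # Placeholder for a real imputation model (assumption).
            "status": rec["status"] if rec["status"] is not None else fallback_status,
        }
    return aligned


result = post_align(integrate(D1, D2))
for key, rec in result.items():
    print(key, rec)
```

Record 102 shows an ambiguity in V1 (the sources disagree), and record 103 shows a missing V2 that needs imputation.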


Example Data Integration

Details, decisions to be made:

Are the datasets appropriate?
• Quality of matching keys
• Quality of data sources

Method for identification of matches?
Method for handling ambiguities in V1 (Gender)?
Method for imputation of V2 (Status)?

How is quality measured?
• At the level of a summary measure?
• At the level of a specific variable?
• At the level of individual records?


Example Data Integration

There are no generally accepted standard tools and methods for answering such questions.

Probably we have to compare a number of alternative approaches:

Apply the generic format to different datasets
Try different statistical methods and models
Use different methods for quality assessment
• Traditional formulas
• Simulation-based evaluation
• Assessment by using strategic surveys
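A simulation-based evaluation of alternative methods might look like the following sketch. The population model, the missingness rate and the two imputation rules (overall mode versus mode within a group) are invented for illustration and are not taken from the slides.

```python
import random
from collections import Counter


def simulate(n: int, rng: random.Random) -> list:
    """Generate (group, status) records from an assumed population model."""
    records = []
    for _ in range(n):
        group = rng.choice(["young", "adult"])
        p_student = 0.8 if group == "young" else 0.1
        status = "student" if rng.random() < p_student else "employed"
        records.append((group, status))
    return records


def mode(values) -> str:
    """Most frequent value."""
    return Counter(values).most_common(1)[0][0]


def evaluate(n: int = 10_000, missing_rate: float = 0.3, seed: int = 1):
    """Hide some true values, impute them two ways, return both error rates."""
    rng = random.Random(seed)
    truth = simulate(n, rng)
    is_missing = [rng.random() < missing_rate for _ in truth]
    observed = [rec for rec, m in zip(truth, is_missing) if not m]
    hidden = [rec for rec, m in zip(truth, is_missing) if m]

    # Method 1: impute the overall most frequent status.
    overall = mode(status for _, status in observed)
    # Method 2: impute the most frequent status within each group.
    by_group = {
        g: mode(status for grp, status in observed if grp == g)
        for g in ("young", "adult")
    }

    err_overall = sum(s != overall for _, s in hidden) / len(hidden)
    err_by_group = sum(s != by_group[g] for g, s in hidden) / len(hidden)
    return err_overall, err_by_group


err_overall, err_by_group = evaluate()
print(f"overall mode: {err_overall:.3f}, mode by group: {err_by_group:.3f}")
```

Because the true values are known in the simulation, the error rate of each method can be measured directly, which is what makes this kind of comparison possible.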


Example Data Integration

Conclusion:

Different statistical methods may be an essential part of data production and quality assessment.

There is no longer such a clear distinction between “objective” data collection and statistical analysis.

Statistics generates added value beyond (administrative) accounting and IT.


A Generic Statistical Workflow Model

Statistical Workflow: a mixture of
• Business Workflow (process oriented)
• Scientific Workflow (data oriented)

Quality evaluation is the main control element of the process.

We have to consider the workflow at two levels:
• Meta-level (control of the process)
• Data-level (production of data)


A Generic Statistical Workflow Model

Building blocks of the workflow model:

Transformations (basic data operations)

Process components (tasks), defined by:
• Task definition
• Pre-Alignment
• Feasibility Check
• Main Transformation
• Post-Alignment
• Quality Evaluation

Workflow (sequence of process components)
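The building blocks listed above can be mirrored in a small data structure. This is only one possible reading of the model: the class name, the callable signatures and the toy deduplication task are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class ProcessComponent:
    """One task of the workflow, built from the six parts named on the slide."""
    task_definition: str
    pre_alignment: Callable[[Any], Any]
    feasibility_check: Callable[[Any], bool]
    main_transformation: Callable[[Any], Any]
    post_alignment: Callable[[Any], Any]
    quality_evaluation: Callable[[Any], float]

    def run(self, data: Any) -> Tuple[Any, float]:
        data = self.pre_alignment(data)
        if not self.feasibility_check(data):
            raise ValueError(f"Task not feasible: {self.task_definition}")
        data = self.main_transformation(data)
        data = self.post_alignment(data)
        return data, self.quality_evaluation(data)


def run_workflow(components: List[ProcessComponent], data: Any):
    """A workflow is a sequence of process components; collect a quality report."""
    report = []
    for component in components:
        data, quality = component.run(data)
        report.append((component.task_definition, quality))
    return data, report


# Toy component (hypothetical): deduplicate gender codes.
dedup = ProcessComponent(
    task_definition="Deduplicate codes",
    pre_alignment=lambda xs: [x.strip().upper() for x in xs],
    feasibility_check=lambda xs: len(xs) > 0,
    main_transformation=lambda xs: list(dict.fromkeys(xs)),
    post_alignment=sorted,
    # Share of non-empty codes as a stand-in quality measure.
    quality_evaluation=lambda xs: sum(bool(x) for x in xs) / len(xs),
)

result, report = run_workflow([dedup], [" m", "F", "f ", ""])
print(result, report)
```

The quality report collected by `run_workflow` belongs to the meta-level of the process, while `data` is what flows at the data-level.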


A Generic Statistical Workflow Model

Example for a data integration component workflow


A Generic Statistical Workflow Model

In order to understand how statistics influences the boxes and data quality, let us zoom into the box for post-alignment.


Quality Assessment

For quality assessment we need a detailed description of the changes in meta-information during the dataflow.


Quality Assessment

Example for meta-information flow in data integration

Details for the register-based census are given in the presentation of Fiedler/Lenk in Session 26 (Thursday).


Quality Assessment

Example: Assessment of the accuracy of variables V1 (Gender) and V2 (Status) in the example


Quality Assessment

V1 (Gender)

Input
• Coincidence of matching keys in both datasets
• Matching of the variable Gender in both datasets
• Beliefs about the quality of the variable in both sources

Accuracy Assessment
• It seems that models developed in decision analysis (calculus from belief networks) are appropriate
• Alternatively, we can use a strategic sample to check whether our prior beliefs are correct and our decision rule is confirmed by statistical arguments
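As a minimal illustration of such a belief calculation, the sketch below applies Bayes' rule to two sources that each report Gender. It is a very simple special case of a belief network, not the authors' model; the source accuracies, the prior and the assumption of independent errors are all hypothetical.

```python
def posterior_female(report1: str, report2: str,
                     acc1: float, acc2: float,
                     prior_f: float = 0.5) -> float:
    """P(true gender is 'F' | both reports), assuming independent source errors."""

    def likelihood(report: str, acc: float, truth: str) -> float:
        # A source reports the true value with probability acc (assumed belief).
        return acc if report == truth else 1.0 - acc

    weight_f = prior_f * likelihood(report1, acc1, "F") * likelihood(report2, acc2, "F")
    weight_m = (1.0 - prior_f) * likelihood(report1, acc1, "M") * likelihood(report2, acc2, "M")
    return weight_f / (weight_f + weight_m)


# Assumed beliefs: source 1 is 99% accurate, source 2 is 90% accurate.
agree = posterior_female("F", "F", acc1=0.99, acc2=0.90)
conflict = posterior_female("F", "M", acc1=0.99, acc2=0.90)
print(f"sources agree: {agree:.4f}, sources conflict: {conflict:.4f}")
```

When the sources conflict, the posterior still favours the more reliable source, but with clearly less certainty than when they agree. A strategic sample could then be used to check whether the assumed accuracies are realistic.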


Quality Assessment

V2 (Status)

Input
• Coincidence of matching keys in both datasets
• Reliability of the model used for imputation
• Measurement technique for quality of imputation

Accuracy Assessment
• In this case we can apply traditional statistical techniques like false classification rate, ROC-curve, simulation
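The false classification rate and the ROC curve can be computed directly from true values and model scores. The following sketch uses only the standard library; the six records and their scores are made up, and a binary coding of Status (1 or 0) is assumed for simplicity.

```python
def false_classification_rate(truth, predicted) -> float:
    """Share of records whose imputed class differs from the true class."""
    return sum(t != p for t, p in zip(truth, predicted)) / len(truth)


def roc_points(truth, scores):
    """(FPR, TPR) pairs, using every distinct score as a threshold."""
    positives = sum(truth)
    negatives = len(truth) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(truth, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(truth, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical validation data: true class and imputation model score.
truth = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]

fcr = false_classification_rate(truth, predicted)
curve = roc_points(truth, scores)
print(f"false classification rate: {fcr:.3f}, AUC: {auc(curve):.3f}")
```

The false classification rate depends on the chosen threshold (0.5 here), whereas the ROC curve summarises the behaviour over all thresholds.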


Conclusions

We have presented a model that allows a tighter coupling of quality assessment to the data production process.

Such a model seems useful if data production has more degrees of freedom:
• What data should be used?
• What techniques should be used?

The approach allows identification of the different factors influencing quality.


Conclusions

It allows the formulation of precise questions about possible alternatives and defines new issues for research in statistical data quality.

Hopefully it helps to better understand the added value generated by statistics.


Thank you for your attention
