Introduction to Data Quality

Pradeeban Kathiravelu

INESC-ID Lisboa, Instituto Superior Técnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality: Presentation 1. March 19, 2015.


Introduction


Data is an important asset for organizations.

Data warehouses and exploration tools depend on data quality.


Data Quality Problems


1. DQ Problems within a Single Data Source

1.1. DQ Problems within a Single Relation


1.1.1. An Attribute Value of a Single Tuple

Missing value.

Syntax violation.

Outdated value.

Interval violation.

Set violation.

Misspelling error.

Inadequate value to the attribute context.

Value items beyond the attribute context.

Meaningless value.

Value with imprecise or doubtful meaning.

Domain constraint violation.
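
A few of the single-value problems above can be sketched as simple checks. This is only an illustration: the attribute names (`birth_date`, `age`, `gender`) and the rules attached to them are assumptions, not taken from the slides.

```python
import re

# Hypothetical rules for three illustrative attributes (assumptions).
DATE_SYNTAX = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected syntax: YYYY-MM-DD
AGE_INTERVAL = (0, 120)                            # allowed interval for "age"
GENDER_SET = {"F", "M"}                            # allowed set for "gender"


def check_value(attribute, value):
    """Return the name of the first problem found in a single value, or None."""
    if value is None or value == "":
        return "missing value"
    if attribute == "birth_date" and not DATE_SYNTAX.match(value):
        return "syntax violation"
    if attribute == "age" and not (AGE_INTERVAL[0] <= value <= AGE_INTERVAL[1]):
        return "interval violation"
    if attribute == "gender" and value not in GENDER_SET:
        return "set violation"
    return None
```

Problems such as outdated or meaningless values usually need external knowledge and are not covered by checks of this kind.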


Detecting DQ Problems


1.1.2. The Values of a Single Attribute

Uniqueness violation.

Synonyms existence.

Domain constraint violation.
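
A uniqueness check over the values of one attribute might look like the sketch below. The function name and the choice to ignore missing values are assumptions made for the example.

```python
from collections import Counter


def uniqueness_violations(values):
    """Return the values that occur more than once in a column meant to be unique.

    Missing values (None) are ignored here; that is an assumption of this sketch.
    """
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)
```

Detecting synonyms, by contrast, generally requires a dictionary or domain knowledge rather than a count.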


Detecting DQ Problems


1.1.3. The Attribute Values of a Single Tuple

Semi-empty tuple.

Inconsistency among attribute values.

Domain constraint violation.
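
Both tuple-level problems can be illustrated with a short sketch. The 50% threshold for "semi-empty" and the rule that a shipping date must not precede the order date are assumptions chosen for the example, as are the attribute names.

```python
from datetime import date


def semi_empty(row, threshold=0.5):
    """True if at least `threshold` of the tuple's attributes are empty.

    The threshold value is an assumption; in practice it would be user-defined.
    """
    empty = sum(1 for v in row.values() if v in (None, ""))
    return empty / len(row) >= threshold


def inconsistent_dates(row):
    """Hypothetical consistency rule: 'shipped' must not precede 'ordered'."""
    return row["shipped"] < row["ordered"]
```

For instance, `inconsistent_dates({"ordered": date(2015, 3, 19), "shipped": date(2015, 3, 1)})` flags the tuple, since the shipment is dated before the order.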


Detecting DQ Problems


1.1.4. The Attribute Values of Several Tuples

Redundancy about an entity.

Inconsistency about an entity.

Domain constraint violation.
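
As a rough sketch, redundancy and inconsistency about an entity can be told apart by grouping tuples on a normalized key and comparing one attribute within each group. The normalization (trimming and lowercasing) and the attribute names in the usage note are assumptions; real duplicate detection typically needs approximate matching.

```python
from collections import defaultdict


def classify_entities(rows, key, attribute):
    """Map each entity that appears in several tuples to the problem it shows."""
    groups = defaultdict(list)
    for row in rows:
        # Assumed normalization: ignore case and surrounding whitespace.
        groups[row[key].strip().lower()].append(row)

    result = {}
    for entity, group in groups.items():
        if len(group) < 2:
            continue  # a single tuple cannot be redundant
        distinct = {row[attribute] for row in group}
        # Same value everywhere: redundancy. Conflicting values: inconsistency.
        result[entity] = "redundancy" if len(distinct) == 1 else "inconsistency"
    return result
```

Two tuples for "Ana Silva" with the same city would be reported as redundancy, while two tuples for "Rui Costa" with different cities would be reported as inconsistency.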


Detecting DQ Problems


1.2. Relationships among Multiple Relations

Referential integrity violation.

Outdated reference.

Syntax inconsistency.

Inconsistency among related attribute values.

Circularity among tuples in a self-relationship.

Domain constraint violation.
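
Two of these problems lend themselves to a compact sketch: a referential integrity check and a circularity check on a self-relationship. The function names and the dictionary representation of the self-relationship are assumptions for illustration.

```python
def dangling_references(child_rows, fk, parent_keys):
    """Return child tuples whose foreign key points to no existing parent key."""
    return [
        row for row in child_rows
        if row[fk] is not None and row[fk] not in parent_keys
    ]


def has_cycle(parent_of):
    """True if following parent links from any key ever revisits a key.

    `parent_of` maps each key to its parent key, or to None for a root.
    """
    for start in parent_of:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = parent_of.get(node)
    return False
```

An example of circularity would be a product that is recorded as a sub-product of one of its own sub-products.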


Detecting DQ Problems


2. Multiple Data Sources

Syntax inconsistency.

Different measure units.

Representation inconsistency.

Different aggregation levels.

Synonyms existence.

Homonyms existence.

Redundancy about an entity.

Inconsistency about an entity.

Domain constraint violation.
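
To illustrate the "different measure units" problem, the sketch below converts values from two sources to a common unit before comparing them. The conversion table, the tolerance, and the function names are assumptions; the slides do not prescribe a method.

```python
# Hypothetical conversion factors to kilograms (assumed for this example).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kg(value, unit):
    """Convert a (value, unit) measurement to kilograms."""
    return value * TO_KG[unit]


def same_weight(a, b, tolerance=0.01):
    """Compare two (value, unit) pairs from different sources after conversion."""
    return abs(to_kg(*a) - to_kg(*b)) <= tolerance
```

Without the conversion step, `(2.0, "kg")` from one source and `(2000, "g")` from another would look like an inconsistency about the same entity.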


Detecting DQ Problems


Data Cleaning Problems (Rahm and Do)


Phases of Data Cleaning

Data analysis.

Data profiling.

Data mining.

Descriptive data mining models.

Clustering, summarization, association discovery and sequence discovery.

Definition of transformation workflow and mapping rules.

Verification.

Transformation.

Backflow of cleaned data.
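
The data profiling step of the analysis phase can be sketched as a per-column summary. The statistics chosen here (count, missing, distinct) are an assumption about what a minimal profile might contain; profiling tools typically report many more.

```python
def profile(rows):
    """Summarize each column: number of values, missing values, distinct values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"count": 0, "missing": 0, "distinct": set()}
            )
            stats["count"] += 1
            if value in (None, ""):
                stats["missing"] += 1
            else:
                stats["distinct"].add(value)
    # Report the number of distinct values rather than the values themselves.
    return {
        name: {
            "count": s["count"],
            "missing": s["missing"],
            "distinct": len(s["distinct"]),
        }
        for name, s in columns.items()
    }
```

A column with many missing values, or with far fewer distinct values than expected, is a candidate for closer inspection before transformation rules are defined.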


Tool Support

Data analysis and reengineering tools.

Data profiling - MigrationArchitect.

Data mining - WizRule and DataMiningSuite.

Data reengineering - Integrity.

Specialized cleaning tools.

Special domain cleaning - idCentric, PureIntegrate, QuickAddress, Reunion, and Trillium.

Duplicate elimination - DataCleanser, Merge/PurgeLibrary, matchIT, and MasterMerge.

ETL (Extraction, Transformation, Loading) tools.

CopyManager, DataStage, Extract, PowerMart, DecisionBase, DataTransformationService, MetaSuite, SagentSolution, and WarehouseAdministrator.


Conclusions

Identification, classification and systematization of DQ problems.

Taxonomy using a bottom-up approach.

Definition of methods to detect DQ problems, represented as binary classification trees.

Thank you! Questions?


References

Oliveira, P., Rodrigues, F., Henriques, P., & Galhardas, H. (2005, June). A taxonomy of data quality problems. In Proc. 2nd Int. Workshop on Data and Information Quality (in conjunction with CAiSE 2005), Porto, Portugal.

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.

Barateiro, J., & Galhardas, H. (2005). A survey of data quality tools. Datenbank-Spektrum, 14, 15-21.

Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7, 81-99.

Müller, H., & Freytag, J.-C. (2003). Problems, methods, and challenges in comprehensive data cleansing. Technical Report HUB-IB-164, Humboldt University, Berlin.
