1
Managing Data Quality
Dr Richard White
Original version by Dr Mikhaila Burgess
School of Computer Science & Informatics
Cardiff University
2
Session overview
What is quality?
What is Data Quality (DQ)? And why is it important anyway?
  Potential impact of poor DQ
Defining Data Quality
Designing for Quality Data
  Ensuring DQ in databases
So what goes wrong?
  Potential causes of poor DQ
Managing DQ
… and some exercises
Designing for Quality Data
Ensuring a level of quality in your databases
4
Database Quality: Design
Designed to meet requirements
Data entry process – data recorded that meets requirements
Normalised DB design
  1NF: no duplicate records, one candidate key
  2NF: 1NF, attributes only dependent on key
  3NF: …
Access restrictions
Data Integrity
5
Data Integrity (a very brief overview)
The validity and consistency of stored data
Integrity constraints
  protect the database from becoming inconsistent
  or, rules that the database is not permitted to violate
Constraints can be placed on
  individual data items
  relationships between tables
Time of application of constraints
  for example, on data entry
6
Integrity Constraints
Some types of integrity constraints
We’ll look at 4 types of integrity constraints
  Entity integrity
  Attribute domain constraints
  Referential integrity
  Business rules
Type, Size, Values, Range, Not Null, Unique
7
IC1: Entity Integrity
Some fields must always contain data
Key constraints
  Each row in a table identified by a unique key
  Key fields cannot contain null values
  Primary keys must be unique
Examples
  Car number plate
  Engine serial number
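The key constraints above can be tried out in a few lines. This is only a sketch, assuming SQLite via Python's built-in sqlite3 module; the car table, its column names and the sample values are hypothetical and simply follow the slide's number plate and engine serial number examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table. NOT NULL is stated explicitly because SQLite does not
# imply it for a TEXT primary key.
conn.execute(
    "CREATE TABLE car ("
    " number_plate TEXT PRIMARY KEY NOT NULL,"
    " engine_serial TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO car VALUES ('CA51 ABC', 'ENG-0001')")


def try_insert(plate, serial):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO car VALUES (?, ?)", (plate, serial))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_key_ok = try_insert("CA51 ABC", "ENG-0002")  # same plate again
null_key_ok = try_insert(None, "ENG-0003")             # key value missing
new_row_ok = try_insert("CF02 XYZ", "ENG-0004")        # fresh, complete key
print(duplicate_key_ok, null_key_ok, new_row_ok)       # False False True
```

Only the third insert is stored: the first breaks uniqueness and the second leaves the key field null.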
8
IC2: Attribute Domain Constraints
Some attributes can only take specific values …
Restrictive?
Gender
  Male or Female
  M or F
  Y (not allowed!)
Title
  Mr / Mrs / Ms
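One common way to express a domain constraint is a CHECK clause. The sketch below assumes SQLite through Python's sqlite3 module; the person table and the sample names are hypothetical, while the permitted values are the ones listed on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: each CHECK lists the only values the attribute may take.
conn.execute(
    "CREATE TABLE person ("
    " name TEXT NOT NULL,"
    " title TEXT CHECK (title IN ('Mr', 'Mrs', 'Ms')),"
    " gender TEXT CHECK (gender IN ('M', 'F')))"
)


def accepted(name, title, gender):
    """Return True if the row satisfies every domain constraint."""
    try:
        conn.execute("INSERT INTO person VALUES (?, ?, ?)", (name, title, gender))
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = accepted("A. Jones", "Ms", "F")
bad_gender_ok = accepted("B. Smith", "Mr", "Y")   # 'Y' is outside the domain
bad_title_ok = accepted("C. Patel", "Dr", "F")    # 'Dr' is not in the list
print(valid_ok, bad_gender_ok, bad_title_ok)      # True False False
```

The rejected 'Dr' row shows why the slide asks whether such a constraint is too restrictive: a perfectly real title is refused because nobody listed it.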
9
IC3: Referential Integrity
Rules enforcing consistency in relationships between tables in a database
Primary and Foreign keys
  Every non-null foreign key value must match a primary key value in the table it references.
  If a foreign key in one table refers to a specific row in another table, that row must exist.
  There should be no invalid references
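A small sketch of an invalid reference being refused, assuming SQLite through Python's sqlite3 module. The department and employee tables are hypothetical. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " dept_id INTEGER REFERENCES department(dept_id))"
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (100, 1)")  # refers to an existing row

try:
    conn.execute("INSERT INTO employee VALUES (101, 99)")  # there is no department 99
    dangling_accepted = True
except sqlite3.IntegrityError:
    dangling_accepted = False

print(dangling_accepted)  # False
```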
10
Deleting a Primary Key
Referential integrity not lost if no foreign key references exist.
If foreign key references DO exist in the database, several possible actions:
  NO ACTION – do not allow deletion of the record
  CASCADE – allow the deletion, and automatically delete all referencing rows
  SET NULL – set all the referencing foreign keys to null
  SET DEFAULT – set the referencing foreign keys to a default value
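The four actions can be compared side by side. This sketch assumes SQLite through Python's sqlite3 module, with hypothetical department and employee tables; department 0 exists only so that SET DEFAULT has a valid row to fall back on.

```python
import sqlite3


def referencing_rows_after_delete(action):
    """Delete a referenced row under the given ON DELETE action; return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE employee ("
        " emp_id INTEGER PRIMARY KEY,"
        " dept_id INTEGER DEFAULT 0"
        f" REFERENCES department(dept_id) ON DELETE {action})"
    )
    conn.executemany("INSERT INTO department VALUES (?)", [(0,), (1,)])
    conn.execute("INSERT INTO employee VALUES (100, 1)")
    try:
        conn.execute("DELETE FROM department WHERE dept_id = 1")
    except sqlite3.IntegrityError:
        return "deletion refused"
    return conn.execute("SELECT emp_id, dept_id FROM employee").fetchall()


results = {
    action: referencing_rows_after_delete(action)
    for action in ("NO ACTION", "CASCADE", "SET NULL", "SET DEFAULT")
}
for action, outcome in results.items():
    print(action, "->", outcome)
```

Under these assumptions NO ACTION refuses the deletion, CASCADE leaves no employee rows, SET NULL leaves (100, None) and SET DEFAULT leaves (100, 0).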
11
IC4: Business Rules
Also called ‘enterprise constraints’
Data constraints specific to organisation
Examples
  Manager can only be responsible for up to 20 people
  Library: maximum of 2 short-loan books at any one time
  Bowling: Max of 26 lanes, and 8 people per lane
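Business rules usually need more than a column constraint. One possible approach, sketched here for the library example, is a database trigger. The sketch assumes SQLite through Python's sqlite3 module; the short_loan table, the trigger name and the member and book numbers are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per short-loan book currently on loan.
conn.execute(
    "CREATE TABLE short_loan (member_id INTEGER NOT NULL, book_id INTEGER NOT NULL)"
)
# The trigger refuses a new loan when the member already holds two.
conn.execute(
    "CREATE TRIGGER max_two_short_loans BEFORE INSERT ON short_loan "
    "WHEN (SELECT COUNT(*) FROM short_loan WHERE member_id = NEW.member_id) >= 2 "
    "BEGIN SELECT RAISE(ABORT, 'limit of 2 short-loan books reached'); END"
)


def borrow(member_id, book_id):
    """Return True if the loan is recorded, False if the business rule blocks it."""
    try:
        conn.execute("INSERT INTO short_loan VALUES (?, ?)", (member_id, book_id))
        return True
    except sqlite3.DatabaseError:
        return False


outcomes = [borrow(7, 1), borrow(7, 2), borrow(7, 3), borrow(8, 3)]
print(outcomes)  # [True, True, False, True]
```

Member 7 is stopped at the third book, while member 8 can still borrow. The same rule could equally be checked in application code; the trigger simply keeps it next to the data.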
12
Data Consistency
A consistent database: all integrity constraints are satisfied
Two possible approaches
  1. Only allow data into the database if it is valid and meets integrity constraints
  2. Allow data into the database, then check/clean later
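The difference between the two approaches can be sketched in plain Python. Everything here is illustrative: the record layout, the two rules inside violations() and the sample data are assumptions chosen to echo the earlier constraint examples.

```python
def violations(record):
    """Return the integrity constraints a record breaks (an empty list means valid)."""
    problems = []
    if not record.get("plate"):
        problems.append("missing key")
    if record.get("title") not in ("Mr", "Mrs", "Ms"):
        problems.append("title outside domain")
    return problems


incoming = [
    {"plate": "CA51 ABC", "title": "Ms"},
    {"plate": "", "title": "Mr"},
    {"plate": "CF02 XYZ", "title": "Sir"},
]

# Approach 1: validate on entry, so only clean records are ever stored.
strict_db = [r for r in incoming if not violations(r)]

# Approach 2: store everything, then audit and clean afterwards.
lenient_db = list(incoming)
audit = {i: violations(r) for i, r in enumerate(lenient_db) if violations(r)}

print(len(strict_db), len(lenient_db))  # 1 3
print(audit)  # {1: ['missing key'], 2: ['title outside domain']}
```

Approach 1 keeps the database consistent at all times but loses the rejected records; approach 2 keeps every record but leaves the database inconsistent until the audit has been acted on.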
13
Measuring Data Quality
Tools (Trillium, Datanomic, etc)
Nic Caine
  Lights Out Integrity Subsystem
  Quality reporting
Redman
  Estimating data quality
Wang and Strong
  Cell level tagging
  Data quality algebra
14
Estimating Data Quality (Redman)
Estimating quality levels is difficult.
At least two methods for calculating database errors: the record and field methods
  Field method: (Number of erred fields / Total number of fields) * 100
  Record method: (Number of erred records / Total number of records) * 100
15
Estimating Data Quality
Calculating error rates:
  Field method: (6 / 50) * 100 = 12%
  Record method: (5 / 10) * 100 = 50%
Field1 Field2 Field3 Field4 Field5
Record 1
Record 2 X
Record 3 X X
Record 4
Record 5
Record 6 X
Record 7
Record 8
Record 9 X
Record 10 X
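The two figures can be reproduced with a short calculation. This is a sketch only: the number of X marks per record is read from the table above, but which field each X falls in is not recoverable here, so the column positions in the grid are an assumption. They do not affect either rate.

```python
# Erroneous fields per record, as counted from the table (Records 1 to 10).
errors_per_record = [0, 1, 2, 0, 0, 1, 0, 0, 1, 1]
FIELDS = 5

# True marks an erroneous field. Column placement is assumed (leftmost first).
grid = [[col < n for col in range(FIELDS)] for n in errors_per_record]


def field_error_rate(grid):
    """Percentage of individual fields that contain an error."""
    total = sum(len(row) for row in grid)
    erred = sum(cell for row in grid for cell in row)
    return erred * 100 / total  # multiply first to avoid float rounding


def record_error_rate(grid):
    """Percentage of records that contain at least one erroneous field."""
    erred = sum(1 for row in grid if any(row))
    return erred * 100 / len(grid)


print(field_error_rate(grid))   # 12.0
print(record_error_rate(grid))  # 50.0
```

The record method gives the higher figure because a single bad field is enough to count the whole record as erroneous.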
16
Quality Entity-Relationship Diagram (Wang & Strong)
[Diagram: Customer, Trades and Stocks, with attributes AccountNo, Name, Address, Telephone, Date, Buy_Sell, Quantity, Price, CurrentPrice, TickerSymbol, ResearchReport]
17
Quality Entity-Relationship Diagram (Wang & Strong)
[Diagram: as on the previous slide, with quality indicators attached: timeliness (twice), cost, credibility, format, interpretability]
So what goes wrong?
Some causes of poor quality data & information
19
Data Entry: Human Aspect
Unintentional errors in data entry
Lack of understanding
Poor training
Intentional incorrect data entry
  Malicious / Non-malicious
Poorly defined or out-of-date collection process
Multiple levels of data entry
Garbage in, Garbage out
20
Some examples of poor IQ
TV: INTERNET: PHONE: MOBIILE
Mizuho Securities
  December 2005
  Intended order: one share @ 610,000 yen (£2,893)
  Order entered: 610,000 shares @ 1 yen (0.47p)
  Lost over 27bn yen (previous year net profit 28.1bn yen)
  Government ordered enquiry
http://news.bbc.co.uk/1/hi/business/4512962.stm
21
Organ Donor Register
“The mistake occurred in 1999 when a coding error on driving licences wrongly specifying donors’ wishes was transferred to the organ registry.”
Last year - NHS Blood and Transplant wrote to new donors with details of consent
800,000 individuals’ details recorded incorrectly
11 April 2010
400,000 changed; 400,000 to be contacted
45 people since died; 21 incorrect donations
www.timesonline.co.uk/tol/life_and_style/health/article7094454.ece
22
Data Entry: Technical Aspect
Inaccurate measuring or counting device
Errors in the data storage process
Missing data fields
Data scanner
  Poor quality data scanner
  Inappropriate scanner
    Microfiche
    Microfilm
    Aperture cards
  Incorrect set-up
23
Herbarium Catalogue
Approx 7 million specimens
  Pressed & dried
  Preserved in spirit
30,000 per year
HerbCat
  www.kew.org/herbcat/
ePIC – electronic Plant Information Centre
  www.kew.org/epic/
24
Type Specimen
Over 350,000
Original specimen
Fixed species name & description
18th century
Reference point for botanists – applying names correctly (taxonomy & systematics)
http://www.kew.org/collections/herb_types.html
25
Random Data
“The snafu started when police used the address as part of what Browne called ‘random material’ to test an automated computer system that tracks crime complaints and records of other internal police information”
Thursday 18th March 2010 – NYPD’s Identity Theft Squad deliver cheesecake to Walter (83) and Rose (82) Martin, Brooklyn, NY
50 raids over 8 years
50 errant visits blamed on computer glitch
Apologise & explain … and to check people “weren’t using that address for identity theft”
Cops Sorry For Coming To Wrong Home 50 Times
(Associated Press & Boston Globe)
26
10 Potholes to IQ
#1 Multiple sources of the same information produce different values.
#2 Information is produced using subjective judgments, leading to bias.
#3 Systemic errors in information production lead to lost information.
#4 Large volumes of stored information make it difficult to access information in a reasonable time.
#5 Distributed heterogeneous systems lead to inconsistent definitions, formats, and values.
#6 Nonnumeric information is difficult to index.
#7 Automated content analysis across information collections is not yet available.
#8 As information consumers’ tasks and the organisational environment change, the information that is relevant and useful changes.
#9 Easy access to information may conflict with requirements for security, privacy, and confidentiality.
#10 Lack of sufficient computing resources limits access.
(Strong et al 1997)
27
Review
What is quality?
  Defining Quality & DQ
  Importance of quality data
DQ in databases
  Database design
  Database Integrity
Some examples of poor DQ and its impact
  http://www.iqtrainwrecks.com/
Measuring DQ
Managing data as product