Managing the Data Acquisition & Exchange Relationship
© Copyright 2008 Neils Michael Scofield, all rights reserved.
By Michael Scofield
Manager, Data Asset Development, ESRI, Inc., Redlands, CA
Asst. Professor, Health Information Management, Loma Linda University
Vers. 32 MSP June 9, 2008 L-3
About Michael Scofield
Michael Scofield is Manager of Data Asset Development at ESRI in Redlands, California. He is a popular speaker on topics of data management, data quality, data warehouse design, as well as satellite imagery interpretation and emergency communications. His career has included education and private industry in areas of data warehousing and data management. His articles appear in DM Review, the B-Eye Newsletter, InformationWeek magazine, the IBI Systems Journal, and other professional journals.
He has spoken to over 120 professional audiences for groups such as Data Management Assn chapters, European Metadata Conferences, Information Quality Conferences, The Data Warehousing Institute, Oracle User Groups, Institute of Internal Auditors, Assn. of Government Accountants, Quality Assurance Association chapters, Assn. for Computing Machinery and other professional and civic audiences.
Mr. Scofield is also Asst. Professor of Health Information Management at Loma Linda University.
Alternate titles:
“Managing the Data Acquisition Relationship”
“How Not to Mess Up When You Import Data”
“Data acquisition” …traditionally a term in science and engineering instrumentation.
[Diagram: Source → “data” → User]
Topics & Areas of Concern
Spelling out the relationship
Difference between data and information
Understanding specific data and information needs
Asking for the right data and finding what you need
Data value and utility
Assessing the burden on potential data providers
Scope and completeness of data
Topics (cont.)
Versioning and timeliness
Media and physical format
Compatibility of logical data architectures
Data quality assessment
Updates and refresh issues
Data collection bias
Legal issues
Continuing data flow surveillance
How do you describe a dataset?
Architecture
  What subjects (things) are described by a record
  Facts/fields/attributes/columns
  Logical data model
Scope
  What records are included or excluded on dimensions
  Dimensions: time, geography, org.
Currency
  Compared to declared scope
  Table level, and column-specific
Quality
  Precision
  Completeness (by column)
  Accuracy
However, data acquisition is much, much more.
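One of the quality measures listed on this slide, completeness by column, is easy to compute once a sample of the data is in hand. The following is a minimal sketch, not a prescribed method: the function name `completeness_by_column` and the sample records are invented for illustration, and a value counts as "missing" here only if it is `None` or an empty string.

```python
from typing import Any


def completeness_by_column(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Return the fraction of non-missing values for each column."""
    columns = sorted({name for row in rows for name in row})
    total = len(rows)
    result = {}
    for name in columns:
        # Assumption: None and "" are the only markers of a missing value.
        filled = sum(1 for row in rows if row.get(name) not in (None, ""))
        result[name] = filled / total if total else 0.0
    return result


# Hypothetical sample records, invented for illustration only.
sample = [
    {"customer_id": 1, "postal_code": "92373", "phone": None},
    {"customer_id": 2, "postal_code": "", "phone": "555-0100"},
    {"customer_id": 3, "postal_code": "92350", "phone": None},
    {"customer_id": 4, "postal_code": "92354", "phone": "555-0101"},
]

profile = completeness_by_column(sample)
for column, share in profile.items():
    print(f"{column:12s} {share:.0%}")
```

A real source may encode missing values in other ways (blanks, zeros, "N/A"), so the test for "missing" usually has to be agreed with the data provider.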
Introduction
Why talk about this?
Because…
…we want more and more data, and we don’t generate it all ourselves.
So… we acquire it somewhere else.
Never a simple flow of data!
[Diagram: Source → “data” → User (“target”), joined by a Relationship]
Expectations:
  subjects covered by data
  scope of data
  quality of data
  currency of data
Expectations:
  money
  how you use data
  burden others?
Often forgotten topics:
  Updates and refresh
  Corrections
  Documentation
  Other measures of quality
Terms:
  Usage rights
Kinds of data “flows”
Trigger events:
  Elapsed time (day, week, month, sub-day)
  Source business event (usually a transaction)
  Target business event (transaction makes request for limited data; e.g. balance check)
  Human decision (e.g. BI)
  Record growth trigger (e.g. every 5,000 records in a source transaction file)
“Push” vs. “pull”: When the trigger happens, which side does the heavy work?
  Pull: Data requestor sends query to source database.
  Push: Data host compiles a data file and sends it.
[Diagram: an import application in the target environment sends a query to the source application database and receives results]
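The push and pull styles can be contrasted in a few lines of code. This is a rough sketch under stated assumptions: the `Source` class, its `deliver` callback, and the batch size of 3 (standing in for the slide's 5,000 records) are all hypothetical, chosen only to make the record growth trigger visible.

```python
from typing import Callable

Record = dict


class Source:
    """Hypothetical data host holding a growing transaction file."""

    def __init__(self, batch_size: int, deliver: Callable[[list[Record]], None]):
        self.records: list[Record] = []
        self.batch_size = batch_size
        self.deliver = deliver
        self._sent = 0  # number of records already pushed to the target

    def add(self, record: Record) -> None:
        self.records.append(record)
        # Push: the record growth trigger fires on the source side,
        # and the source does the work of compiling and sending the batch.
        if len(self.records) - self._sent >= self.batch_size:
            self.deliver(self.records[self._sent:])
            self._sent = len(self.records)

    def query(self, predicate: Callable[[Record], bool]) -> list[Record]:
        # Pull: the requestor decides what to ask for, and when.
        return [r for r in self.records if predicate(r)]


received_batches: list[list[Record]] = []
source = Source(batch_size=3, deliver=received_batches.append)
for i in range(7):
    source.add({"id": i, "amount": 10 * i})

pulled = source.query(lambda r: r["amount"] >= 40)
print([len(batch) for batch in received_batches])
print([r["id"] for r in pulled])
```

Note that the seventh record has not been pushed yet when the loop ends; a pull at that moment sees it, which is one practical difference in currency between the two styles.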
Flows exist in many places
[Diagram: data flows among a data supplier, enterprise applications (Appl. A, Appl. B, Appl. G), un-coordinated applications, an acquired division, a data warehouse (DW), business intelligence, and an outside data user]
Each source has a data architecture
[Diagram: the same enterprise flow diagram (data supplier, Appl. A, Appl. B, Appl. G, un-coordinated applications, acquired division, DW, business intelligence, outside data user), annotated with “Expectations” and “Constraints”]
What is data architecture?
The logical and semantic structure of the business (or that part of the business) and the data which describes and supports it.
Described by a data model
  Subject entities
  Relationships
  Attributes
  Entity-relationship diagram
Is abstract (not understood by many)
Can be complex
Each FLOW has a data architecture
[Diagram: the same enterprise flow diagram (data supplier, Appl. A, Appl. B, Appl. G, un-coordinated applications, acquired division, DW, B.I., outside data user), annotated with “Expectations”]
Enterprise-captured data life cycle
[Diagram: transaction-based data capture → business application → business database → archive and DW, with feeds to other in-house applications, executive summary reports, and export]
Data derivation & enhancement:
  Association with own history
  Integration with other lateral data
  Computing derived data (ratios, aggregates, etc.)
Reasons to import data
Enhance an internal DW for support of improved executive decision-making.
Bolster operational data resources independent of the data exchange relationship.
Engage in new business processes involving a B2B partnership formed through data exchange.
E-discovery: litigation
[Diagram: DW and B.I.]
Reasons to import data
[Diagram: DW and B.I.]
Timing:
  Periodic big batch files: daily, weekly, monthly, etc.
  Transaction-driven: “micro” data flows (SOA)
  One-time
Spelling out the relationship
Introduction
Spelling out the Relationship
Data & information
Universe of knowledge
Asking for the right data
Potential data providers
Physical forms and media
Logical data architecture
Semantics & meaning
Documentation & metadata
Scope & completeness
Fund. of data quality
Update & refresh issues
Data collection bias
Ownership & legal
Confidentiality
Data flow surveillance
Conclusion
Key questions:
What are your expectations?
What are your uses of the data?
What motivates the source to give it to you?
What are the political-cultural barriers between you and the source?
What are your expectations of…
  quality, completeness, currency
  media
  updates and refresh
How can you strengthen the relationship?
Political & cultural barriers
Separate system: you | them
Peer division or department: you | them
Totally unrelated legal entity: you | them
“Information is power!” People don’t want to give up power.
Typical risks and surprises
To save money, the source does not maintain previous quality in data capture and processing. Updates show lower quality.
To expand its market, the source alters the logical and physical data architecture without telling you.
In response to business morphing pressures, the source alters the coding scheme for one or more fields.
The source discovers some errors, but does not inform you of them, nor supply you with corrections or corrected records.
Mitigating strategies
Spell out all expectations about the data.
Develop language, words, & models to enhance precision of communication about data expectations.
Rigorous testing of data prior to purchase
Strengthen relationship through cooperative data testing strategies
  Offer to test their updates
  Provide non-threatening feedback on DQ
  Get source to seek you out as consultant on DQ (this will allow you to monitor their morphing pressures)
Data & information
Structured data and unstructured data
What makes data (information) useful?
Data vs. information
Data: simple (single) observation, fact, or declaration
Information: data (facts) with context to be more meaningful and useful
“Knowledge: valuable information from the human mind”
For many thinkers, there is a subtle, almost philosophical difference between data and information.
Initial definitions
Reality: Things and events.
Data: A single observation about reality, clearly defined.
Information: One or more items of data, with definition and context to make it meaningful.
Knowledge: Simultaneous awareness of much information, and ability to cognitively integrate it.
Wisdom: Knowing not to sleep through this lecture.
Structural elements of tabular “data”
[Diagram: a piece of data (“a fact”, a.k.a. “cell”) sits within a record, within a table, within a database]
What are you seeking? A fact, a record, a table, or a database?
Acquiring data or information?
[Diagram: a spectrum of tabular, semi-structured, and unstructured sources, with examples: Cartesian dataset, multi-table database, web page, raster, text document, diary, memoirs]
The web is not a source! It is a medium!
Data vs. meaning
Are these the same data?

Source A
Name             Address
Lucy Davis       41 Main St.
Franz Kraemer    532 Elm Ave. Apt G
Alex Karnov      563-A Pine Street
Glenn Pratt      78 Mills Lane
David Orr        587 New York Ave.
Peter Vines      798 Wisconsin Ave.
Sally Forth      21 Market St.
Adam Karr        487 Riverside Dr.

Source B
Name             Address
LUCY DAVIS       41 MAIN ST.
FRANZ KRAEMER    532 ELM AVE. APT G
ALEX KARNOV      563-A PINE STREET
GLENN PRATT      78 MILLS LANE
DAVID ORR        587 NEW YORK AVE.
PETER VINES      798 WISCONSIN AVE.
SALLY FORTH      21 MARKET STREET
ADAM KARR        487 RIVERSIDE DR.

Same meaning? Yes. But not the same data.
Mixed case is difficult to derive correctly from ALL CAPS.
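A quick way to see how far apart the two sources really are is to compare them with case ignored. The sketch below uses the names and addresses from the slide; the helper `same_ignoring_case` is an illustrative name, and the “MCDONALD” surname at the end is a made-up example, not part of either source.

```python
source_a = [
    ("Lucy Davis", "41 Main St."),
    ("Franz Kraemer", "532 Elm Ave. Apt G"),
    ("Alex Karnov", "563-A Pine Street"),
    ("Glenn Pratt", "78 Mills Lane"),
    ("David Orr", "587 New York Ave."),
    ("Peter Vines", "798 Wisconsin Ave."),
    ("Sally Forth", "21 Market St."),
    ("Adam Karr", "487 Riverside Dr."),
]
source_b = [
    ("LUCY DAVIS", "41 MAIN ST."),
    ("FRANZ KRAEMER", "532 ELM AVE. APT G"),
    ("ALEX KARNOV", "563-A PINE STREET"),
    ("GLENN PRATT", "78 MILLS LANE"),
    ("DAVID ORR", "587 NEW YORK AVE."),
    ("PETER VINES", "798 WISCONSIN AVE."),
    ("SALLY FORTH", "21 MARKET STREET"),
    ("ADAM KARR", "487 RIVERSIDE DR."),
]


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after folding case."""
    return a.casefold() == b.casefold()


# Records that still differ once case is ignored.
mismatches = [
    name_a
    for (name_a, addr_a), (name_b, addr_b) in zip(source_a, source_b)
    if not (same_ignoring_case(name_a, name_b) and same_ignoring_case(addr_a, addr_b))
]
print(mismatches)

# Going the other way is lossy. Hypothetical surname, not from the slide:
# naive title-casing cannot know the intended capitalization.
print("MCDONALD".title())
```

Folding to one case is safe because it only discards information. Only one record still differs afterwards, and that difference is an abbreviation (“St.” vs. “STREET”), not case. Rebuilding mixed case would require knowledge the ALL CAPS source no longer carries.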
Universe of knowledge, information, & data
Structured vs. unstructured data
Structured data
  Most tabular databases: business, government, science & research
  Can fit into RDBMS
Unstructured data
  Personal letters
  Memoirs, diaries
  Literature (history, poetry, fiction)
  Most books
  Still images (paintings, photos, x-ray, ultrasound)
  Sounds (sound recordings, EKG, SOSUS)
  Moving images (cinema, TV, etc.)
Structured vs. unstructured data
Structured data | Unstructured data
Geospatial data (GIS data):
  Raster: imagery, topos
  Vector data: streets, areas; “points, lines, polygons”
Parsing and processing data
Tabular data: Computers are good at processing. SQL, relational model, etc.
Unstructured data: Humans are good at processing. Memory, free association.
Processing unstructured data
Unstructured data: Humans are good at processing. Memory, free association.
Examples:
  Hearing classical music, and correctly guessing the composer.
  Recognizing the signature style of an oil painting.
  Recognizing voices.
  Reading emotions on faces.
  Understanding incomplete sentences.
  Seeing humor (intended and not).
[Image: Star Wars]
Asking for the right data
…or…
Asking for the right information
Who is the first user? …the final user?
Analytical support of macro-decisions
  Data warehouse and business intelligence
  Probably to be manipulated by analysts
  High-level decision-maker will use final output
Operational business system (micro-decisions)
  Geocoding customers
  CRM
  Oil exploration
  Agricultural field characteristics
Pure, undirected research
Discovery for litigation
What do decision-makers want?
Data or information?
“Yeah, we got data. Lots of data!”
[Slide background: a wall of binary digits (010011010111001001111011…)]
Always strive to make information more useful to the recipient!
Basketball scores
Los Angeles LXXIV
San Antonio LXVII
Detroit LXXXV
Boston LXXIII
Seattle LXXV
Phoenix LXXIX
Data vs. expression
Executive may ask for this:

Sales division    % sales to minorities
NORTHEAST         12.3
SOUTHEAST         39.1
MIDWEST           21.3
SOUTHWEST         17.6
PACIFIC           14.9
TOTAL U.S.        20.8

Are you going to ask for just six records from your source?
No! Why?
This information (report) has a high probability of being inadequate. The executive will inevitably ask for more.
Supporting macro-decisions is iterative.
[Diagram: external sources and internal sources feed, via ETL, a data warehouse and data mart(s). Knowledge worker(s) use B.I. tools in an iterative process of filtering, aggregation, exploration, correlation, and analysis. Sample outputs: the “% sales to minorities” report by sales division; a chart “Manufacturing as share of total employment”, 1950 to 2010, with 32.1 % and 11.7 % labeled; a chart “Share of consumption by category”, 1929 vs. 2001]
Raw data vs. derived data
You always want raw data, at the most granular level possible!
No ratios or averages: they can NOT be aggregated.

Derived data
Country        Pop Dens
Belgium           340.0
France            111.3
Germany           230.9
Italy             193.0
Netherlands       397.1
Spain              80.0
Switzerland       182.2
Avg = 219.2

Raw data
Country         Population      Sq Km   Pop Dens
Belgium         10,379,067     30,528      340.0
France          60,876,136    547,030      111.3
Germany         82,422,299    357,021      230.9
Italy           58,133,509    301,230      193.0
Netherlands     16,491,461     41,526      397.1
Spain           40,397,842    504,782       80.0
Switzerland      7,523,934     41,290      182.2
Total          276,224,248  1,823,407      151.5

Population density: the average of the seven densities (219.2) does not equal the density of the combined area (151.5).
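The arithmetic behind this slide can be reproduced directly from the raw columns. A minimal sketch, using the population and area figures from the table; the variable names are illustrative.

```python
# (population, area in sq km) per country, as given in the raw-data table.
countries = {
    "Belgium": (10_379_067, 30_528),
    "France": (60_876_136, 547_030),
    "Germany": (82_422_299, 357_021),
    "Italy": (58_133_509, 301_230),
    "Netherlands": (16_491_461, 41_526),
    "Spain": (40_397_842, 504_782),
    "Switzerland": (7_523_934, 41_290),
}

# Derived data: one ratio per country.
densities = {name: pop / area for name, (pop, area) in countries.items()}

# Averaging the ratios weights tiny and huge countries equally.
mean_of_densities = sum(densities.values()) / len(densities)

# Aggregating the raw columns first gives the density of the combined area.
total_population = sum(pop for pop, _ in countries.values())
total_area = sum(area for _, area in countries.values())
combined_density = total_population / total_area

print(f"mean of densities: {mean_of_densities:.1f}")
print(f"combined density:  {combined_density:.1f}")
```

With only the density column, the combined figure cannot be recovered at all; with the raw population and area columns, both numbers are available.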
Anticipate the analysis and information delivery.
Have data analysis tools ready.
Output will be iterative.
Best output allows for graphic analysis.
Time series are valuable…
…but require history.
Don’t neglect history when asking for data.
Why trend graphs (a.k.a. “time series”)?
257 deaths per 100,000 persons due to heart disease in CY-2000
This statistic alone lacks meaning!
We must give it context!
[Line chart: Deaths from heart disease in U.S., in deaths per 100,000 persons, 1960 to 2010; labeled values: 559 deaths per 100,000 and 257]
How do executives make decisions?
Cognitive vs. feelings

                 Week 1                        Week 2                        Week 3
Product line     Mon   Tue   Wed   Thu   Fri   Mon   Tue   Wed   Thu   Fri   Mon   Tue   Wed
Peas            45.6  46.6  47.1  48.1  51.4  43.6  44.1  45.5  45.1  48.4  47.9  46.2  47.1
Carrots         20.4  20.8  21.1  21.5  22.9  19.5  19.7  20.3  20.1  21.6  21.4  20.6  21.1
Tomatoes        75.8  77.4  78.2  79.8  85.2  72.3  73.2  75.5  74.8  80.3  79.6  76.6  78.2
Cucumbers       21.4  21.8  22.1  22.5    24  20.4  20.6  21.3  21.1  22.6  22.4  21.6  22.1
Green beans     35.9  36.7  37.1  37.9  40.4  34.3  34.7  35.8  35.5  38.1  37.7  36.3  37.1
Corn            57.3  58.5  59.1  60.4  64.5  54.7  55.3  57.1  56.6  60.7  60.2  57.9  59.1
Asparagus       2.91  2.98  3.01  3.07  3.28  2.78  2.81   2.9  2.88  3.09  3.06  2.95  3.01
Broccoli        13.6  13.9    14  14.3  15.3    13  13.1  13.6  13.4  14.4  14.3  13.7    14
Oranges           69  70.4  71.2  72.6  77.6  65.8  66.6  68.7  68.1  73.1  72.4  69.7  71.2
Lemons          10.7  10.9    11  11.3    12  10.2  10.3  10.7  10.6  11.3  11.2  10.8    11
Pineapple       27.2  27.8  28.1  28.6  30.6    26  26.3  27.1  26.9  28.8  28.6  27.5  28.1
Lettuce         94.2  96.2  97.2  99.2   106  89.9    91  93.9  93.1  99.8    99  95.3  97.2
Garlic          1.94  1.98  2.01  2.05  2.19  1.85  1.88  1.94  1.92  2.06  2.04  1.96  2.01
Guava           0.97  0.99     1  1.02  1.09  0.93  0.94  0.97  0.96  1.03  1.02  0.98     1
Blackberries    2.91  2.98  3.01  3.07  3.28  2.78  2.81   2.9  2.88  3.09  3.06  2.95  3.01
Strawberries    7.77  7.93  8.02  8.18  8.74  7.42   7.5  7.75  7.68  8.23  8.16  7.86  8.02
Blueberries     27.2  27.8  28.1  28.6  30.6    26  26.3  27.1  26.9  28.8  28.6  27.5  28.1
Raspberries     11.7  11.9    12  12.3  13.1  11.1  11.3  11.6  11.5  12.3  12.2  11.8    12
Boysenberries   8.74  8.93  9.02  9.21  9.83  8.34  8.44  8.71  8.63  9.26  9.18  8.84  9.02
TOTAL            535   546   552   564   602   511   517   534   529   567   562   541   552

(The columns for Week 3 Thu and Fri and for Week 4 Mon and Tue are empty; their totals show 0.)

When executives ask for data or information, be sure they understand the total costs.
Tables (raw data) are hard to understand
U.S. Monthly unemployment statistics

Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1980   6.3   6.3   6.3   6.9   7.5   7.6   7.8   7.7   7.5   7.5   7.5   7.2
1981   7.5   7.4   7.4   7.2   7.5   7.5   7.2   7.4   7.6   7.9   8.3   8.5
1982   8.6   8.9     9   9.3   9.4   9.6   9.8   9.8  10.1  10.4  10.8  10.8
1983  10.4  10.4  10.3  10.2  10.1  10.1   9.4   9.5   9.2   8.8   8.5   8.3
1984     8   7.8   7.8   7.7   7.4   7.2   7.5   7.5   7.3   7.4   7.2   7.3
1985   7.3   7.2   7.2   7.3   7.2   7.4   7.4   7.1   7.1   7.1     7     7
1986   6.7   7.2   7.2   7.1   7.2   7.2     7   6.9     7     7   6.9   6.6
1987   6.6   6.6   6.6   6.3   6.3   6.2   6.1     6   5.9     6   5.8   5.7
1988   5.7   5.7   5.7   5.4   5.6   5.4   5.4   5.6   5.4   5.4   5.3   5.3
1989   5.4   5.2     5   5.2   5.2   5.3   5.2   5.2   5.3   5.3   5.4   5.4
1990   5.4   5.3   5.2   5.4   5.4   5.2   5.5   5.7   5.9   5.9   6.2   6.3
1991   6.4   6.6   6.8   6.7   6.9   6.9   6.8   6.9   6.9     7     7   7.3
1992   7.3   7.4   7.4   7.4   7.6   7.8   7.7   7.6   7.6   7.3   7.4   7.4
1993   7.3   7.1     7   7.1   7.1     7   6.9   6.8   6.7   6.8   6.6   6.5
1994   6.6   6.6   6.5   6.4   6.1   6.1   6.1     6   5.9   5.8   5.6   5.5
1995   5.6   5.4   5.4   5.8   5.6   5.6   5.7   5.7   5.6   5.5   5.6   5.6
1996   5.6   5.5   5.5   5.6   5.6   5.3   5.5   5.1   5.2   5.2   5.4   5.4
1997   5.3   5.2   5.2   5.1   4.9     5   4.9   4.8   4.9   4.7   4.6   4.7
1998   4.6   4.6   4.7   4.3   4.4   4.5   4.5   4.5   4.6   4.5   4.4   4.4
1999   4.3   4.4   4.2   4.3   4.2   4.3   4.3   4.2   4.2   4.1   4.1     4
2000     4   4.1     4   3.8     4     4     4   4.1   3.9   3.9   3.9   3.9
2001   4.2   4.2   4.3   4.4   4.3   4.5   4.6   4.9     5   5.3   5.5   5.7
2002   5.7   5.7   5.7   5.9   5.8   5.8   5.8   5.7   5.7   5.7   5.9     6
2003   5.8   5.9   5.9     6   6.1   6.3   6.2   6.1   6.1     6   5.8   5.7
2004   5.7   5.6   5.8   5.6   5.6   5.6   5.5   5.4   5.4   5.5   5.4   5.4
2005   5.2   5.4   5.2   5.1   5.1     5     5   4.9   5.1     5     5   4.8
2006   4.7   4.7   4.7   4.7   4.7   4.6   4.7   4.7   4.5   4.4   4.5   4.4
2007   4.6   4.5   4.4   4.5   4.5   4.6   4.7   4.7   4.7   4.8   4.7     5
2008   4.9   4.8   5.1     5   5.5
Unemployment
[Line chart: U.S. unemployment rate, seasonally adj., Jan-92 through Jan-08, with the Clinton and Bush II presidencies marked. Source: Bureau of Labor Statistics web site]
Placing data points into context yields information!
Surround your requested data points with context!
  Time series
  Peer data
  Causal factors
  Breakdown / drilldown
  Graphical expression
All these require many more data points than the executive originally requested, on nearly every dimension.
Choices in detail of data
Original or derivative
Granular or summary
Filtered or not
Translated or not
Data is always easier to aggregate than to disaggregate!
It is always easier to filter out unneeded data than to request more data later.
Converting data to information
A query and reporting tool is required.
Needed functions:
  Aggregation
  Sorting and filtering
  Association and joining
  Clustering and stratification
  Graphics
Converting data to information
Raw data: 200 deaths from TB in Baker County, CY-2004
Add context: 50,000 avg. population, Baker County, CY-2004
Compute ratio: 4 deaths per 1,000 pop., Baker Co., CY-2004
Add context (peer counties):

County      TB deaths   Population   TB rate
Adams             128       21,490       6.0
Baker             200       50,000       4.0
Carswell           87       17,215       5.1
Davis             189       41,200       4.6
Eaton             200       38,000       5.3

Conclusion: Baker County has the lowest TB rate of 5 peer counties.
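The steps on this slide (raw count, population context, ratio, peer comparison) amount to a few lines of arithmetic. A small sketch using the county figures from the table; the variable names are illustrative.

```python
# (TB deaths, average population) per county for CY-2004, from the slide.
counties = {
    "Adams": (128, 21_490),
    "Baker": (200, 50_000),
    "Carswell": (87, 17_215),
    "Davis": (189, 41_200),
    "Eaton": (200, 38_000),
}

# Compute ratio: deaths per 1,000 population, rounded to one decimal.
rates = {
    name: round(deaths / population * 1000, 1)
    for name, (deaths, population) in counties.items()
}

# Add context: rank the county of interest against its peers.
lowest = min(rates, key=rates.get)

for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:10s} {rate:4.1f}")
print("lowest rate:", lowest)
```

Baker and Eaton report the same raw count of 200 deaths, yet their rates differ, which is why the raw figure alone says little.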
Converting data to information
Raw data → add context → compute ratio → add context → useful information
[Bar chart: TB rate by county. Adams 6.0, Baker 4.0, Carswell 5.1, Davis 4.6, Eaton 5.3]
But time series is even better.
[Line chart: TB rates by county, 1999 to 2005, for Adams, Baker, Carswell, Davis, and Eaton, with Baker County called out]
Can precision be distracting?Can precision be distracting?
Assets (U.S. dollars)
Current assets:
  Cash and cash equivalents                    12,568,197,382.24
  Marketable securities                         1,118,075,118.52
  Notes and accts receivable                    9,540,118,972.94
  Short-term financing receivables             13,750,181,442.41
  Other accounts receivable                     1,138,348,791.55
  Inventories                                   2,841,211,897.62
  Deferred taxes                                1,765,108,773.94
  Prepaid expenses and other current assets     2,941,012,486.33
  Total current assets                         45,662,254,865.55

Rounded to $ mil.:

Assets ($ mil.)
Current assets:
  Cash and cash equivalents                    12,568
  Marketable securities                         1,118
  Notes and accts receivable                    9,540
  Short-term financing receivables             13,750
  Other accounts receivable                     1,138
  Inventories                                   2,841
  Deferred taxes                                1,765
  Prepaid expenses and other current assets     2,941
  Total current assets                         45,662

Source: IBM Annual Report, 2005. Pennies contrived.
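
A small sketch of the rounding step, assuming Python's `decimal` module and half-up rounding (the slide does not say which rounding rule was used). The amounts are the ones shown above; the helper name `to_millions` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Line items copied from the slide (pennies contrived, per the source note).
line_items = {
    "Cash and cash equivalents": Decimal("12568197382.24"),
    "Marketable securities": Decimal("1118075118.52"),
    "Notes and accts receivable": Decimal("9540118972.94"),
    "Short-term financing receivables": Decimal("13750181442.41"),
    "Other accounts receivable": Decimal("1138348791.55"),
    "Inventories": Decimal("2841211897.62"),
    "Deferred taxes": Decimal("1765108773.94"),
    "Prepaid expenses and other current assets": Decimal("2941012486.33"),
}

MILLION = Decimal("1000000")


def to_millions(amount):
    """Round a dollar amount to whole millions (half-up is an assumption)."""
    return int((amount / MILLION).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


total = sum(line_items.values())
rounded_items = {name: to_millions(value) for name, value in line_items.items()}
rounded_total = to_millions(total)

for name, value in rounded_items.items():
    print(f"{name:<44}{value:>8,}")
print(f"{'Total current assets':<44}{rounded_total:>8,}")
```

One thing to watch: the rounded line items add up to 45,661, while the rounded total is 45,662. Round the total from the exact figures rather than summing rounded components.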

How much scope?
My fundamental bias:
Get as much as you can get for the same price.
Time
Organizational
Cost is mainly labor: creating the extract file.
Same labor for getting 4 years of history as 2 years.
Media and storage costs are trivial.

Why more data?
Testing
Continuity of definitions over time.
Reasonableness of row counts, etc.
Test predictive models on historical data.
Decision-makers will expand scope of query later.
Context! You can never have too much context!

Options in granularity

Raw data file (line item detail):

Date        Customer  Product  Qty  Un Price  Ext Price
1/4/2007    4781      1004      60     37.81   2,268.60
1/5/2007    4780      1004      20     37.81     756.20
1/6/2007    4779      1005      37     13.98     517.26
1/7/2007    4778      1006      10     28.99     289.90
1/8/2007    4781      1004      15     37.81     567.15
1/9/2007    4780      1005      20     13.98     279.60
1/10/2007   4779      1006      15     28.99     434.85
1/11/2007   4778      1006      12     28.99     347.88
1/12/2007   4781      1005      18     13.98     251.64
1/13/2007   4780      1004      24     37.81     907.44
1/14/2007   4779      1006      30     28.99     869.70
1/15/2007   4778      1006      12     28.99     347.88
1/16/2007   4781      1007      10     32.18     321.80
1/17/2007   4780      1006      30     28.99     869.70
1/18/2007   4779      1007      18     32.18     579.24
1/19/2007   4778      1004      12     37.81     453.72
1/20/2007   4781      1004       6     37.81     226.86
1/21/2007   4780      1005      22     13.98     307.56
1/22/2007   4779      1006      18     28.99     521.82
1/23/2007   4778      1007      37     32.18   1,190.66
1/24/2007   4781      1005      12     13.98     167.76
1/25/2007   4780      1004      20     37.81     756.20
1/26/2007   4779      1006      15     28.99     434.85
1/27/2007   4778      1007      10     32.18     321.80

Derivative data files:

Product summary
Product  Qty sold
1004     157
1005     109
1006     142
1007      75

Customer value
Customer  Revenue
4778      2,951.84
4779      3,357.72
4780      3,876.70
4781      3,803.81

Prod./cust summary
Customer  Product  Revenue
4778      1004       453.72
4778      1006       985.66
4778      1007     1,512.46
4779      1005       517.26
4779      1006     2,261.22
4779      1007       579.24
4780      1004     2,419.84
4780      1005       587.16
4780      1006       869.70
4781      1004     3,062.61
4781      1005       419.40
4781      1007       321.80
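
How the derivative files fall out of the line-item detail can be sketched as below. To keep it short, only the first six rows of the raw data file are used, so the totals are smaller than those on the slide. Variable names are illustrative, not from the presentation.

```python
from collections import defaultdict
from decimal import Decimal

# First six line items of the raw data file:
# (customer, product, qty, extended price).
line_items = [
    (4781, 1004, 60, Decimal("2268.60")),
    (4780, 1004, 20, Decimal("756.20")),
    (4779, 1005, 37, Decimal("517.26")),
    (4778, 1006, 10, Decimal("289.90")),
    (4781, 1004, 15, Decimal("567.15")),
    (4780, 1005, 20, Decimal("279.60")),
]

qty_by_product = defaultdict(int)             # "Product summary"
revenue_by_customer = defaultdict(Decimal)    # "Customer value"
revenue_by_cust_prod = defaultdict(Decimal)   # "Prod./cust summary"

for customer, product, qty, ext_price in line_items:
    qty_by_product[product] += qty
    revenue_by_customer[customer] += ext_price
    revenue_by_cust_prod[(customer, product)] += ext_price

print(dict(qty_by_product))
print(dict(revenue_by_customer))
```

Every summary can be rebuilt from the detail, but not the other way around. That is the argument for asking the provider for the finest granularity available.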

Potential data providers: your impact upon them
Introduction
Spelling out the Relationship
Data & information
Universe of knowledge
Data coming from bureaucracies
Asking for the right data
Potential data providers
Physical forms and media
Logical data architecture
Semantics & meaning
Documentation & metadata
Scope & completeness
Fund. of data quality
Update & refresh issues
Data collection bias
Ownership & legal
Confidentiality
Data flow surveillance
Conclusion

Key questions:
Does the data come from their operations?
Do they log business transactions adequately?
Do they log changes to kernel-stable entities adequately?
What “enhancements” must they make to their application to extract the data you desire?
What cutoff policies do they have on transactions?

Kinds of data source organizations
Selling (providing) data is a sideline to their primary business:
  Banks
  Credit card issuers
  Healthcare orgs
  Insurance
  Retailers
  Airlines
  Telephone
Selling data is a major source of revenue:
  Credit bureaus (Equifax, Experian, TransUnion)
  Marketing companies (D&B, DMA)
  Suppliers of maps, imagery
  News orgs (UPI)
  Knowledge sellers (LexisNexis)
Sharing data is a cultural value, not for revenue:
  Government agencies
  Academic research
  NGOs

Kinds of data source organizations
Selling (providing) data is a sideline to their primary business
Selling data is a major source of revenue
Sharing data is a cultural value, not for revenue
May sell you the data, but more guarded about the documentation.
Writing external data documentation is an annoyance!
Simpler datasets require less semantic documentation.
These people ought to get the documentation right and rich!
Ask to see it first.

What burden will you place on the data provider?
Depends upon…
how they store and manage their data
…and…
your needs of scope, architecture, timing, quality.

Two kinds of data generation
Data as byproduct of business processes
  commercial sector:
    banking
    manufacturing
    retail sales
    customer service activities (utilities, communications, etc.)
    hospital patient records & billing
    insurance policy setup and claims
    education: student enrollment, grades, etc.
  governments:
    social welfare and public assistance
    tax collection
    city services (trash, utilities)
    voting
    public libraries (patron activity)
Data gathered as non-business research
  field surveys of land, topo, etc.
  observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies
  satellite & aerial imagery
Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.

Two kinds of data generation
Data as byproduct of business processes:
  Captured through business applications
  Complex logical data architectures
  May not have complete logging
  Data extract may be a logistical and programming burden
  Generally must be done for DW.
Data gathered as non-business research:
  Captured through special studies
  Generally simple logical data architectures
  Not stored in application databases
  Easier to extract and make copies

Research data
Data gathered as non-business research
field surveys of land, topo, etc.
observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies
satellite & aerial imagery
Simple logical data architecture

Research data
Data gathered as non-business research
Example: interview-based research (census)
Interviewee
Family
Residence
Employment
Simple logical data architecture

Research data
Data gathered as non-business research
Example: cancer research study
Entities: Patient, Family, Examination & diagnosis, Hospital stay, Treatment
Entity kinds labeled in the diagram: kernel-stable, episode, event
Simple logical data architecture

Data created in business processes
[Diagram: users send business transactions to the business application software, which performs database reads and writes against the application database.]

Application characteristics
Business application software
Built to facilitate business operations.
Data captured to support ops.
Has a logical data architecture: you need to understand it.
Generally designed to meet on-line performance expectations.
Memory (versioning) often not important on many entities (particularly customers).
Business will not stop (“freeze”) for you to extract data.

Application database characteristics
Architecture supports application.
Hopefully well-normalized.
May or may not include business event logging.
DBMS: IMS, relational, network, flat
Application database

Application database characteristics
Application database contents:
  master files (kernel-stable)
  transactions (events)
  change logs (business logs or DBMS logs)

Over-time life cycle of subject entities
long lives: “kernel-stable”
limited life: “episode”
point-in-time transactions: “events”

Kernel-stable entities
customers
parties
people
departments
products
services
facilities
vehicles
ships and aircraft
library holding
properties
cost centers
accounts (bank, credit card, G/L)
corporation
institution
groupings of above ***

Episode-like entities
hospital stay
subscription
illness
maintenance & support contract
employment period
project
library check-out
hotel room stay
prison sentence
unemployment benefit
conference registration
college enrollment
phone call (successful)
accounting period

Event or transaction entities
customer's order
shipment
invoice
G/L posting
phone call (failed)
sale of asset
treatment
test or observation
airline flight
inquiry
turnstile passage
application for college
graduation
credit card charge
paycheck

Kernel-stable entities
Examples: customers, parties, people, departments, products, services, facilities, vehicles, library holding, property, cost center, GL account, institution, groupings
Stable identity and existence.
2 kinds of changes:
  change to existence or ID
  change to non-key attribute
Changes to attributes occur rarely.
Are such changes logged in the application?
Are all changes logged?
Is versioning valued?

Episode-like entities
Examples: hospital stay, subscription, illness, contract for service, employment period, project, library check-out, hotel room stay, prison sentence, unemployment ben., conf. registration, college enrollment, phone call
Always exist over a finite period of time.
End-point not always known.
Often confused with the starting event. (may have same key)
May have many kinds of subordinate events.
May have subordinate episodes.
May or may not be mutually exclusive with peers.

Event or transaction entities
Examples: customer's order, shipment, invoice, G/L posting, phone call (failed), sale of asset, treatment, test or observation, airline flight, turnstile passage, application, graduation, cr. card charge, paycheck
Not designed to last a long time.
Generally only one key date/time.
Revisions may occur, but rare.
May be negated or reversed by subsequent transaction.
Mutually exclusive with peers.
May be subordinate to one or more episodes.
Almost always subordinate to other kernel entity(s).

Event or transaction entities
Examples: customer's order, shipment, invoice, G/L posting, phone call (failed), sale of asset, treatment, test or observation, airline flight, turnstile passage, application, graduation, cr. card charge, paycheck
BIG QUESTION!
Can records be changed (updated, corrected) after creation?
This has profound consequence upon your over-time updating of your copy of the data.

Basic accounting entities
[Diagram: balance sheet (Jan. 1, 2005) → accounting period (episode) → balance sheet (Dec. 31, 2005); events occur within the period.]
All accounting data are either…
1. events (postings)
2. aggregates of events over a time period (episode), …or…
3. statement of condition at a point in time (balance sheet).

Confusing episodes and events
[Diagram: a “book checked out” episode begins with the check-out event and runs to the expected return date.]
Other ambiguities:
incarceration
phone call
hospital admission
airline flight

Episodes can contain events
Hotel room stay > Charge
Hospital stay > Test, Medication
Project > Tasks > Labor charges
War > Campaign > Battle > Casualty
Episodes may contain “sub-episodes”

Data created in business processes
[Diagram: users send business transactions to the business application software, which performs database reads and writes against the application database.]
What data do you need from this environment?
What timing? once, continuous
How do you expect it to be extracted?
Who is going to make that happen?
Architecture and application design are often barriers to sharing data in an organization.

Are your data needs 1-time, or continuous?
[Timeline, Jan. through May: a big first extract and load, followed by over-time “refresh” or update.]
What data do you want in “updates”?
Incremental or complete refresh?
How about corrections?

Full refresh vs. incremental updates
Full refresh (month-end copies: Jan., Feb., Mar., …):
  Simple extract for source: delete and reload target.
  Consumes more resources.
  You don’t know what changed, and what was deleted.
Incremental updates (initial load, then over-time “refresh” or update: Jan., Feb., Mar., Apr., May):
  Complex processing both at source and at target.
  Resource (bandwidth) efficient.
  You can measure rate of change easily.
Both may require coding and paradigm translation.
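
If the provider only sends full refreshes, the receiver can still work out what changed by comparing consecutive snapshots. A minimal sketch, assuming each snapshot is a mapping from business key to record; the customer keys and addresses are made up for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two full-refresh snapshots keyed by business key.

    Returns (new, changed, deleted) dictionaries.
    """
    new = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    changed = {
        k: (previous[k], current[k])
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return new, changed, deleted


# Hypothetical month-end copies of a customer address table.
jan = {4778: "123 Main St.", 4779: "77 Oak Ave.", 4780: "9 Pine Rd."}
feb = {4778: "548 Elm St.", 4779: "77 Oak Ave.", 4781: "12 Lake Dr."}

new, changed, deleted = diff_snapshots(jan, feb)
print("new:", new)
print("changed:", changed)
print("deleted:", deleted)
```

The comparison has limits. It sees only the net difference between two copies, misses anything that changed and changed back in between, and cannot tell an update from a correction.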

If logging takes place… where?
[Diagram: users send business transactions to the business application software, which performs database reads and writes against the application database.]
Two candidate places:
  Business event logging
  DBMS technical logging

How are backups taken?
[Diagram: users send business transactions to the business application software, which performs database reads and writes against the application database; a DBMS backup is taken for archive and recovery.]

How are backups taken?
[Diagram: as before, with a DBMS backup taken for archive and recovery.]
Extract business data from technical backup
Business-readable full refresh

Is there a data warehouse?
[Diagram: users send business transactions to the business application software, which performs database reads and writes against the application database; periodic extracts load the DW.]
Does this have the data you need?

Dangers of DW extract data:
May not have the granularity you need. May already have been aggregated.
May not have desired fields
May not have required scope (org, geo, etc.)
May not include corrections
May not match your needs of time covered
May have been transformed, cleansed, or filtered in some way.
[Diagram: DW and ETL file]

Distinguish between update & correction
[Timeline, Jan. through May: a big first extract and load, then incremental updates, with corrections shown separately.]

Update vs. correction
Update:
  Address was 123 Main St.
  Is now 548 Elm St.
  He moved on April 4 (effective date).
  We learned about it May 25.
  We posted it on June 3 (record change date).
Correction:
  Record showed 519 Fern St.
  Should have been 984 Mills.
  He never lived at 519 Fern Street.
  It was an error.
  It was never true.
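
One way a receiver might keep the two cases apart is sketched below. The record layout, the customer ids, the year 2008, and the posting date of the correction are all assumptions added for illustration; the slide gives only the addresses and the month/day of the update.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AddressChange:
    customer_id: int                 # hypothetical key
    old_address: str
    new_address: str
    record_change_date: date         # when the source posted the change
    effective_date: Optional[date]   # when it became true; None for a correction
    is_correction: bool


def address_as_of(initial_address, changes, as_of):
    """Return the address believed to be true on the date `as_of`."""
    address = initial_address
    # A correction rewrites history: the old value was never true.
    for change in changes:
        if change.is_correction and change.old_address == address:
            address = change.new_address
    # An update applies only from its effective date onward.
    updates = sorted(
        (c for c in changes if not c.is_correction),
        key=lambda c: c.effective_date,
    )
    for change in updates:
        if change.effective_date <= as_of:
            address = change.new_address
    return address


move = AddressChange(1, "123 Main St.", "548 Elm St.",
                     date(2008, 6, 3), date(2008, 4, 4), False)
fix = AddressChange(2, "519 Fern St.", "984 Mills",
                    date(2008, 6, 10), None, True)

print(address_as_of("123 Main St.", [move], date(2008, 3, 1)))
print(address_as_of("123 Main St.", [move], date(2008, 5, 1)))
print(address_as_of("519 Fern St.", [fix], date(2000, 1, 1)))
```

The update leaves the old address valid for dates before April 4. The correction replaces the wrong address for all dates, because it was never true.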

Logical and physical structure of…
  extract (big bulk snapshot)
  update (new, change, delete)
  correction
Each has its own logical data architecture, described by a data model; the three are somewhat similar, but not the same.
Will the physical file transfer format recognize nulls?

Source burden for incremental update
Create a record when any major table experiences a…
  new record
  change in an existing record
  delete (or tag “delete”) of existing record
Change of kernel-stable records generally reflects a business event, and thus should be logged by application.
But is it? Or are all kernel entities so logged?

Is it important for you to know what changed? Why?
Are the major changes to kernel-stable entities important to know?
Yes, they are, if they serve as dimensions.
Discontinuities of dimensions are problematic (an understatement)!

Example of kernel-stable entity changes
[Diagram: Customer, Address in time, Change log]
Is it important to know address of customer for past history?
Does application software maintain address history?
If not, do you need to track such changes (going forward)?
Are such changes being logged by application?
Who wants to know about street address history?
  marketing analysis
  epidemiology studies
  credit rating analysis
  security clearance research

What are volatile fields (attributes)?
Volatile attributes:
  Street address
  Cell phone number
  e-mail address
Stable attributes:
  Spouse
  Children

Documentation burden upon source
Nobody likes writing data documentation
(except, perhaps, some data bigots).
Especially so…
…when incidental to their primary duties.
Especially so…
…long after the system change was made.
Possible solution:
For a discount, offer to send back to them data behavior documentation.
Requires reverse data engineering

Other issues
Be suspicious of tabularizing unstructured data
Often requires coding taxonomies…
… are they sufficiently granular?

Example: coding traffic fatalities
Events to be coded:
  Roll over after skid
  Hit center divider
  Hit bridge abutment
  Drove off a cliff
  Drove into drainage ditch
  Hit a deer
  Tree fell on vehicle
  Collision with parked trailer
  Bicyclist hit tree
Available codes:
  1 Auto-pedestrian
  2 Auto-auto
  3 Auto-fixed object
  4 Auto-railroad
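
A rough way to see whether the taxonomy is granular enough is to try coding each event and count what does not fit. The code assignments below are one reader's guess, not the author's, and several are debatable; that is the point of the exercise.

```python
from collections import Counter

# The four codes from the slide.
CODES = {1: "Auto-pedestrian", 2: "Auto-auto", 3: "Auto-fixed object", 4: "Auto-railroad"}

# ASSUMPTION: illustrative assignments only. None means "no code fits".
assumed_coding = {
    "Roll over after skid": None,
    "Hit center divider": 3,
    "Hit bridge abutment": 3,
    "Drove off a cliff": None,
    "Drove into drainage ditch": None,
    "Hit a deer": None,
    "Tree fell on vehicle": None,
    "Collision with parked trailer": 2,   # or 3? a parked trailer is ambiguous
    "Bicyclist hit tree": None,           # no automobile involved
}


def summarize(coding):
    """Count events per code and list the events no code describes."""
    counts = Counter(coding.values())
    uncodable = [event for event, code in coding.items() if code is None]
    return counts, uncodable


counts, uncodable = summarize(assumed_coding)
for code, label in CODES.items():
    print(f"{code} {label:<18} {counts[code]}")
print("No fitting code:", len(uncodable), "of", len(assumed_coding))
```

Under these assumed assignments most events have no fitting code, and the events that do get a code lose their distinguishing detail once tabularized.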

Kinds of data source organizations
Selling (providing) data is a sideline to their primary business:
  Banks
  Credit card issuers
  Healthcare orgs
  Insurance
  Retailers
  Airlines
  Telephone
Selling data is a major source of revenue:
  Credit bureaus (Equifax, Experian, TransUnion)
  Marketing companies (D&B, DMA)
  Suppliers of maps, imagery
  News orgs (UPI)
  Knowledge sellers (LexisNexis)
Sharing data is a cultural value, not for revenue:
  Government agencies
  Academic research
  NGOs

Physical form and media
Introduction
Spelling out the Relationship
Data & information
Universe of knowledge
Data coming from bureaucracies
Asking for the right data
Potential data providers
Physical forms and media
Logical data architecture
Semantics & meaning
Documentation & metadata
Scope & completeness
Fund. of data quality
Update & refresh issues
Data collection bias
Ownership & legal
Confidentiality
Data flow surveillance
Conclusion

Key questions:
Is the data being supplied on media which you can read with your technology?
Is a special program or database management system required to read it?
Is the documentation supplied in a manner which you can read and copy?
Is the data supplied in bulk, or incrementally, or even one transaction at a time?
Are there any compression techniques used on all or certain types of data in the file?

Structured vs. unstructured
Structured:
  Anything in a…
    spreadsheet
    DBMS
    file with defined fields
Unstructured but indexed:
  Automatically managed…
    documents (document mgmt systems)
    medical records
    medical imaging
    satellite imagery
    sound and video
  Encyclopedia
Unstructured, NOT indexed:
  Memoirs
  Personal letters
  Literature
  Meeting minutes
  Blogs
  Pictures of my vacation
Library books are catalogued as a whole, but not in part.

Search engines and indexing
The internet is a medium, not a source and certainly not an “authoritative source”.
Each web site probably has an agenda and bias.
Search engines find text, not meaning.
Web sites can mask tabular data from search engines.
Search engines may not see some academic sources (peer-reviewed journals, etc.) because of cost of access.

Physical media for structured data
Physically moved media:
  Punched cards
  Half-inch mag tape
  IBM tape cartridges
  8-inch floppy disk
  5-inch floppy disk
  3-1/2 inch floppy disk
  other cassettes or cartridges
  CD-ROM
  DVD
  …paper (yikes!)
Data moved virtually:
  Electronic files
  Messages (transactions)
Physical formats:
  Full database (req. DBMS)
  Flat file (positional)
    single-format
    multiple format w/ rec type
  Char-delimited file
  MS/Excel (or MS/Word)
  XML
  zip file
  other
SOM chaos

Details in flat files
Two record types! Each record carries a business key and a record type code.

Business key  Record type code
1             A
1             B
2             A
2             B
2             B
2             B
3             A
3             B
4             A
5             A
5             B
6             A
6             B
6             B
6             B
6             B

[Record layouts shown for rec type A and rec type B.]

Details in flat files
Two record types! Each record carries a business key and a record type code.

Business key  Record type code
1             A
2             A
3             A
4             A
5             A
1             B
2             B
2             B
2             B
3             B
3             B
4             B
5             B
5             B
5             B

Note: variable-length records
Note: Children not grouped with parents.
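
Because every record here carries the business key, children can be regrouped with their parents after the fact. A sketch, assuming type A is the parent record and type B the child (the slide implies this but does not state it):

```python
from collections import defaultdict

# (business key, record type code) pairs in the order shown on the slide.
records = [
    (1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "A"),
    (1, "B"), (2, "B"), (2, "B"), (2, "B"), (3, "B"),
    (3, "B"), (4, "B"), (5, "B"), (5, "B"), (5, "B"),
]


def group_by_key(records):
    """Index parent (A) and child (B) record positions by business key."""
    parents = {}
    children = defaultdict(list)
    for position, (key, rec_type) in enumerate(records):
        if rec_type == "A":
            parents[key] = position
        else:
            children[key].append(position)
    # Children whose key matches no parent deserve a question to the provider.
    orphans = sorted(set(children) - set(parents))
    return parents, dict(children), orphans


parents, children, orphans = group_by_key(records)
for key in sorted(parents):
    print(key, "children at positions", children.get(key, []))
print("orphans:", orphans)
```

Physical order does not matter in this layout, since the key on each record is enough to reassemble the family.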

Worse scenario of flat file
Two record types!
Business key, but found only in rec. type A; record type code on every record.
Important! Record sequence has vital significance!

A 1
B
A 2
B
B
B
A 3
B
A 4
A 5
B
A 6
B
B
B
B

Bad technique!
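
Reading such a file means carrying the last-seen key forward onto each type B record. The sketch below uses the record sequence from the slide; the function name is illustrative.

```python
# Record sequence from the slide: type A lines carry the key, type B lines do not.
lines = ["A 1", "B", "A 2", "B", "B", "B", "A 3", "B",
         "A 4", "A 5", "B", "A 6", "B", "B", "B", "B"]


def attach_keys(lines):
    """Give every record the key of the most recent type A record."""
    current_key = None
    keyed = []
    for line in lines:
        parts = line.split()
        if parts[0] == "A":
            current_key = int(parts[1])
        elif current_key is None:
            raise ValueError("type B record before any type A record")
        keyed.append((current_key, parts[0]))
    return keyed


keyed = attach_keys(lines)
print(keyed.count((2, "B")), "children for key 2")

# Sorting (or any reordering) silently destroys the parent-child link:
resorted = attach_keys(sorted(lines))
print(resorted.count((6, "B")), "children wrongly attached to key 6")
```

After a sort, all ten type B records attach to key 6 and no error is raised. That is why the technique is dangerous.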

Mixed media
Data in RDBMS
Metadata in XML
Whole package you are provided

Typical elements of a Geodatabase
relationship class
domain
Table 1 Table 2
Feature class
Topology(rules)
Raster dataset(s)
Metadata in XML
Geometric network

XML?
XML labels data items.
“Self-documenting” means labeling, but not full, rich documentation of business meaning.
It does not describe attributes or entities (fields, or tables) from a business perspective.
XML takes more space, often much more. (The opposite of data compression?)
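
The space overhead is easy to see by writing the same rows both ways. A sketch using Python's standard `csv` and `xml.etree.ElementTree` modules; the element names `Rows` and `Row` are arbitrary choices, and the three rows are taken from the prod./cust summary shown earlier in the deck.

```python
import csv
import io
import xml.etree.ElementTree as ET

rows = [
    {"Customer": "4778", "Product": "1004", "Revenue": "453.72"},
    {"Customer": "4778", "Product": "1006", "Revenue": "985.66"},
    {"Customer": "4778", "Product": "1007", "Revenue": "1512.46"},
]


def as_csv(rows):
    """Serialize rows as a char-delimited file with one header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def as_xml(rows):
    """Serialize rows as XML, labeling every value with its field name."""
    root = ET.Element("Rows")
    for row in rows:
        rec = ET.SubElement(root, "Row")
        for field, value in row.items():
            ET.SubElement(rec, field).text = value
    return ET.tostring(root, encoding="unicode")


csv_text, xml_text = as_csv(rows), as_xml(rows)
print(len(csv_text), "characters as CSV")
print(len(xml_text), "characters as XML")
```

The tags repeat the field names on every row. That makes the file self-labeling and larger, but it still says nothing about what "Revenue" means to the business.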

The opposite of XML: data compression!
What compression techniques, if any, might the source use when sending you the data?
Can you read it or unpack it?
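
When the provider does not say how a file was packed, the first few bytes usually reveal it. A small sketch that checks three well-known signatures (gzip, zip, bzip2); other formats exist, and the function name is hypothetical.

```python
import bz2
import gzip

# Leading "magic" bytes of three common compressed formats.
MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
    (b"BZh", "bzip2"),
]


def sniff_compression(first_bytes):
    """Return the name of a recognized compression format, or None."""
    for prefix, name in MAGIC_PREFIXES:
        if first_bytes.startswith(prefix):
            return name
    return None


sample = b"Customer,Product,Revenue\n4778,1004,453.72\n"
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(bz2.compress(sample)))
print(sniff_compression(sample))
```

A result of None only means no known signature matched; the file may be plain text or use a format not listed here.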

Logical data architecture
Introduction
Spelling out the Relationship
Data & information
Universe of knowledge
Data coming from bureaucracies
Asking for the right data
Potential data providers
Physical forms and media
Logical data architecture
Semantics & meaning
Documentation & metadata
Scope & completeness
Fund. of data quality
Update & refresh issues
Data collection bias
Ownership & legal
Confidentiality
Data flow surveillance
Conclusion

Key questions:
What kind of things in the real world are described by the dataset?
How many kinds of tables or records are contained?
What are the cardinality rules between them?
Are the described instances in the real world mutually exclusive?
Are there format standards (industry or discipline) for this kind of data?
Does this data conform to those format standards?

Key questions: (cont.)
What is the meaning of each record? What “thing” in reality does it represent?
What is the business meaning of each field?
Are any fields employed for more than one purpose?
Is the value or meaning of any field contingent upon the value in another?
What coding conventions are employed?
How are names and addresses structured?
Are you going to be integrating this source with other data?

Ambiguity of terms!
[Diagram: “System” A and “System” B joined by a line labeled with competing terms:]
“interface”
“bridge”
“connect”
“data access”
“migrate data”
“flow”
“link”

Inherently ambiguous terms about “link”
interface
bridge
“integrate with”
support
connect connector interconnect
exchange data
migrate data
publish
provide access
exchange data
All have in common:
data movement
What kind of data? What fields? What architecture? What causes data to move?

Linking organizations together: “Ha!”
Agency “A” and Agency “B” each have the same stack:
  Business
  Application software
  Data / application database
  Op sys
  Infrastructure
Levels at which they must match:
  Physical communication: landline, WiFi, mobile, etc.
  Protocol compatibility: XML, etc.
  Semantic compatibility: logical data arch.
  Business: architecturally different
A business has an architecture!

Linking organizations together: “Ha!”
[Diagram: Agency “A” and Agency “B”, each with a business and a data / application database layer. The databases must match on semantic compatibility (logical data arch.); the businesses are architecturally different.]
Semantic compatibility:
  Presence of data elements
  Field format compatibility
  Definitional consistency
  Keys don’t clash (homonyms, non-reuse, etc.)
  Subject entities have similar life cycles
These are subtle, abstract concepts, not understood by executives or hardware people.

Two levels of architecture matching
[Diagram: Agency “A” and Agency “B”, each with a business and a data / application database layer; the businesses are architecturally different.]
The logical data architectures must match at two levels:
  Semantics & meaning
  Structural architecture

Semantics and meaning (field level)
Two fields (in two environments) can have the same name and the same format, but a different domain.
[Diagram: Source-A, Source-B]
Semantics and meaning (table level)
Two tables (in two environments) can have the same name and the same format (and column list), but a different scope or entity meaning.
Source-A: Customer orders (unfilled)
Source-B: Customer orders (month total)
Same format, different scope.
What do we mean by “link”?
Replicate data instantly (at time of transaction), between the application databases of Appl. 1 and Appl. 2.
Reposit data into an ODS (at time of transaction), from the application databases of Appl. 1 and Appl. 2.
Reposit data into a data warehouse (periodic, in batch), from the application databases of Appl. 1 and Appl. 2.
Instant, transactional replication
Appl. 1 (database 1) and Appl. 2 (database 2) each have an API; exchange services sit between them.
Are the architectures compatible?
Probably not!
Semantic integration means bringing the data together so it makes sense.
Total logical data architecture level
Presence or absence of entities / tables
Cardinalities
Table (subject entity) level
Definitions are the same
Field lists are the same
Column (field) level
Formats are the same
Business definitions are the same
Domains & meanings are the same
At each level: does A = B (for example, Cust-A = Cust-B)?
Data integration involves matching “things” from multiple sources
Instance level:
Person
Store
Address
Vehicle
Neighborhood
Event (or episode)
Dimension level:
Time period
Brand or product
Market
Category (“type”, “class”)
Geography
Other grouping
Benefits from “singular” characteristic of entity
Problematic matching between sources
New York media market
Northern New Jersey sales zone
Long Island Sales Zone
Central NJ sales zone
Metro NY sales zone
Name & address formats
Are you going to do name and/or address matching?
Many causes of non-matches.
“They will have the name and addresses in the record.”
“Oh, that’s fine.”
Address formats -- parsing
Source format … to be matched to … target format:

Unparsed layout:
05 CUST_NAME    PIC X(30).
05 ADDRESS_1    PIC X(40).

Customer Name     Address 1
Charles Shepard   563-A Pine Street
Susan Elkart      78 Mills Ln, Apt. C
Evelyn Barnard    587 Canal St.
Franklin Turing   798 Wisconsin Ave.

Parsed layout:
05 FIRST_NAME   PIC X(20).
05 LAST_NAME    PIC X(25).
05 MIDDLE_INIT  PIC X(01).
05 STR_NUMBER   PIC X(10).
05 STREET_NAME  PIC X(30).
05 APT_NO       PIC X(10).

First Name  Last name  M.I.  Number  Street          Apt
Charles     Shepard    A     563-A   Pine Street
Susan       Elkart     G     78      Mills Lane      C
Evelyn      Barnard    R     587     Canal St.
Franklin    Turing     S     798     Wisconsin Ave.
Address formats – parsing (2)
Source format:
05 CUST_NAME    PIC X(30).
05 ADDRESS_1    PIC X(40).

Customer Name     Address 1
Charles Shepard   563-A Pine Street
Susan Elkart      78 Mills Ln, Apt. C
Evelyn Barnard    587 Canal St.
Franklin Turing   798 Wisconsin Ave.

…to be matched to…
Target format:
05 CUST_NAME    PIC X(30).
05 ADDRESS_1    PIC X(40).

Customer Name     Address 1
Shepard, Charles  563-A Pine Street
Elkart, Susan     78 Mills Ln, Apt. C
Barnard, Evelyn   587 Canal St.
Turing, Franklin  798 Wisconsin Ave.

Are these going to match?
Address formats – parsing (3)
Source format:
05 CUST_NAME    PIC X(30).
05 ADDRESS_1    PIC X(40).

Customer Name     Address 1
Charles Shepard   563-A Pine Street
Susan Elkart      78 Mills Ln, Apt. C
Evelyn Barnard    587 Canal St.
Franklin Turing   798 Wisconsin Ave.

…to be matched to…
Target format:
05 CUST_NAME    PIC X(30).
05 ADDRESS_1    PIC X(40).

Are these going to match?
Ironically…
The meaning is the same, but the data is different!
Address formats – parsing (3)
Which of these will match in native SQL?

Source                  Target
563 A Pine St.          563-A Pine St.
587 Canal Street        587 Canal St.
798 Wisconsin Ave.      798 Wisconsin
781 Mills Lane          781 Mills Ln.
418 Elm St. Apt. C      418 Elm, Apt. C
21 Valley Forge Ave.    21 ValleyForge Ave.

None.
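A minimal sketch of why exact comparison fails on the address pairs above, and how far a naive clean-up gets. The `normalize` function and its tiny abbreviation table are illustrative assumptions, not a real address-standardization method.

```python
import re

# Source/target address pairs copied from the slide.
PAIRS = [
    ("563 A Pine St.", "563-A Pine St."),
    ("587 Canal Street", "587 Canal St."),
    ("798 Wisconsin Ave.", "798 Wisconsin"),
    ("781 Mills Lane", "781 Mills Ln."),
    ("418 Elm St. Apt. C", "418 Elm, Apt. C"),
    ("21 Valley Forge Ave.", "21 ValleyForge Ave."),
]

# Hypothetical, deliberately tiny abbreviation table (an assumption).
ABBREVIATIONS = {"street": "st", "lane": "ln", "avenue": "ave"}


def normalize(address: str) -> str:
    """Lowercase, drop hyphens and punctuation, abbreviate street words."""
    text = address.lower().replace("-", " ")
    text = re.sub(r"[.,]", "", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


# Exact equality, which is what a plain SQL join on the field would test.
exact = [s == t for s, t in PAIRS]
# Equality after the naive clean-up.
normalized = [normalize(s) == normalize(t) for s, t in PAIRS]

print("exact matches:", sum(exact))
print("matches after normalizing:", sum(normalized))
```

Even with clean-up, pairs with a missing word or a missing space still fail, which is why name and address matching usually needs more than string rules.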
Conclusion on data architecture:
Even if you have an exact physical format match, source to target, in field names and field format…
The contents may not match.
And the meaning may not match.
Semantics and meaning
Key questions:
Are the languages of text fields, and the character set, appropriate to your needs?
Are numeric fields in the units of measure which you expect?
How is the “null” condition symbolized in each field?
Is it clear what the business meaning of the null condition is?
What fields need to be translated into your desired coding domain?
Does the meaning of any field (or elements of its domain) change over time or over any other scope dimension?
Potential coding variations

State                 FIPS  Abbr
Alabama               01    AL
Alaska                02    AK
Arizona               04    AZ
Arkansas              05    AR
California            06    CA
Colorado              08    CO
Connecticut           09    CT
Delaware              10    DE
District of Columbia  11    DC
Florida               12    FL
Georgia               13    GA
Hawaii                15    HI
Thin documentation can be misleading
“Address”
Current address?
Current address for mailing purposes, but not for billing purposes.
Current address for delivery purposes, but not for mailing or billing.
“Null”
Though the “null” value may be stored in the original database, will it be transferred effectively through the ETL process?
There is also the question: “Why is it null?”
That answer can be another kind of metadata.
1. Not applicable
2. Declined to state
3. Will be supplied later
Testing for semantic discontinuities
Fields may change meaning over time (or other dimensions)
Codes may change meaning over time
Every code is potentially volatile over time.
Invoice type
Account type
Customer number
Sales division
Stable codes tend to be OUTSIDE the organization…
…e.g. standard govt codes.
Are Domains Stable Over Time?
Customer File: Invoice Type Code

INVOICE_TYPE_CODE XTAB AGAINST MONTH
      MONTH
        01    02    03    04    05    06    07
----------------------------------------------
AA      87    91    96    78    88    92    97
BB     142   148   153   162   149   167   173
CC     197   204   211   225     0     0     0
DD      45    48    51    47    46    48    49
EE      77    76    81    79    84    82    79
F1       4     3     8     5     9     7    11
F2       9     8     4     7    12     9     8
----------------------------------------------

Type “CC” not consistently used over time.
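The same check can be automated. A small sketch, using the counts from the cross-tab above, that flags any code used in some months but absent in others; the function name `intermittent_codes` is an illustrative assumption.

```python
MONTHS = ["01", "02", "03", "04", "05", "06", "07"]

# Record counts per invoice type code, by month, copied from the slide.
COUNTS = {
    "AA": [87, 91, 96, 78, 88, 92, 97],
    "BB": [142, 148, 153, 162, 149, 167, 173],
    "CC": [197, 204, 211, 225, 0, 0, 0],
    "DD": [45, 48, 51, 47, 46, 48, 49],
    "EE": [77, 76, 81, 79, 84, 82, 79],
    "F1": [4, 3, 8, 5, 9, 7, 11],
    "F2": [9, 8, 4, 7, 12, 9, 8],
}


def intermittent_codes(counts, months):
    """Return {code: [months with zero records]} for codes that are
    present in some months and missing in others."""
    flagged = {}
    for code, row in counts.items():
        empty = [month for month, n in zip(months, row) if n == 0]
        if empty and len(empty) < len(row):
            flagged[code] = empty
    return flagged


flagged = intermittent_codes(COUNTS, MONTHS)
print(flagged)
```

A flagged code is only a question to take back to the source: was the code retired, merged into another, or simply not extracted?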
Are the codes consistent over time?
[Map of customers: Cust. 41, 8, 21, 11, 5, 6, 28, 24, 29, 19, 16, 3, 7.]
Customers are grouped into regions.
[Map: the same customers, grouped into Region 1, Region 2, and Region 3.]
Regions get “redefined” - “realigned”.
[Map: the same customers, with Region 1, Region 2, and Region 3 redrawn.]
If this happens…
…will the source tell you?
Can you detect it on your own?
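One way to detect a realignment on your own is to compare the customer-to-region assignment in successive deliveries. A minimal sketch; the customer numbers echo the slides, but the region assignments below are invented for illustration.

```python
def region_changes(before, after):
    """Return {customer: (old_region, new_region)} for every customer
    present in both deliveries whose region code differs."""
    shared = sorted(before.keys() & after.keys())
    return {c: (before[c], after[c]) for c in shared if before[c] != after[c]}


# Hypothetical customer -> region assignments from two monthly files.
last_month = {41: 1, 8: 1, 21: 1, 11: 2, 5: 2, 6: 2, 28: 3, 24: 3}
this_month = {41: 1, 8: 1, 21: 2, 11: 2, 5: 2, 6: 3, 28: 3, 24: 3}

moved = region_changes(last_month, this_month)
print(moved)
```

A handful of moves may be ordinary corrections; many moves in one delivery suggest the regions themselves were redefined.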
Documentation & metadata
Key questions:
Has format and meaning documentation been provided prior to your decision to acquire the data?
Is the documentation current?
Can you get sample data to test against?
Is the documentation thorough and in sufficient detail?
Does the documentation include data quality standards?
Documentation topics
Format and structure
Meaning of fields and segments
Language & units of measure
Entity life cycle and extract filters
Scope
Vintage (date ranges)
Projections (GIS)
Reference (GIS)
Function of program code
Function of job parameters
Traditional “normative” data documentation covers only this.
Format alone does not describe data
Batch job step INV04G: Format-A, Format-B
Same scope and timing, but different format.
Batch job step INV21K: California, Arizona, Utah, N.M.
Same format, but different scopes.
Data documentation must be more than format … much more.
Tangible data file: format(s), content, and meaning
Metadata: “data about data”, “information about data”, “information about information”
Many kinds of metadata!
Industry and cultural contexts.
The word “metadata” is inherently ambiguous.
Technical vs. business metadata

Technical: well-structured, machine-readable.
01 CUSTOMER_MASTER.
   05 CUST_NUM   PIC X(08).
   05 CUST_NAME  PIC X(30).
   05 ADDRESS_1  PIC X(30).
   05 ADDRESS_2  PIC X(30).
   05 CITY       PIC X(25).
   05 STATE      PIC X(02).

Business: unstructured, meaningful only to a human.
“Customers in this file include…
current active customers
prospective customers
dormant customers
recipients of samples
Other subtypes include:
industrial vs. retail
domestic vs. international
broker vs. direct
platinum vs. regular”
Normative vs. dynamic metadata
If the file is being updated, then source-ID and quality are NOT characteristics of the entire table.
Source-A
Source-B
Observed in 1985
Observed in 2003
Low quality
High quality
This has nothing to do with structural metadata.
Record-level metadata

ID      Name         Street Addr     City / St     Source   Updt Dt
489735  John Smith   971 Pine Drive  Portland, ME  CA DMV   8/2/1997
489735  Mary Allard  6174 Huron St.  Albany, NY    NY DMV   4/13/2003
489735  Ty Kobb      572 Ottawa      Boston, MA    US Army  4/14/2003

Name through City / St: all non-key data acquired as a single unit
Source: source of all info in this record
Updt Dt: when record created or updated
Source and update date for the whole record (all fields)
Imbedded metadata – cell level
Credit bureau record on person

Person ID  Name         SSN          SSN src   SSN updt   DOB        DOB src  DOB updt
489735     John Smith   587-98-1473  US Army   4/15/2001  4/3/1952   CA DMV   8/2/1997
489735     Mary Allard  589-88-8891  CitiBank  2/2/1997   3/9/1972   NY DMV   4/13/2003
489735     Ty Kobb      433-52-8743  Chase 57  6/2/2004   4/15/1978  US Army  4/14/2003

SSN, DOB: fact
SSN src, DOB src: where we got the fact
SSN updt, DOB updt: when we got the fact
These 3 data elements belong together.
Some facts are acquired individually, unrelated to peer cells in a record.
Where is metadata stored?
Central Metadata Repository
Complex dataset
Metadata in XML
Metadata in 3-ring binder
Classic data management problem:
Two copies of knowledge, no rigorous enforcement of refresh and update.
scattered
copy
copy
Documentation standards
Easy to establish, sometimes reluctant to fulfill.
Letter but not the spirit of documentation.
Nobody wants to write documentation.
INVOICE_AMT DECIMAL (11.2) Def. Total amount of the invoice.
Tautological: The use of redundant language
"If you don't get any better, you'll never improve" --Yogi Berra
INVOICE_AMT DECIMAL (11.2) Def. This data element contains the total invoice amount.
Documentation standards
Easy to establish, sometimes reluctant to fulfill.
Letter but not the spirit of documentation.
Nobody wants to write documentation.
INVOICE_AMT DECIMAL (11.2)
Def. The total amount to be paid on a regular invoice to the customer; equals the sum of all extended costs of line items net of discounts. Also includes special charges unrelated to specific products. Always in U.S. dollars.
On invoice reversals, this field is normally negative. On credit memos, this field is normally negative.
Good metadata discusses the anomalies!
Scope & completeness
Key questions: Scope
Are you getting all the attributes (fields, columns, data elements) which you expect?
Are you getting other attributes you didn’t ask for?
Are you getting all the records you expected?
Are you getting any records outside of your scope of request or interest?
For each field, is the column populated as completely as is appropriate?
Are you getting the data you expect to get?
Scope
Geography
Time
Range of customers by name, account, etc.
Is there any way your source might have truncated your input data?
Kinds of scope
Scope in time
Scope in geography
Organizational scope
Types or subtypes of major entities
Entity life cycle and duplication
Time scope
Tally data over time

TABLE FILE GG1
SUM RECCNT
ACROSS MONTH
BY YEAR
END
-RUN

        YEAR
MONTH   1998  1999  2000  2001
------------------------------
01       492   742   711   841
02       512   701   782   812
03       588   689   733   845
04       522     0   746   829
05       581   618   697   792
06       566   682   709   841
07       599   623   728   823
08       492   593   692   784
09       509   608   717   824
10       527   631   729   781
11       488   597   744   807
12       611   845   714   892
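A tally like the one above can also be scanned for gaps automatically. A minimal sketch, assuming each record carries a date; the function names and the sample dates are illustrative, and the sketch assumes at least one record.

```python
from collections import Counter
from datetime import date


def tally_by_month(dates):
    """Count records per (year, month)."""
    return Counter((d.year, d.month) for d in dates)


def missing_months(tally):
    """List the (year, month) pairs with no records between the first
    and last month observed. Assumes the tally is not empty."""
    (year, month), last = min(tally), max(tally)
    gaps = []
    while (year, month) <= last:
        if tally[(year, month)] == 0:  # Counter returns 0 for absent keys
            gaps.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps


# Hypothetical record dates: nothing at all for April 1999.
sample_dates = [date(1999, 3, 2), date(1999, 3, 30), date(1999, 5, 11)]
tally = tally_by_month(sample_dates)
gaps = missing_months(tally)
print(gaps)
```

An empty month in the middle of a range, like April 1999 in the table, is worth a question to the source before the data is loaded.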
Time scope (cont.)
[Chart: records by month, Jan-06 through Dec-07; vertical axis 0 to 600.]
Probably a discontinuity in definition, inclusion criteria, or scope.
Cropping
Count records by month
[Chart: orders by month in the Order History Table, January through December and on through July of the next year; vertical axis 0 to 1,200.]
A purge process exists, but some records had to remain (still outstanding dispute).
You may get more records than you expect!
Kinds of scope: Geography
[Map: Adair, Baker, Evans, Girard, Caswell, Duke, and Johnson counties, with the dataset coverage marked.]
Kinds of scope: Organizational
All the divisions, or just some?
All the sales, or just sales by employee sales reps (thus excluding broker-negotiated sales)?
Domestic activity only, or including international?
Gargantuan Industries, Inc.: Mining & minerals; Toys; Defense & weapons; Health & beauty aids
Kinds of scope: Types and subtypes
Vehicle file includes…
Owned vehicles but not leased vehicles
Cars but not utility trucks
Dataset of employees
Full-time but not part-time
Current but not former employees
Volunteers?
Subtypes of a hospital employee entity
Employee [Emp Num]
Subtypes: Candidates, Active empl., Former empl., Doctors, Contract empl.
Are subtypes mutually exclusive?
Are some data fields present for some, but not all subtypes?
Subtypes of a hospital employee entity
Employee [Emp Num]
Subtypes: Candidates, Active empl., Former empl., Doctors, Contract empl.
Further subtypes: Perm. full-time, Temporary
Subtypes can have subtypes
Kinds of scope: Entity life cycle & duplication
Big issue: mutual exclusivity of records, vs. duplication
Can the same instance be represented by multiple records…
…possibly in multiple stages of its life cycle?
Are all records logical peers to each other?
Students at a university
Student [ID Num]
Subtypes: Freshman, Sophomore, Junior, Senior, Graduate
In reality (business policy), mutually exclusive?
In a file of students, are you getting only one record per student? ….
Or, one record per student-year?
Students at a university
Student [ID Num]
Subtypes: Freshman, Sophomore, Junior, Senior, Graduate
This gets us back to architecture.
What distinct subject entity does a record represent?
Name change between academic years?
Other fragments of scope
File of prospective customers from outside source

MOST COMMON VALUES OF LAST_NAME
-------------------------------
BROWN      12,943
DAVIS       9,542
ANDERSON    7,227
CLARK       5,344
ALLEN       4,715
CAMPBELL    4,014
ADAMS       3,800
BAKER       3,635
EVANS       3,271
COLLINS     3,180
CARTER      3,143
EDWARDS     3,129
COOK        2,772
COOPER      2,646

What’s wrong with this picture?
Look at distribution of first character of text fields!
Name
Address
City
Comments
Need query tool which can create new variables (fields) based on mask.
Distribution of first character, Last Name field, purchased input file.

A   8,240    N      86
B  31,210    O      47
C  17,221    P      24
D  10,929    Q      13
E   4,507    R      14
F   8,081    S       4
G      77    T      23
H      71    U      21
I       8    V      13
J      63    W       2
K      36    X       1
L      94    Y       5
M      82    Z       7
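Where no query tool can build the masked field, a few lines of code can. A sketch, with invented function names and an invented six-name sample, that tallies first characters and measures how much of a file falls in a given letter range.

```python
from collections import Counter


def first_char_distribution(values):
    """Count records by the first character of a text field,
    ignoring blank values and letter case."""
    return Counter(v.strip()[0].upper() for v in values if v and v.strip())


def share_in_range(distribution, letters):
    """Fraction of all counted records whose first character is in `letters`."""
    total = sum(distribution.values())
    if total == 0:
        return 0.0
    return sum(distribution[c] for c in letters) / total


# Hypothetical sample of last names.
names = ["Brown", "Davis", "Anderson", "clark", "Allen", "Smith"]
dist = first_char_distribution(names)
early_share = share_in_range(dist, "ABCDEF")
print(dist.most_common(), round(early_share, 3))
```

A share close to 1.0 for A through F, as in the purchased file above, suggests the file was cut off partway through an alphabetical sort.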
Reasonable surname (1st character) distribution in American society.

A   8,240    N   4,486
B  31,210    O   3,347
C  17,221    P  11,724
D  10,929    Q     513
E   4,507    R  12,864
F   8,081    S  23,604
G  11,977    T   8,623
H  17,171    U     571
I   1,008    V   3,453
J   7,163    W  14,302
K   8,636    X      36
L  11,094    Y   1,435
M  21,682    Z   1,187
Accidental truncation of data
What ways can your source truncate your data?
Name
Organizational (e.g. forgot broker sales)
Life cycle (e.g. forgot former employees)
Time (clipped in creation date range, but not ship date)
Are you getting MORE data than you wanted?
Test records
Beyond original scope
Detecting duplicate data
Domain each key (if it is a truly unique key)

SELECT CUST_KEY, COUNT(*) AS REC_COUNT
FROM CUST_MAST
GROUP BY CUST_KEY
ORDER BY REC_COUNT DESC;

Desired results:
CUST_KEY  REC_COUNT
-------------------
004001    1
004002    1
004003    1
004004    1
004005    1

Duplicate keys:
CUST_KEY  REC_COUNT
-------------------
004127    5
004039    4
004113    4
004834    3
004225    3
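The key-domain query can be tried end to end against an in-memory SQLite table. A sketch only: the table and column names follow the slide, the rows are invented, and the `HAVING` clause is an addition that limits the output to keys that occur more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUST_MAST (CUST_KEY TEXT, CUST_NAME TEXT)")

# Hypothetical rows: two keys are duplicated, two are unique.
conn.executemany(
    "INSERT INTO CUST_MAST VALUES (?, ?)",
    [
        ("004001", "A"),
        ("004002", "B"),
        ("004127", "C"),
        ("004127", "C"),
        ("004127", "D"),
        ("004039", "E"),
        ("004039", "E"),
    ],
)

# Count records per key and keep only keys that appear more than once.
duplicates = conn.execute(
    """
    SELECT CUST_KEY, COUNT(*) AS REC_COUNT
    FROM CUST_MAST
    GROUP BY CUST_KEY
    HAVING COUNT(*) > 1
    ORDER BY REC_COUNT DESC, CUST_KEY
    """
).fetchall()
print(duplicates)
```

An empty result is the desired outcome for a key that is supposed to be unique.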
Detecting duplicate data (cont.)
Test for duplicate data, with non-dup keys

Entire record duplicated:
Cust.No.  Cust name       Addr          City       State  ZIP    DOB
4127      Tony Martinez   77 River St.  Phoenix    AZ     87114  8/4/1952
4127      Tony Martinez   77 River St.  Phoenix    AZ     87114  8/4/1952

Different key, same person:
Cust.No.  Cust name       Addr          City       State  ZIP    DOB
5793      Angela Connors  29 High St    Flagstaff  AZ     87114  8/4/1952
6778      Angela Connors  29 High St    Flagstaff  AZ     87114  8/4/1952
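The second case, one person under two keys, is invisible to a key count. A sketch that groups records on every field except the key; the field names are assumptions, and real matching would also need to tolerate spelling and format differences.

```python
from collections import defaultdict


def same_person_different_key(records, key_field="cust_no"):
    """Group records on all non-key fields and return the lists of
    distinct keys that share identical non-key data."""
    groups = defaultdict(set)
    for rec in records:
        identity = tuple(sorted((k, v) for k, v in rec.items() if k != key_field))
        groups[identity].add(rec[key_field])
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Rows from the slide, with hypothetical field names.
tony = {"name": "Tony Martinez", "addr": "77 River St.", "city": "Phoenix",
        "state": "AZ", "zip": "87114", "dob": "8/4/1952"}
angela = {"name": "Angela Connors", "addr": "29 High St", "city": "Flagstaff",
          "state": "AZ", "zip": "87114", "dob": "8/4/1952"}
records = [
    {"cust_no": 4127, **tony},
    {"cust_no": 4127, **tony},
    {"cust_no": 5793, **angela},
    {"cust_no": 6778, **angela},
]

suspects = same_person_different_key(records)
print(suspects)
```

The fully duplicated Tony Martinez record is not reported here because both copies carry the same key; the key count catches that case instead.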
Incremental updates
Customer Order Table
January activity: 4 records

Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
10001    1234    1/5/2006   1/15/2006    1489.14    1/5/2006
10002    1343    1/8/2006   1/18/2006     874.82    1/8/2006
10003    1344    1/15/2006  1/25/2006    1378.25    1/15/2006
10004    1580    1/28/2006  2/8/2006     1184.82    1/28/2006

Flow: source application (Customer Order), extract, update file, translate & load, data warehouse.
This is typical for extract to a data warehouse.
Incremental updates
Customer Order Table
January activity: 4 records

Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
10001    1234    1/5/2006   1/15/2006    1489.14    1/5/2006
10002    1343    1/8/2006   1/18/2006     874.82    1/8/2006
10003    1344    1/15/2006  1/25/2006    1378.25    1/15/2006
10004    1580    1/28/2006  2/8/2006     1287.01    2/4/2006

February activity: 4 records

10005    1344    2/4/2006   2/13/2006    1489.14    2/4/2006
10006    1580    2/7/2006   2/12/2006     874.82    2/7/2006
10007    1234    2/16/2006  2/26/2006    1378.25    2/16/2006
10008    1343    2/27/2006  3/7/2006     1184.82    2/27/2006

A change had been made to order 10004. Posted Feb. 4, AFTER the incremental extract for January data.
Incremental updates
Customer Order Table
February change file: how it should look.

Flag  Order #  Cust #  Order Dt   Delivery Dt  Total chg  Update dt
U     10004    1580    1/28/2006  2/8/2006     1287.01    2/4/2006
N     10005    1344    2/4/2006   2/13/2006    1489.14    2/4/2006
N     10006    1580    2/7/2006   2/12/2006     874.82    2/7/2006
N     10007    1234    2/16/2006  2/26/2006    1378.25    2/16/2006
N     10008    1343    2/27/2006  3/7/2006     1184.82    2/27/2006

“U” = Update, “N” = New
Dilemma: How far back into history must you look to be sure you have all the changes posted in February?
This places a burden on the source system!
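A sketch of how a source system might build the February change file: select on update date rather than order date, and flag each row. The rule used here (a row is “N” when the order was created inside the window, otherwise “U”) is an assumption for illustration, as are the names.

```python
from datetime import date

# (order number, order date, update date), taken from the slides.
ORDERS = [
    (10001, date(2006, 1, 5), date(2006, 1, 5)),
    (10002, date(2006, 1, 8), date(2006, 1, 8)),
    (10003, date(2006, 1, 15), date(2006, 1, 15)),
    (10004, date(2006, 1, 28), date(2006, 2, 4)),  # changed after January extract
    (10005, date(2006, 2, 4), date(2006, 2, 4)),
    (10006, date(2006, 2, 7), date(2006, 2, 7)),
    (10007, date(2006, 2, 16), date(2006, 2, 16)),
    (10008, date(2006, 2, 27), date(2006, 2, 27)),
]


def change_file(orders, start, end):
    """Select every order whose update date falls in [start, end].
    Assumed flag rule: "N" if the order was created in the window,
    "U" if it was created earlier and changed in the window."""
    rows = []
    for order_no, order_dt, update_dt in orders:
        if start <= update_dt <= end:
            flag = "N" if order_dt >= start else "U"
            rows.append((flag, order_no))
    return rows


february = change_file(ORDERS, date(2006, 2, 1), date(2006, 2, 28))
print(february)
```

Selecting on order date alone would have missed the change to order 10004.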
Overlap of date ranges of update files
Timeline: Jan., Feb., Mar., Apr., May
Update files: Jan., February, March, April, May
Each update file reaches 14 days back into the previous month.
Scope: Are all records peers to each other?
Can you have detail and summary records co-existing?

Census  Block
Tract   Group  Block  Tot Pop  White  Black  Latino
471     15     ALL    1,804    1,593  176    35      (aggregate record)
471     15     1      16       11     3      2       (detail records)
471     15     2      10       8      1      1
471     15     3      421      377    29     15
471     15     4      381      329    47     5
471     15     5      557      519    38     0
471     15     6      419      349    58     12
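If aggregate and detail records share a file, a naive total double counts. A small sketch using the population figures above; treating the literal block value "ALL" as the aggregate marker is an assumption about this one file.

```python
# (tract, block group, block, total population), copied from the slide.
ROWS = [
    ("471", "15", "ALL", 1804),
    ("471", "15", "1", 16),
    ("471", "15", "2", 10),
    ("471", "15", "3", 421),
    ("471", "15", "4", 381),
    ("471", "15", "5", 557),
    ("471", "15", "6", 419),
]


def split_rows(rows, marker="ALL"):
    """Separate aggregate rows (block == marker) from detail rows."""
    aggregates = [r for r in rows if r[2] == marker]
    details = [r for r in rows if r[2] != marker]
    return aggregates, details


aggregates, details = split_rows(ROWS)
naive_total = sum(r[3] for r in ROWS)      # counts everyone twice
detail_total = sum(r[3] for r in details)  # detail rows only
print(naive_total, detail_total, aggregates[0][3])
```

Comparing the detail total with the aggregate row is also a cheap completeness check on the detail records.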
Beware of masking data for political or confidentiality reasons.

Census  Block
Tract   Group  Block  Tot Pop  White  Black    Latino
471     15     ALL    1,804    1,593  176      35
471     15     1      16       11     blocked  blocked
471     15     2      10       8      blocked  blocked
471     15     3      421      377    29       15
471     15     4      381      329    47       5
471     15     5      557      519    38       0
471     15     6      419      349    58       12

Cells with data masked because total figure is too low.
World economic statistics
Source: CIA web site

Rank  Country         Exports
1     World           $10,330,000,000,000
2     European Union  $1,318,000,000,000
3     Germany         $1,016,000,000,000
4     United States   $927,500,000,000
5     China           $752,200,000,000
6     Japan           $550,500,000,000
7     France          $443,400,000,000
8     United Kingdom  $372,700,000,000
9     Italy           $371,900,000,000
10    Netherlands     $365,100,000,000
11    Canada          $364,800,000,000
12    Korea, South    $288,200,000,000
13    Hong Kong       $286,300,000,000
14    Belgium         $269,600,000,000
15    Russia          $245,000,000,000
16    Mexico          $213,700,000,000
Reference data & foreign keys
Codes need interpretations!
Two places to do it:
Documentation
Active reference tables.
master files (kernel-stable)
transactions (events)
reference tables (validation)
Reference data & foreign keys
Axes: small domain to large domain; low volatility to volatile.
Codes plotted: gender, U.S. states, countries, customer cd, vendor cd, product cd, facility, invoice type, ICD-9, DRG, transaction type, employee.
Inconsistent coverage
“We did door-to-door interviews in the towns, but we are only estimating the rural areas of the county.”
Sampling and projection.
Examples of sampling and projection:
Exit polls
Radio & TV audience
More reliable statistics:
Journal & newspaper subs
Web ad viewing

Detecting estimates
Look at frequently-occurring values

MOST COMMON VALUES OF PLACE POPULATION

POPULATION   RECORDS
--------------------
        25     9,542
       100     7,227
       200     5,344
        50     4,715
       150     4,014
       300     3,635
       250     3,180
       400     3,143
       125     3,129
       120     2,772
        40     2,573
Domain study, most frequent observed values.
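A domain study of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sample `populations` list is invented, and testing for multiples of 25 is just one possible definition of a "round" value.

```python
from collections import Counter


def most_common_values(values, top=5):
    """Domain study: the most frequent observed values with their record counts."""
    return Counter(values).most_common(top)


def share_round(values, base=25):
    """Fraction of values that are exact multiples of `base` (an assumed test for roundness)."""
    return sum(1 for v in values if v % base == 0) / len(values)


# Invented sample of place populations.
populations = [25, 100, 25, 137, 200, 50, 25, 100, 412, 150]

print(most_common_values(populations, top=2))
print(f"{share_round(populations):.0%} of values are multiples of 25")
```

A high share of round values among small places is a hint, not proof, that someone estimated rather than counted.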

Detecting estimates
Look at low end of value range.

Population   Records
     1           4
     2           3
     3           8
     4          12
     5          16
     6          21
     7          28
     8          32
     9          38
    10         185
    11          45
    12          47
    13          52
    14          55
    15          85
    16          61
    17          66
    18          68
    19          72
    20         198

[Chart: Records for each population statistic (x-axis: population value, 1 to 32; y-axis: records, 0 to 300)]
Spikes suggest estimates
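One way to find such spikes without a chart is to compare each count with its two neighbors. The sketch below uses the record counts from this slide; the `spikes` helper and the 1.3 ratio are assumptions chosen for illustration, not a rule from the presentation.

```python
# Record counts for population values 1 through 20, taken from the slide.
counts = {1: 4, 2: 3, 3: 8, 4: 12, 5: 16, 6: 21, 7: 28, 8: 32, 9: 38,
          10: 185, 11: 45, 12: 47, 13: 52, 14: 55, 15: 85, 16: 61,
          17: 66, 18: 68, 19: 72, 20: 198}


def spikes(counts, ratio=1.3):
    """Values whose count exceeds `ratio` times the mean count of both adjacent values."""
    found = []
    for value in sorted(counts):
        # Only test values that have a neighbor on each side.
        if value - 1 in counts and value + 1 in counts:
            neighbor_mean = (counts[value - 1] + counts[value + 1]) / 2
            if counts[value] > ratio * neighbor_mean:
                found.append(value)
    return found


print(spikes(counts))
```

Because the test needs a neighbor on each side, the last value in the table (20) is not evaluated here even though its count is visibly high.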

Detecting estimates (cont.)
Day of month in dates:
  Date of birth
  Filing date
  Posting date
Analysis requires a query tool which will extract day of month
[Chart: Record count by day of month (x-axis: day 1 to 31; y-axis: record count, 0 to 700)]

Detecting estimates (cont.)
[Chart: Record count by day of month (x-axis: day 1 to 31; y-axis: record count, 0 to 1600)]
[Chart: Record count by day of month (x-axis: day 1 to 31; y-axis: record count, 0 to 2000)]
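The day-of-month tally behind charts like these is easy to produce once the day can be extracted from each date. A minimal sketch follows; the `birth_dates` sample is invented, and which days tend to spike (for example the 1st or the 15th) is an assumption that depends on how the source fills in unknown days.

```python
from collections import Counter
from datetime import date


def day_of_month_counts(dates):
    """Record count for each day of month (1 to 31) across a list of dates."""
    tally = Counter(d.day for d in dates)
    return {day: tally.get(day, 0) for day in range(1, 32)}


# Invented sample of dates of birth.
birth_dates = [date(1961, 1, 1), date(1975, 6, 1), date(1980, 3, 15),
               date(1958, 7, 1), date(1990, 11, 23)]

day_counts = day_of_month_counts(birth_dates)
print(day_counts[1], day_counts[15], day_counts[23])
```

If days were recorded faithfully, the counts should be fairly even across days 1 to 28; a strong excess on one day suggests a default or an estimate.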

Key questions: Best available?
How does this dataset compare in quality with alternate sources?
  Quality
  Currency
  Granularity
  Price

Ask questions!
Some are obvious, and may even sound insulting.
  How many employees do you have?
  How many records are in the Employee File you are sending us?
  Are there any duplicate records in your file?
  How are they duplicate? Why?
  In a business sense, what does that mean?

Fundamentals of data quality
Introduction
Spelling out the Relationship
Data & information
Universe of knowledge
Data coming from bureaucracies
Asking for the right data
Potential data providers
Physical forms and media
Logical data architecture
Semantics & meaning
Documentation & metadata
Scope & completeness
Fund. of data quality
Update & refresh issues
Data collection bias
Ownership & legal
Confidentiality
Data flow surveillance
Conclusion

Key questions:
Are the values observed in each column valid?
  For codes, do they conform to a consistent domain?
  For quantities, do they conform to a reasonable range?
  For quantities, are there any significant outliers?
Are the values observed in each column reasonable (given context)?
Are the values in each column accurate?
Is the definition for each field (and the data contained therein) consistent over the entire dataset?
What is the precision of each numeric field?
  Is that precision consistent over the entire dataset?
How does this dataset compare in quality with alternate sources?
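The first three questions lend themselves to simple automated checks. The following is a sketch under stated assumptions: the helper names, the sample values, and the choice of a median-absolute-deviation test for outliers are illustrative and not taken from the presentation.

```python
from statistics import median


def invalid_codes(values, domain):
    """Codes that fall outside the agreed domain of valid values."""
    return sorted(set(values) - set(domain))


def out_of_range(values, low, high):
    """Quantities that fall outside a reasonable range."""
    return [v for v in values if not low <= v <= high]


def outliers(values, k=5.0):
    """Values more than k median-absolute-deviations away from the median."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return [v for v in values if mad and abs(v - m) > k * mad]


# Invented sample columns.
print(invalid_codes(["M", "F", "F", "U", "m"], {"M", "F", "U"}))
print(out_of_range([34, 51, 29, 212, 47], 0, 120))
print(outliers([100, 105, 98, 102, 5000, 101]))
```

These checks address validity only. Whether a valid value is reasonable or accurate still requires context and a definition.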

Again: Defining data quality.
"High quality data accurately describes reality, according to its complete definition."
--Michael Scofield

Components of data quality
  Instance (row) present? (issue of scope of entire file)
  Cell populated? (need to recognize null condition)
  Is value in cell valid? (compare against rules)
  Is value in cell reasonable? (requires context)
  Is value in cell accurate? (requires definition)
  How precise is the data in the cell?
  Is value in cell current? (time dimension of definition)
  Is the definition consistent over all dimensions?

"Completeness" of data (first definition)
  Complete table (all the rows)
  Incomplete table (some of the rows)
  70% complete

"Completeness" of data (2nd definition)
  Complete table (all the fields)
  Incomplete table (some of the fields)
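Both definitions can be measured separately. The sketch below is one possible reading: row completeness compares rows received with rows expected, and field completeness is taken to mean the share of populated cells in each field. The helper names and sample records are hypothetical.

```python
def row_completeness(rows_received, rows_expected):
    """First definition: share of the expected rows that are present."""
    return rows_received / rows_expected


def field_completeness(rows, fields):
    """Second definition (one reading): share of populated cells for each field."""
    return {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / len(rows)
        for f in fields
    }


# Invented customer records; None and "" are both treated as null conditions.
customers = [
    {"name": "Acme", "phone": "555-0100"},
    {"name": "Bolt", "phone": None},
    {"name": "Crane", "phone": "555-0102"},
    {"name": "Dune", "phone": "555-0103"},
]

print(row_completeness(7, 10))
print(field_completeness(customers, ["name", "phone"]))
```

Note that the first measure needs an independent figure for how many rows should exist, which usually has to come from the source.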

Don't confuse validity with accuracy!
Validity of data means it conforms to rules.
It is not necessarily reasonable
. . . or accurate.

Reasonability may be evaluated in context.

City            Temperature F
-----------------------------
PITTSBURGH      49
ERIE            95
CLEVELAND       96
HARRISBURG      89
PHILADELPHIA    88
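Every reading in this table is a valid temperature, yet one of them stands out against the others. A simple context test can surface it. The sketch below compares each reading with the group median; the `unreasonable` helper and the 20-degree tolerance are assumptions made for illustration.

```python
from statistics import median

# Readings from the slide.
readings = {"PITTSBURGH": 49, "ERIE": 95, "CLEVELAND": 96,
            "HARRISBURG": 89, "PHILADELPHIA": 88}


def unreasonable(readings, tolerance=20):
    """Readings that differ from the group median by more than `tolerance` degrees."""
    m = median(readings.values())
    return {city: t for city, t in readings.items() if abs(t - m) > tolerance}


print(unreasonable(readings))
```

The test only flags the value for review. It cannot say whether the reading is wrong or the weather was unusual.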

Kinds of reports for data analysis
Goal: Give visibility to data behavior.
1. Domain studies
   High value format
   Low value format
2. Inter-field dependency tests
3. Referential integrity tests
4. Formatted dumps
5. Other reasonability tests
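As an illustration of item 2, an inter-field dependency test checks that the fields within one record agree with each other. The record layout (`order_date`, `ship_date`, `status`) and both rules in this sketch are hypothetical.

```python
from datetime import date


def dependency_failures(rows):
    """Return (row id, reason) pairs for rows that break an inter-field rule."""
    failures = []
    for r in rows:
        # Rule 1: an order cannot ship before it was placed.
        if r["ship_date"] is not None and r["ship_date"] < r["order_date"]:
            failures.append((r["id"], "shipped before ordered"))
        # Rule 2: a SHIPPED status requires a ship date.
        if r["status"] == "SHIPPED" and r["ship_date"] is None:
            failures.append((r["id"], "status SHIPPED but no ship date"))
    return failures


# Invented order records.
orders = [
    {"id": 1, "order_date": date(2008, 5, 1), "ship_date": date(2008, 5, 3), "status": "SHIPPED"},
    {"id": 2, "order_date": date(2008, 5, 2), "ship_date": date(2008, 4, 30), "status": "SHIPPED"},
    {"id": 3, "order_date": date(2008, 5, 4), "ship_date": None, "status": "SHIPPED"},
]

for row_id, reason in dependency_failures(orders):
    print(row_id, reason)
```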

Where do you look at the data?
  Staging database.
  Exact replica of source data.
[Diagram labels: External data source; Simple ETL; Replica; Complex ETL; Target database; Query tool. The same data architecture describes both the external data source and the replica.]

Update and refresh issues
Key questions:
  Is this the only dataset you are going to acquire over time?
  Or, are you going to get new versions or updates?
  Will any updates you get be incremental (just the changes) or complete refreshment?
  Are you going to get corrections as soon as the source knows about them?
  How will the source differentiate updates from corrections?
  Would they be found in the same dataset?

Update: Incremental vs. full refresh
Can you distinguish between legitimate changes vs. error corrections?
Can you detect changes or corrections? Do you need to know about them? (downstream propagation)
Saves time! Simple processing.
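When the source sends a full refresh and does not say what changed, the changes can still be derived by comparing the new delivery with the previous one. This is a minimal sketch, assuming each record has a stable key; the `diff_snapshots` helper and the sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two full refreshes, each a dict keyed by the record's primary key."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Invented deliveries: key -> (name, state).
last_month = {101: ("Acme", "CA"), 102: ("Bolt", "NV"), 103: ("Crane", "OR")}
this_month = {101: ("Acme", "CA"), 102: ("Bolt", "AZ"), 104: ("Dune", "WA")}

added, removed, changed = diff_snapshots(last_month, this_month)
print("added:", added, "removed:", removed, "changed:", changed)
```

A comparison like this shows that record 102 changed, but not whether the change is a legitimate update or an error correction. That distinction has to come from the source.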

Data collection bias
Key questions:
  What was the original purpose of the data collection efforts?
  What is your purpose or goal in acquiring the dataset?
  What is the business value of the data to you?
  What values and goals entered into the collection, organizing, and other preparation of the data?
  What purpose does the source/provider have in making the data available (profit, persuasion, altruism)?

Two kinds of data generation
Remember this?

Data as byproduct of business processes
  commercial sector
    banking
    manufacturing
    retail sales
    customer service activities (utilities, communications, etc.)
    hospital patient records & billing
    insurance policy setup and claims
    education: student enrollment, grades, etc.
  governments
    social welfare and public assistance
    tax collection
    city services (trash, utilities)
    voting
    public libraries (patron activity)

Data gathered as non-business research
  field surveys of land, topo, etc.
  observations of external behavior: weather, oceanography, traffic, census, economics, astronomy, seismology, special interview-based studies
  satellite & aerial imagery

Hybrid: strategic intelligence, police surveillance, mineral exploration, etc.

Bias towards "certainty"
Coded fields on data-entry screens discourage ambiguity…
  …and encourage illusion of precision.
Premature entry of coded data results in non-nulls,
  …and illusion of completeness.
Tabular structures demand that you can't say "about" next to a piece of data.
Tabular data implies / expects precision
data ambiguity lite

"About" …
Textual and non-tabular expression allow imprecision.
  Q. "When does your plane leave?"
  A. "About 3 PM."
  Q. "How old is the suspect?"
  A. "In his mid-forties."
You cannot tabularize that!
or, as they say in Canada…
data ambiguity lite

Do you understand the data gathering process?
  What policies / methods / procedures / screens introduce data errors?
  What are sources of bias?

Ownership, usage, and liability
Key questions: ownership & usage
  Are you unlimited in the usage of the data?
  Can you resell the data?
  Can you share the data with other organizations?
  What restrictions are placed on you in using the data?

Key questions: ownership & usage (2)
  Does the source (person or organization) take any responsibility for the quality, completeness, accuracy of the data?
  Are there any implied limits to the "suitable" usage of the data?
  Are you planning on using the data for something it was not originally designed for?

Confidentiality
Key questions: confidentiality
  Does the dataset contain any data which (other than its proprietary nature) should be considered confidential?
  By what criteria?
  Why might it be confidential or sensitive?
  Who are the interested parties?

Strategic and competitive
Even the fact that you simply have the data may be secret.
[Map labels: Bougainville; Guadalcanal; Henderson Field; Ballale airfield]
Adm. Yamamoto-Isoroku
To cover up the fact that the Allies were reading Japanese code, American news agencies were told that civilian coast-watchers in the Solomons saw Yamamoto boarding a bomber in the area.

Data flow surveillance
Key questions:
Can you count on your source to notify you about any changes in…
  logical data architecture
  scope
  quality
  units of measure
  precision
  bias

Designing for the Future
Designing an on-going Surveillance Program for monitoring the stability of source data behavior, quality, and meaning, and the appropriateness of your mapping.
In other words…. Preventing nasty surprises!

Designing on-going data surveillance to protect yourself in the future.
First time analysis is tedious.
  Lots of exploration of the data.
  Lots of decisions.
Do you want to do it with every incremental update of the database?
No, but you can't assume that next month's data will behave the same as the first tape.
Expect the unexpected.

Goal of testing imported data
  Protect yourself against injury because of errors or inconsistencies in imported data.
  Be aware of changes to meaning of incoming data.
  Be aware of changes of scope of incoming data.
  Catch the problems as soon as possible… …not during the database update process.
  Hence, make the loading process as fast and smooth as possible.
Semper vigilans
Always vigilant

The challenge of imported data
Each piece of data is an observation about reality, far from where you sit.
You cannot go out there and verify each piece of data which you import.
Even sampling is very difficult.
You can only test the data against. . .
  Absolute rules about behavior
  Reasonability tests to spot problems.

"What can possibly go wrong?"
On updates, your data supplier can…
  Stop populating a field
  Filter out records for some reason
  Redefine a code used by them internally
  Re-use a field for a new meaning
  Give you new data you didn't expect
  Change their source (and quality) of a given field
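The first two failures in this list can be caught by comparing each delivery with the one before it. The sketch below is an assumption-laden illustration: the helper names, the sample data, and the thresholds (a 10-point rise in null rate, a 20% drop in record count) are invented, and real thresholds would depend on the dataset.

```python
def null_rate(rows, field):
    """Fraction of rows in which `field` is missing, None, or empty."""
    populated = sum(1 for r in rows if r.get(field) not in (None, ""))
    return 1 - populated / len(rows)


def delivery_warnings(previous, current, fields, max_rise=0.10, max_drop=0.20):
    """Warn when a field stops being populated or records appear to be filtered out."""
    warnings = []
    if len(current) < (1 - max_drop) * len(previous):
        warnings.append("record count dropped sharply")
    for f in fields:
        if null_rate(current, f) - null_rate(previous, f) > max_rise:
            warnings.append(f"{f}: null rate rose")
    return warnings


# Invented deliveries: the supplier stops populating `phone` for half the records.
previous = [{"id": i, "phone": "555-01%02d" % i} for i in range(10)]
current = [{"id": i, "phone": None if i % 2 else "555-01%02d" % i} for i in range(10)]

print(delivery_warnings(previous, current, ["id", "phone"]))
```

Redefined codes and re-used fields are harder to catch this way, because the values can remain populated and valid while their meaning changes.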

Watch for . . .
Unexpected changes in data architecture of source
  New record types or segment
  Changes in cardinality between logical entities
  New fields
  Change in field length or usage
Unexpected changes in a field or column
  Changes in domain of valid values
  Changes in numeric behavior (e.g. going negative)
  Changes in null or "missing value" behavior
Semper vigilans
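Several of these changes can be detected mechanically by comparing a new delivery with a prior one. The following sketch covers new or dropped fields, new code values, and a quantity going negative; the `drift_report` helper and the sample records are hypothetical.

```python
def drift_report(previous, current, code_field, amount_field):
    """Summarize structural and behavioral changes between two deliveries."""
    prev_fields = set().union(*(r.keys() for r in previous))
    cur_fields = set().union(*(r.keys() for r in current))
    prev_codes = {r[code_field] for r in previous}
    cur_codes = {r[code_field] for r in current}
    return {
        "new_fields": sorted(cur_fields - prev_fields),
        "dropped_fields": sorted(prev_fields - cur_fields),
        "new_codes": sorted(cur_codes - prev_codes),
        # True only when the quantity was never negative before and is now.
        "went_negative": (min(r[amount_field] for r in previous) >= 0
                          and min(r[amount_field] for r in current) < 0),
    }


# Invented deliveries.
may_tape = [{"id": 1, "type": "A", "amount": 10.0},
            {"id": 2, "type": "B", "amount": 0.0}]
june_tape = [{"id": 3, "type": "A", "amount": -5.0, "region": "W"},
             {"id": 4, "type": "C", "amount": 7.5, "region": "E"}]

print(drift_report(may_tape, june_tape, "type", "amount"))
```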

Kinds of tests of imported data
• Conformance to absolute rules, e.g. valid value tests, etc.
  Rules relative to….
    1. Expected values
    2. Own record
    3. Other records
• Reasonability testing
  Detecting anomalous behavior based on context (e.g. "This doesn't seem right!")
  Contexts and scope:
    1. Own record
    2. Own tape
    3. Prior tapes from this source
    4. Whole database

Flow of data through tests (ideal vision)
[Diagram labels: New data; Data rules; Absolute checks; Reasonability tests; Database; Context; temp; Scrub; OK (so far). Outcomes: Reject (suspense) whole tape; Reject (suspense) this record; Sound warning and hold tape (record) for further tests or explanation.]
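The flow can be sketched at the record level as a small routing function: absolute checks come first and lead to rejection, reasonability tests come second and lead to a hold. This is only an illustration of the idea; the `route` helper, the rule functions, and the sample records are hypothetical.

```python
def route(record, absolute_checks, reasonability_tests):
    """Return 'reject', 'hold', or 'ok' for one incoming record."""
    if not all(check(record) for check in absolute_checks):
        return "reject"  # violates an absolute rule: goes to suspense
    if not all(test(record) for test in reasonability_tests):
        return "hold"    # valid but suspicious: hold for further tests or explanation
    return "ok"          # OK (so far)


# Hypothetical rules for a place-population record.
absolute_checks = [
    lambda r: r["state"] in {"CA", "NV", "OR"},
    lambda r: r["population"] >= 0,
]
reasonability_tests = [
    lambda r: r["population"] < 5_000_000,
]

for rec in ({"state": "CA", "population": 1200},
            {"state": "ZZ", "population": 1200},
            {"state": "NV", "population": 9_000_000}):
    print(rec, route(rec, absolute_checks, reasonability_tests))
```

A whole-tape version would apply the same two stages to aggregate measures, such as record counts or null rates, before any record is loaded.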

Test incoming data A.S.A.P.
Test the data as soon as you have the source available, … not when you are updating the database or DW.
Test the data before you do any scrubbing.
If you do scrub, apply original tests and other tests again after scrubbing.

Data ambiguity
Tabular structures seduce us into thinking all cells have equal reliability.
Not necessarily so!
data ambiguity lite

Summary & conclusion
  Understand your users and their data needs.
  Understand the politics of your source.
    Do they have reason to be guarded?
  Understand the burden your request for data places upon your source.
  Decide if you need one-time, or repeated updates.
[Chart: Manufacturing as share of total employment, 1950 to 2010 (y-axis: percent, 0.0 to 35.0); labeled values 32.1 % and 11.7 %]
[Chart: Share of consumption by category, 1929 vs. 2001 (y-axis: percent, 0.0 to 30.0). Categories: Motor vehicles; Furniture & household; Other durable; Food; Clothing & shoes; Gasoline, fuels; Other non-durable; Housing; Household operation; Transportation; Medical care; Recreation; Other services]

Summary & conclusion (cont.)
Before signing the agreement…
  Review the data documentation
  Understand the data architecture and paradigm
  Thoroughly test some sample data
  Be sure it conforms to your expectations…
    format (case, etc.)
    scope
    unit of measure
    quality and precision
Try to break it!

The End
…unless we keep going
Michael Scofield
[email protected]@aol.com
“No vegetables were harmed in the making of this presentation.”