Data Quality Journey at IRS RAS – The Importance of Metadata
Robin Rappaport, CAP
Senior Operations Research Analyst
Data Quality and Metadata Team Leader
IAIDQ Webinar Facilitator
INFORMS Certified Analytics Professional (CAP) Exam Committee Member
IRS | Research, Analysis, and Statistics (RAS)
2014 International Data Quality Summit
About Our Speaker
• Data Quality and Metadata Team Leader responsible for delivery of the Data Quality Initiative for Research Databases at the Internal Revenue Service (IRS). The team's work contributed to the IRS being awarded The Data Warehousing Institute (TDWI) 2011 Best Practices Award for the Compliance Data Warehouse (CDW), a Computerworld Honor, and a Government Computer News (GCN) Gala Award.
• Over 25 years of experience as a Data Quality practitioner. Undergraduate degree in Economics with Computer Science. Graduate work in Operations Research with a concentration in Mathematical Modeling in Information Systems. Worked in both the private (6 years) and public (since 1990) sectors. Positions include Computer Programmer, Systems Analyst, and Operations Research Analyst.
• International Association for Information & Data Quality (IAIDQ) Webinar Facilitator; Member, Institute for Operations Research and the Management Sciences (INFORMS), Certified Analytics Professional (CAP) Exam Committee; Chairman, Individual Membership for Washington, D.C. chapter from 1987 to 1990; elected Secretary and served from 1990 to 1991.
• Specialties: Data Quality, Metadata, SAS, Analytics, Sybase SQL.
Data Quality Journey
The Data Quality journey at the U.S. Internal Revenue Service (IRS), Research, Analysis, and Statistics (RAS), began in 2005 for the Compliance Data Warehouse (CDW), which was started in 1997.
As the largest IRS database, CDW provides data, metadata, tools, training, and computing services to hundreds of research analysts working to improve tax administration.
Data Quality Initiative for IRS Research Databases
• January 2005: Temporary assignment to help with strategies and initiatives to improve the quality and delivery of data and information services to the Research community.
• March 2005: Discussion paper by the Director, Research Databases, entitled “Using the Compliance Data Warehouse (CDW) to Improve Data Quality for Research”.
• September 30, 2005: Permanent position.
Proof of Data Quality Value
• Problem: Why do CDW numbers differ from another data source used by the Projections & Forecasting Group (PFG)?
► Response from IRS to Congress should be authoritative.
► Determined differences between CDW and the PFG data source.
► Compared CDW numbers to the CDW data source.
• Actual Problem: CDW numbers did not match the CDW data source. Only certain states were affected.
• Resolution: Three missing tapes identified and loaded; permanent position obtained.
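The comparison described on this slide (warehouse counts checked against their own source, broken out by state) can be sketched roughly as follows. This is only an illustration in Python, not the method the team used; the function names, the record layout, and the sample rows are all hypothetical.

```python
from collections import Counter

def count_by_state(records):
    """Return row counts keyed by state code (hypothetical record layout)."""
    return Counter(r["state"] for r in records)

def reconcile(source_counts, warehouse_counts):
    """Return {state: (source_count, warehouse_count)} where the counts differ."""
    states = set(source_counts) | set(warehouse_counts)
    return {
        s: (source_counts.get(s, 0), warehouse_counts.get(s, 0))
        for s in sorted(states)
        if source_counts.get(s, 0) != warehouse_counts.get(s, 0)
    }

# Made-up sample rows: the warehouse is missing one VA record.
source_rows = [{"state": "MD"}, {"state": "MD"}, {"state": "VA"},
               {"state": "VA"}, {"state": "DC"}]
warehouse_rows = [{"state": "MD"}, {"state": "MD"}, {"state": "VA"},
                  {"state": "DC"}]

mismatches = reconcile(count_by_state(source_rows),
                       count_by_state(warehouse_rows))
print(mismatches)  # {'VA': (2, 1)}
```

A gap confined to a few states, as in the case above, points toward a missing load unit rather than a definitional difference between the two systems.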
Why a Data Quality Initiative for IRS Research Databases?
Data is fundamental to research.
Data Quality should be applied wherever data is analyzed.
Analysis requires data to be obtained.
Analysis requires assessment of quality of data.
Analysis requires knowledge of what the data represents.
Analysis requires knowledge of what fields should be selected.
Analysis requires knowledge of accuracy of the data.
Analysis requires knowledge of reliability of the data.
Why is Metadata Important for IRS Research Databases?
Data is fundamental to research.
Metadata is the key to turn data into information.
Research requires proper understanding of the data.
Research requires knowledge of what the data represents.
Research requires knowledge of what fields should be selected.
Findings and conclusions only as good as the data
[Figure: Data Analysis Ladder. Rungs: Raw Data, Static Reports, Ad hoc Reports, Descriptive Analysis, Predictive Modeling, Simulation & Optimization. Questions answered: What happened? Why did it happen? What will happen? Axes: Return on Investment; Data and computing requirements.]
Research only as good as understanding of the data
[Figure: Data Analysis Ladder, repeated from the preceding slide.]
Good research requires quality data and metadata
IRS Strategic Foundations: Invest for High Performance. Use data and research across the organization to make informed decisions and allocate resources.
RAS Goal: Become our customer’s preferred source. Centralized source of timely, relevant, accurate, accessible, interpretable, and coherent Data; Metadata; Tools; System Infrastructure; and Training.
Ongoing Effort: Improve quality of research databases; increase number of users; and enhance online tools and web-enabled knowledge
Customer Comments: “…should win … award for making such useful data available to the research community and working … to ensure data accuracy and consistency!”
Industry Recognition: The Data Warehousing Institute (TDWI) 2011 Best Practices; IAC Excellence.gov Award Finalist 2008; Computerworld Honor 2007; and Award from Government Computer News (GCN) 2007 for “IRS’ Update of Compliance Data Warehouse makes analysis less taxing”.
Government Recognition: IRS Enterprise Data Management Office, Canada Revenue Agency, Department of Treasury, Pennsylvania Department of Taxation, National Security Agency (NSA), and Federal Aviation Administration (FAA).
Lessons Learned 1
• No one dimension is more important than the others.
• Metadata is more important than people realize.
• What it takes to be successful writing metadata for researchers.
CDW Aims to Address Data Quality in Six Areas
Area: Goal
Timeliness
► Minimize amount of time to capture, move, and release data and metadata to users.
► Leverage new technologies or processes, where appropriate, to increase efficiencies throughout the data and metadata supply chain.
Relevance
► Ensure data and metadata gaps are filled.
► Make new investments in data and metadata that reflect expected future priorities.
Accuracy
► Create processes to routinely assess fitness of data and metadata.
► Report quality assessment findings.
► Publish statistical metadata through the CDW website.
► Cross-validate release data against source and other system data.
Accessibility
► Improve organization and delivery of metadata.
► Provide online knowledge base.
► Facilitate more efficient searching for metadata.
► Invest in third-party tools to enhance access and analysis of data/metadata.
Interpretability
► Standardize naming and typing conventions.
► Create clear, concise, easy to understand data definitions (metadata).
Coherence
► Develop common key structures to improve record matching across database tables.
► Ensure key fields have common names and data types (metadata).
Source of the six DQ dimensions: Federal Committee on Survey Methodology; Gordon Brackstone, “Managing Data Quality in a Statistical Agency”, Survey Methodology, December 1999.
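The Coherence goal above (key fields sharing common names and data types) lends itself to an automated check. The following is a minimal Python sketch under assumed inputs; the schema dictionary, table names, and column types are invented for illustration and are not taken from CDW.

```python
from collections import defaultdict

def find_type_conflicts(schemas):
    """Return {column: {type: [tables]}} for columns declared with more than one type.

    `schemas` maps table name -> {column name: declared data type}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for table, columns in schemas.items():
        for column, dtype in columns.items():
            seen[column][dtype].append(table)
    return {
        column: {dtype: sorted(tables) for dtype, tables in by_type.items()}
        for column, by_type in seen.items()
        if len(by_type) > 1
    }

# Hypothetical schemas: "tin" is typed differently in the two tables.
schemas = {
    "returns": {"tin": "char(9)", "tax_yr": "smallint"},
    "payments": {"tin": "integer", "tax_yr": "smallint"},
}

conflicts = find_type_conflicts(schemas)
print(conflicts)  # {'tin': {'char(9)': ['returns'], 'integer': ['payments']}}
```

A key column that appears in this report would need to be standardized before the tables can be joined reliably.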
Become our Customer’s Preferred Source
Focus on Improvements
Frequency of Data Updates: Annual, Semi-Annual, Quarterly, Monthly, Weekly.
Data Augmentation: Derived fields, Pre-joining tables, Summary tables.
Data Profiling, Rule Development, Quality Assessment.
Website Redesign and Search Capabilities.
Metadata Development and Maintenance (Data Definitions, Lookup Tables, Column Profile).
Standard Key Structures, Naming Conventions, and Type Designations.
Researcher Quotes
• It is frustrating to use data without metadata.
• How do I select the fields for my analysis without knowing what they mean (Name, Format, No definition)?
• Please make sure to update dates after loading data (Latest Update Cycle, Month, Year).
CDW Fact Sheet
What is the Compliance Data Warehouse (CDW)?
What is CDW? It is an environment that provides data, tools, and computing services to the IRS Research community. It offers a high-performance analytical environment for conducting research studies.
Why is CDW important? It is the preferred environment for over 1,000 research analysts and other business users to perform analytical and high-volume data processing. Users include IRS, Treasury, GAO, and others.
Who can benefit from CDW? Those who require access to a comprehensive set of compliance data for research and analysis in a secure environment. Average daily database queries: 8,000.
The Compliance Data Warehouse (CDW) is an analytical data environment that is specifically designed to support research activities in the IRS. It is managed and administered by RAS. Key features of the CDW environment include:
Data: Taxpayer-level data from nearly 40 different legacy sources, including tax returns, customer accounts, information returns, case management systems, and third-party data. With nearly 2 petabytes of total data, CDW is the largest database in the IRS. Data sources are released on a weekly, monthly, quarterly, and annual basis.
Metadata: Web-based metadata and dynamic data profiling for all databases, including definitions, lookup tables, cross-references, and other artifacts for over 40,000 data elements and over 1 million searchable attributes. Average daily web-based queries: 1,500.
Tools: Software licenses for SQL clients, SAS, SAS Enterprise Miner/Text Miner, R, Hyperion Intelligence, ArcGIS, and support for any ODBC- or JDBC-compliant application.
Computing: Server-based computing environment for remote processing of complex and high-volume jobs; flexible storage management solutions for both temporary and permanent user files.
Training and Support: In-house training for SQL and Hyperion; group rate solutions for SAS and ArcGIS; and general support for data, tools, and account management services.
Security: FISMA Certification & Accreditation, Online 5081 Authorization, TIN Masking, Database Audits, System Logging, and other Security Controls.
CDW Data Road Map
What types of data are available in CDW?
Major Categories of Data in CDW
Customer Account: Payments, abatements, credits, adjustments, freeze conditions, additional assessments, adjustments, reversals, bankruptcies, claims, penalties, offer in compromise, etc.
Tax Return: Federal tax returns filed by individuals, businesses, exempt organizations, and government entities. Includes Forms 1040, 6251, 3520, 1120, 1065, 1041, 990, and others.
Information Reporting: Information filed by a financial institution, employer, partnership, or other party on behalf of a taxpayer. Includes Forms W-2, 1098, 1099-B, 1099-MISC, Schedule K-1, and others.
Compliance/Case Management: Case management systems containing information on examinations, delinquent accounts, underreporter activity, enforcement revenue, or other compliance-based data.
Third-Party: Taxpayer data from federal-state sharing agreements, treaty partners, and other federal agencies, including Social Security Administration, Department of Justice, and State Department.
Other: Customer surveys, national statistical samples, credit bureau data, publicly available data, fee-based financial data, and other sources.
Customer Service
CDW: Infrastructure
[Diagram: CDW infrastructure. Components shown: Database Server (Sybase IQ); Application Server (SAS, Hyperion); Shared Storage (2 petabytes); IRS Network; client access through SPSS, Access, SQL, Web, and SAS.]
CDW Metadata
How is CDW metadata created and published?
What is metadata? Metadata is sometimes defined as “data about data”. It is ideally maintained as a repository of searchable information about available data.
Why is metadata important? Metadata is the key to turn data into information. It can be used as a tool to select appropriate data for a study and then understand and interpret findings. It can also be used as a basis for developing business rules for data validation and augmentation.
Who can access CDW metadata? Anyone with access to the IRS intranet can view CDW metadata on the CDW website.
Examples of CDW Metadata at the Database, Table, and Column Level
Level: Types of Metadata Attributes
Database: Name, Source, Description, First Year, Last Year, Last Updated
Table: Name, Database, Description, Frequency, Frequency Type, First Year, Last Year, Last Updated, Number of Columns
Column: Name, Legacy Name, Description, Table, First Year, Last Year, Last Updated, Data Type, Distribution Type, Range Type, Nulls Allowed, Has Lookup, Minimum Length, Maximum Length, Primary Key, Last Updated, Legacy Source, Legacy File Name
CDW maintains a web-based repository of metadata for over 40,000 columns of data. Metadata is available at the database, table, and column level, and is created and updated based on some of the following sources:
• Internal Revenue Manual (IRM)
• Document 6209 (IRS Processing Codes and Information)
• Functional Specification Packages (FSPs), Computer Programming Handbooks, Core Record Layouts (CRLs), Program Requirement Packages (PRPs)
• Tax Returns and Tax Return Instructions
• Other official documents and materials
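One way to picture the three metadata levels listed above is as nested records with a simple search over column names and descriptions. The Python sketch below is an assumption-laden illustration: the class names, the subset of attributes, and the sample entries are hypothetical and do not describe how the CDW repository is actually stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ColumnMeta:
    """Column-level metadata (a subset of the attributes listed on the slide)."""
    name: str
    legacy_name: str
    description: str
    data_type: str
    nulls_allowed: bool = True
    has_lookup: bool = False
    first_year: Optional[int] = None
    last_year: Optional[int] = None

@dataclass
class TableMeta:
    """Table-level metadata; the column count is derived from the column list."""
    name: str
    description: str
    frequency: str
    columns: List[ColumnMeta] = field(default_factory=list)

    @property
    def number_of_columns(self) -> int:
        return len(self.columns)

@dataclass
class DatabaseMeta:
    """Database-level metadata holding its tables."""
    name: str
    source: str
    description: str
    tables: List[TableMeta] = field(default_factory=list)

    def search(self, term: str) -> List[Tuple[str, str]]:
        """Return (table, column) pairs whose name or description contains term."""
        term = term.lower()
        return [
            (t.name, c.name)
            for t in self.tables
            for c in t.columns
            if term in c.name.lower() or term in c.description.lower()
        ]

# Hypothetical entries, for illustration only.
db = DatabaseMeta(
    name="EXAMPLE_DB", source="Example legacy system", description="Sample database",
    tables=[TableMeta(
        name="returns", description="Sample return table", frequency="Annual",
        columns=[
            ColumnMeta("filing_status_cd", "FILING-STATUS", "Filing status code", "char(1)", has_lookup=True),
            ColumnMeta("tax_yr", "TAX-YEAR", "Tax year of the return", "smallint"),
        ],
    )],
)
print(db.tables[0].number_of_columns)  # 2
print(db.search("status"))             # [('returns', 'filing_status_cd')]
```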
Standard Template for CDW Metadata Definition
Benefits
• Reference to source
• Informative
• Clear
• Concise
• Complete
• Easy to understand
• Consistent
• Easier to develop
• Easier to maintain
The <legacy name> ....
Choose all that apply:
was added in <extract cycle>.
(It) has data through <extract cycle>.
(It) is <short description: clear, concise, and easy to maintain>.
It is reported on <Form #, Line#>. (It is (transferred to OR included on) <Form #, Line#> (notated '<notation>').)
(The format is (<word for number> character(s) OR numeric). OR It is reported in (positive, negative, or positive and negative) (whole dollars OR dollars and cents).)
Valid values (if known) are ....
Values are (if valid not known) .....
It is (zero, blank, null) (if not present OR if not applicable).
(Values (other than valid) also appear.)
(See <related fields>.)
Note: Basic template for form-related fields. Other variations exist for indicators, codes, and computer-generated fields.
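A controlled-vocabulary template of this kind can be filled in mechanically once the facts about a field are known. The Python sketch below covers only a few of the template's optional sentences; the function name, its parameters, and the sample field are hypothetical, and the real template has more variations than are shown here.

```python
def render_definition(legacy_name, short_description, added_in=None,
                      data_through=None, reported_on=None,
                      valid_values=None, related_fields=None):
    """Assemble a definition from the optional template sentences that apply."""
    clauses = []
    if added_in:
        clauses.append(f"was added in {added_in}.")
    if data_through:
        clauses.append(f"has data through {data_through}.")
    clauses.append(f"is {short_description}.")
    if reported_on:
        clauses.append(f"is reported on {reported_on}.")

    # The first clause follows the legacy name; later clauses start with "It".
    sentences = [f"The {legacy_name} {clauses[0]}"]
    sentences += [f"It {clause}" for clause in clauses[1:]]

    if valid_values:
        sentences.append("Valid values are " + ", ".join(valid_values) + ".")
    if related_fields:
        sentences.append("See " + ", ".join(related_fields) + ".")
    return " ".join(sentences)

# Hypothetical field, for illustration only.
definition = render_definition(
    "FILING_STATUS_CD",
    "the filing status claimed on the return",
    added_in="200501",
    valid_values=["1", "2", "3"],
)
print(definition)
# The FILING_STATUS_CD was added in 200501. It is the filing status claimed
# on the return. Valid values are 1, 2, 3.
```

Generating the sentence order from one function is what keeps definitions consistent across writers, which is the benefit the slide attributes to the template.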
SMEs Needed for Metadata Development: Value Statement
If you provide Subject Matter Experts (SMEs) for metadata development, you can expect better understanding of data fields due to more complete definitions, which should result in improved and more reliable research.
If you do not, then you can expect continued difficulty in identifying fields for research, which could result in flawed research and misinformed decisions.
Developed at DGIQ 2010 Conference: Danette McGilvray’s tutorial.
* Presently looking for hard workers capable of synthesizing information from various sources into clear, concise definitions.
These volunteers often volunteer others.
Skills Required to Write Metadata
• Careful attention to detail.
• Innate curiosity.
• Understand data through exploratory statistics and Structured Query Language (SQL) programming or statistical packages, such as SAS.
• Understand the benefit of using a standard template (controlled vocabulary).
• Textual research skills (key word searches).
• Knowledge of Service documentation or Subject Matter Expertise (SME).
• Ability to write in a clear, concise manner.
Loading Metadata to the CDW Website
• SAS® enables efficient delivery of large-scale metadata to the IRS Research community as part of a broader data quality initiative.
• Column definitions and other attributes are imported from Excel spreadsheets, iteratively processed using SAS® macros, and exported to Microsoft SQL Server.
• Metadata in SQL Server is managed through DATA steps and procedures via the ODBC engine.
• Metadata published on the CDW website in a usable format helps IRS researchers quickly search for and understand the meaning of data available for analysis.
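The import, process, and export flow described above can be sketched in a few lines. Note the substitutions: this illustration uses Python instead of SAS, CSV text instead of an Excel spreadsheet, and an in-memory SQLite database instead of Microsoft SQL Server. The table layout, column names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

def load_column_metadata(csv_text, conn):
    """Load column definitions from CSV text into a metadata table; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS column_meta ("
        "table_name TEXT, column_name TEXT, definition TEXT, "
        "PRIMARY KEY (table_name, column_name))"
    )
    rows = [
        (
            r["table_name"].strip(),
            r["column_name"].strip(),
            " ".join(r["definition"].split()),  # collapse stray whitespace
        )
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # INSERT OR REPLACE lets a re-run update definitions that already exist.
    conn.executemany("INSERT OR REPLACE INTO column_meta VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

# Made-up spreadsheet content.
csv_text = (
    "table_name,column_name,definition\n"
    "returns,filing_status_cd,  Filing status   claimed on the return.\n"
    "returns,tax_yr,Tax year of the return.\n"
)

conn = sqlite3.connect(":memory:")
loaded = load_column_metadata(csv_text, conn)
hits = conn.execute(
    "SELECT column_name FROM column_meta WHERE definition LIKE ?", ("%filing%",)
).fetchall()
print(loaded, hits)  # 2 [('filing_status_cd',)]
```

Keying the table on (table name, column name) means a corrected spreadsheet can simply be reloaded, which suits metadata that is revised often.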
CDW: Viewing Metadata
[Diagram: viewing metadata. Components shown: Microsoft SQL Server; IRS Network; CDW Website.]
CDW Metadata
What metadata is available on the CDW website?
Database and Table-Level Metadata: Names, descriptions, sources, availability, update status, and links to other internal websites for program or operational information.
Metadata Availability: Database-Level, Table-Level, Column-Level, Lookup Tables, Reviews.
What metadata is available on the CDW website (cont’d)?
Column-Level Metadata: Definitions, legacy references, availability, release frequency, data types, primary key candidates, nulls, distribution type, range type, and other attributes.
What metadata is available on the CDW website (cont’d)?
Lookup Tables and Column Reviews: Definitions for unique values of discrete (categorical) fields, and the ability to view and submit comments about anomalies or other features.
Standardized Search Results on the CDW Website
CDW Data Profiling
Identifying patterns (and problems) in data with profiling
What is Data Profiling? Data profiling involves a standardized analysis of data to determine its completeness and suitability for use.
Why is profiling important? Analyzing basic patterns in data can reveal both insights and anomalies that are useful in describing the underlying structure and improving data quality.
Who can profile CDW data? As of March 2010, those with access to the IRS intranet can profile data via the CDW website. One does not need to be a CDW user to view the CDW website and perform basic profiling.
Data profiling is a process of analyzing data to gauge overall suitability for use. Common profiling tasks can help identify:
• Invalid values in fields (values out of range)
• Missing values or empty fields (fields containing no data at all)
• Inconsistent methods of representing the same value
• Data elements used for purposes other than expected
• Violation of business rules
• Unrealistic frequencies or percentages of specific values in a column
• Violations of referential integrity
• Misspelled text values
The Role of Metadata in Data Profiling
Metadata includes valid values, when known. The process of data profiling is more informative when valid values are known. The actual values identified as part of the data profiling can be compared to the valid values. This allows for Data Validation of a specific data field using the data in that one field. Metadata can also enable cross-validation to what should be the same field on other data tables. Metadata can also include more complex business rules that can be used for validation.
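Comparing actual values against the valid values recorded in metadata, as described above, can be sketched as a single-column profile. This Python illustration covers only two of the listed tasks (missing values and invalid values); the function name, the treatment of blanks as missing, and the sample data are assumptions.

```python
from collections import Counter

def profile_column(values, valid_values):
    """Count missing entries and tally values not in the documented valid set."""
    def is_missing(v):
        return v is None or str(v).strip() == ""

    missing = sum(1 for v in values if is_missing(v))
    counts = Counter(v for v in values if not is_missing(v))
    invalid = {v: n for v, n in counts.items() if v not in valid_values}
    return {"rows": len(values), "missing": missing, "invalid": invalid}

# Hypothetical column values; the valid set would come from the metadata.
values = ["1", "2", "2", "", None, "9", "1"]
report = profile_column(values, valid_values={"1", "2", "3"})
print(report)  # {'rows': 7, 'missing': 2, 'invalid': {'9': 1}}
```

Without the valid-value set, the same profile could only list the observed values; it is the metadata that turns the listing into a validation.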
CDW Data Profiling
What data profiling features are available on the CDW website?
Data Profiling Features: Table Statistics, Frequency Table, Column Statistics, Trend Analysis, Geographic Maps, Reviews.
Table Statistics: Row counts for a given table by State, County, and Zip Code. Row counts represent original, unaltered data from the authoritative source.
What data profiling features are available on the CDW website (cont’d)?
Frequency Table: Frequencies (row counts) for unique values of a discrete (categorical) field for a specific time period. Cumulative statistics are generated.
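A frequency table with cumulative statistics, as described above, might be computed along these lines. This is a generic Python sketch, not the CDW website's implementation; the output field names, the rounding, and the sample codes are assumptions.

```python
from collections import Counter

def frequency_table(values):
    """Return one row per unique value with count, percent, and running totals."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append({
            "value": value,
            "count": counts[value],
            "percent": round(100 * counts[value] / total, 1),
            "cum_count": running,
            "cum_percent": round(100 * running / total, 1),
        })
    return rows

# Hypothetical categorical codes.
table = frequency_table(["S", "M", "M", "H", "M", "S", "S", "S"])
for row in table:
    print(row)
# {'value': 'H', 'count': 1, 'percent': 12.5, 'cum_count': 1, 'cum_percent': 12.5}
# {'value': 'M', 'count': 3, 'percent': 37.5, 'cum_count': 4, 'cum_percent': 50.0}
# {'value': 'S', 'count': 4, 'percent': 50.0, 'cum_count': 8, 'cum_percent': 100.0}
```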
What data profiling features are available on the CDW website (cont’d)?
Column Statistics: Basic distributional statistics for a given time period and filter condition. Users can drill down on unique values of a discrete (categorical) field.
What data profiling features are available on the CDW website (cont’d)?
Trend Analysis: Basic statistics over time and by unique values of a discrete (categorical) field. Up to five years of data can be displayed.
What data profiling features are available on the CDW website (cont’d)?
Geographic Maps: U.S., State, and County level.
What data profiling features are available on the CDW website (cont’d)?
Data Reviews: Comments on specific data fields (columns) posted by users.
CDW Data Alerts
What types of alerts are available on the CDW website?
Data Alert Types: New Metadata, Updated Metadata, Data Augmentation, Standardization, Accuracy Issues, Data Corrections.
Lessons Learned 2
• The more metadata we provide, the greater the demand.
• Expectations of up-to-date metadata for all fields in CDW.
• Metadata development and maintenance is very time consuming and requires skilled resources.
• Better Metadata = Fewer Questions.
CDW: Data Analysis Tools
Large data volume means choosing the right tool for the right job
Database Greater Than: Data Warehouse Definition
10^6 (1 MB): Tiny
10^9 (1 GB): Small
10^10 (10 GB): Big
10^11 (100 GB): Large
10^12 (1 TB): Very Large
10^13 (10 TB): Huge
10^14 (100 TB): Massive
10^15 (1 PB): Ridiculous* (CDW is in this range)
* Attributed to Ed Wegman, Center for Computational Statistics, George Mason University
The larger the amount of data being analyzed, the greater the efficiencies from remote computing and using SQL or SQL-based products.
Tools for Data Quality and Analysis
Analysis Tool: Functional Characteristic
SQL
► Easy-to-use database language consisting of a few basic commands that provides fast and efficient retrieval, summarization, and processing of data
► Virtually all ad hoc or descriptive queries performed by Research users are conducive to SQL
Hyperion
► Web-enabled, point-and-click tool for query, analysis, and report-writing activities
► Ability to interactively create pivots, charts, and user-defined table views
► Server-side implementation for maximum performance
SAS
► End-to-end data management and business intelligence software package supporting a wide range of functionality for data management, statistical analysis, econometrics and time series, operations research, GIS, data visualization, data mining, database connectivity, web and application development, and other business and analytical needs
► Server-side processing available to support analysis of very large databases
► Most widely used all-purpose analysis software among federal statistical agencies
► DataFlux product purchased to assist with Data Quality Assessment
SPSS
► Data analysis tool widely used in the Research community that can be used to retrieve data directly from CDW databases without any intermediate steps
Other
► Microsoft Access, Excel, and other third-party tools can be used for data retrieval, but are not suited for processing data on the server
Quality Assessment
Rule Failures Reported by Column
Rules, # Errors, # Rows, Error Rate
Rule 1: (Valid Values)
Rule 2: (Sign Test)
Rule 3: (Range Test)
Rule 4: (Legislative)
Rule n: (Others)
Column Name YEAR
* Rules Require Metadata to Write
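A report of this shape (errors, rows, and error rate per rule for one column) can be produced by evaluating each rule as a pass/fail test on every value. The Python sketch below is illustrative only: the two rules, their thresholds, and the sample amounts are invented, and real rules would be written from the column's metadata.

```python
def assess_column(values, rules):
    """Apply each (name, passes) rule to every value; report errors and error rate."""
    n_rows = len(values)
    report = []
    for name, passes in rules:
        errors = sum(1 for v in values if not passes(v))
        rate = errors / n_rows if n_rows else 0.0
        report.append({"rule": name, "errors": errors, "rows": n_rows, "error_rate": rate})
    return report

# Hypothetical amounts and rules; the thresholds are assumptions.
amounts = [100, -5, 250, 1_000_000, 0]
rules = [
    ("Sign Test", lambda v: v >= 0),
    ("Range Test", lambda v: 0 <= v <= 500_000),
]

report = assess_column(amounts, rules)
for line in report:
    print(line)
# {'rule': 'Sign Test', 'errors': 1, 'rows': 5, 'error_rate': 0.2}
# {'rule': 'Range Test', 'errors': 2, 'rows': 5, 'error_rate': 0.4}
```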
Current Status and Future Plans
• Data Integration
• Data Profiling
• Rule Development
• Quality Assessment
• Continuous Monitoring
• Data Stewardship
Metadata is important for all of these activities
DQ Journey – Lessons Learned Along the Way
Without Metadata, Data Management to properly serve a customer community of researchers is impossible.
Acknowledgements
• Jeff Butler, Director, Research Databases, IRS RAS
• CDW Data Quality Team (Headquarters, Field, and Contractors)
• Lwanga Yonke, Information Quality Process Manager, Aera Energy LLC and Advisor to Board of Directors, IAIDQ
Data Quality Journey at IRS RAS – The Importance of Metadata
Robin Rappaport, CAP
Senior Operations Research Analyst
Internal Revenue Service (IRS)
Research, Analysis, and Statistics (RAS)
RAS:DM:RDD:TRD
1111 Constitution Avenue, N.W. K
Washington, D.C. 20001
2014 International Data Quality Summit