Data Integrity Verification
Michael Kano, ACDA
IIA Orange County Chapter
November 13, 2015
Excel Transformed My Data!
BEFORE: 101122001XIOB00260002
AFTER:  10112210000000
Michael Kano
Senior Manager, Data Analytics
Sunera LLC
Michael is a Senior Manager with Sunera’s national data analytics practice. Michael has 20 years of experience in data analytics and internal audit with organizations in the USA, Canada, and Kuwait.
He has 20 years of experience with ACL software, including 8 years as the leader of ACL Services Ltd.’s global training team. During his tenure at ACL Services, Michael helped drive the training business to new levels of revenues and profits by actively supporting the Sales team in pre-sales discussions.
Michael’s most recent experience consists of four years with eBay, Inc.’s internal audit team as Manager, Audit Analysis. He was tasked with integrating data analytics into the audit workflow on strategic and tactical levels. This included developing quality and documentation standards, training users, and providing analytics support on numerous audits in the IT, PayPal, and eBay marketplaces business areas. He also provided support to non-IA teams such as the Business Ethics Office and Enterprise Risk Management teams.
During his years at eBay, Michael supported audits throughout the organization in the IT, compliance, operations, vendor management, revenue assurance, T&E, and human resources areas.
Michael also has 7 years of experience with Arbutus Software, and has managed the transition to Arbutus from other data analysis tools. He is a proficient user of Tableau, Microsoft Access, and Teradata SQL Assistant.
AGENDA
• Defining data integrity verification (DIV)
• Sources of integrity erosion
• File-level testing
• Field-level testing
Defining Data Integrity Verification
Data Integrity Verification (DIV)
• The process by which the data analyst tests
the data to determine whether it is acceptable
for analysis
• Tests should be carried out at both the file
level and the field level before conducting any
analytics.
The Risks of Integrity Erosion
• Lost time
• Incorrect conclusions
• Revenue/cost
• Security
• Professional standing
Evidence of data integrity erosion
• Missing records
• Excess records
• Duplicates
• Shifted fields
• Skewed records
• Blank/invalid entries in key fields
• Incorrect/invalid formatting
• Invalid characters in data
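Shifted fields and skewed records often show up as records with the wrong number of fields. The following is a minimal Python sketch of that check, not an ACL/Arbutus feature; the tab delimiter, the sample data, and the function names are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

def field_count_profile(text, delimiter="\t"):
    """Count how many records have each number of fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return Counter(len(row) for row in reader)

def suspect_records(text, expected_fields, delimiter="\t"):
    """Return (line_number, row) for records whose field count is off."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(n, row) for n, row in enumerate(reader, start=1)
            if len(row) != expected_fields]

# Hypothetical extract: line 3 has an extra delimiter, line 4 is missing a field.
sample = (
    "id\tvendor\tamount\n"
    "001\tAcme\t10.00\n"
    "002\tBolt\tCo\t25.50\n"
    "003\t7.25\n"
)

profile = field_count_profile(sample)
bad = suspect_records(sample, expected_fields=3)
for line_number, row in bad:
    print(line_number, row)
```

A clean file produces a single entry in the profile; more than one distinct field count is a signal to inspect the flagged lines before any analysis.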
Shifted Fields
Skewed Records
Sources of Integrity Erosion
Processing…
The Process
Sources of data integrity errors
• Miscommunication of requirements
• Extraction
• Conversion
• Transmission
• Import
• Manual edits
• Data definition
Miscommunication
• "All AP transactions between April and June,
including all important fields."
• "All AP payments and reversals between
4/1/2015 and 6/30/2015 (inclusive) including
the following fields: <field list>. The output
should be in a tab-delimited text file, and at
no point should it pass through a spreadsheet
or be opened in a spreadsheet application."
Conversion
• Dropping leading zeros (ID numbers)
• Converting date to numeric
• Removing alphas from alphanumeric field
• Use of delimiter that is included within a text
field
• Insertion of blank lines in Excel
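Two of these conversion problems are easy to reproduce. The sketch below uses Python's standard library with made-up sample values (the column names and IDs are hypothetical): it shows leading zeros disappearing when an identifier is treated as a number, and a comma inside a text field breaking a naive split.

```python
import csv
import io

# Hypothetical extract with zero-padded identifiers.
raw = "vendor_id,invoice_date\n00260002,2015-04-01\n00000417,2015-06-30\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Reading the column as text preserves the identifier as extracted.
ids_as_text = [r["vendor_id"] for r in rows]

# Forcing the same column to a number silently drops the leading zeros.
ids_as_number = [str(int(r["vendor_id"])) for r in rows]

changed = [(t, n) for t, n in zip(ids_as_text, ids_as_number) if t != n]
print(changed)

# A delimiter that also occurs inside a text field shifts every later field
# unless the field is quoted and the reader honors the quotes.
line = '1001,"Smith, John",250.00'
naive_fields = line.split(",")
parsed_fields = next(csv.reader([line]))
print(len(naive_fields), len(parsed_fields))
```

Comparing the text and numeric versions of a key field is a quick way to confirm whether a file has passed through a tool that re-typed it.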
Date Conversion
Manual Edits
• Inadvertent/deliberate editing
• How does that happen?
– Sorting
– Formatting
– Copy/pasting
Data Definition
• Record length
• Field position
• Formatting (date fields)
File-Level Testing
File-Level Testing
• Structure
• Content
Structure
• Review metadata
• Send table layout to a table in Arbutus/ACL
• Compare field type/length/format to
metadata
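The comparison of table layout to metadata can also be automated once both are available as data. This is a rough Python sketch under stated assumptions: the two dictionaries are hypothetical stand-ins for the documented metadata and the exported table layout, and `compare_layouts` is an illustrative helper, not an ACL/Arbutus command.

```python
# Hypothetical metadata supplied by the data owner: name -> (type, length).
expected_layout = {
    "vendor_id": ("character", 8),
    "amount": ("numeric", 10),
}

# Hypothetical layout exported from the analysis tool after import.
observed_layout = {
    "vendor_id": ("character", 8),
    "amount": ("character", 10),
    "notes": ("character", 40),
}

def compare_layouts(expected, observed):
    """List fields that are missing, unexpected, or defined differently."""
    issues = []
    for name, spec in expected.items():
        if name not in observed:
            issues.append((name, "missing from table"))
        elif observed[name] != spec:
            issues.append((name, f"expected {spec}, found {observed[name]}"))
    for name in observed:
        if name not in expected:
            issues.append((name, "not in metadata"))
    return issues

issues = compare_layouts(expected_layout, observed_layout)
for name, problem in issues:
    print(name, "->", problem)
```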
Content
• Completeness
– Run COUNT to document number of records
– Run TOTAL on numeric fields for control totals
• Uniqueness: Run DUPLICATES command
selecting all fields to identify duplicate
records
• Validity: Run VERIFY against numeric and
date fields
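For readers without ACL or Arbutus at hand, the three content tests can be approximated in a few lines of Python. This is only a sketch of the ideas behind COUNT, TOTAL, DUPLICATES, and VERIFY, not a reproduction of those commands; the records and field positions are invented for illustration.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation

# Hypothetical records: (invoice number, invoice date, amount as text).
records = [
    ("INV001", "2015-04-01", "100.00"),
    ("INV002", "2015-04-02", "250.50"),
    ("INV002", "2015-04-02", "250.50"),   # exact duplicate on all fields
    ("INV003", "2015-04-03", "12x.00"),   # corrupt numeric entry
]

# Completeness: record count to reconcile against the source system.
record_count = len(records)

# Validity: amounts that cannot be read as numbers are flagged, not totaled.
valid_amounts, invalid_rows = [], []
for row_number, rec in enumerate(records, start=1):
    try:
        valid_amounts.append(Decimal(rec[2]))
    except InvalidOperation:
        invalid_rows.append(row_number)

# Completeness: control total over the valid amounts.
control_total = sum(valid_amounts, Decimal("0"))

# Uniqueness: records that repeat across all fields.
duplicates = [rec for rec, n in Counter(records).items() if n > 1]

print(record_count, control_total, invalid_rows, duplicates)
```

The count and control total are only useful if they are compared with figures obtained independently from the source system.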
Field-Level Testing: Numerics
Numeric Fields: What to look for
• Field total
• Highest value
• Second-highest value
• Ratio of 2nd highest to highest
• Median
• Number of positives
• Number of corrupt entries
• Lowest value
• Average
• Range
• Absolute value
• Number of zeros
• Number of negatives
Testing Numeric Fields
• Run STATISTICS against all numeric fields
– Look for zeros, negatives, bounds,
highest/second-highest
• Recalculate computed values with computed
fields (e.g., Total_Amount = Price * Quantity)
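The numeric checks above can be sketched in Python as follows. This is an illustrative approximation, not the output of the STATISTICS command; the field names and sample lines are hypothetical, and the 0.005 tolerance for the recalculation is an assumption.

```python
from statistics import mean, median

def numeric_profile(values):
    """Summarize a numeric field: totals, bounds, and sign counts."""
    ordered = sorted(values)
    highest = ordered[-1]
    second = ordered[-2] if len(ordered) > 1 else None
    ratio = second / highest if second is not None and highest else None
    return {
        "total": sum(values),
        "lowest": ordered[0],
        "highest": highest,
        "second_highest": second,
        "ratio_2nd_to_highest": ratio,
        "average": mean(values),
        "median": median(values),
        "zeros": sum(1 for v in values if v == 0),
        "positives": sum(1 for v in values if v > 0),
        "negatives": sum(1 for v in values if v < 0),
    }

# Hypothetical invoice lines; the last one does not recompute.
lines = [
    {"price": 10.0, "quantity": 3, "total_amount": 30.0},
    {"price": 4.5, "quantity": 2, "total_amount": 9.0},
    {"price": 7.0, "quantity": 0, "total_amount": 0.0},
    {"price": 2.0, "quantity": 5, "total_amount": 12.0},
]

profile = numeric_profile([r["total_amount"] for r in lines])

# Recalculate the stored total and flag lines that disagree.
mismatches = [r for r in lines
              if abs(r["price"] * r["quantity"] - r["total_amount"]) > 0.005]

print(profile)
print(mismatches)
```

A low ratio of second-highest to highest value points to a single outlier worth checking before it distorts totals and averages.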
Scripted Solution
• Shows table/field names, and test date-time in a table
• Provides comprehensive, standard test results
• Faster and less error-prone than manual execution
• 2 million records, 4 numeric fields in ~45 seconds
• Also saves table layout for file with _TL suffix
Script Results: Numerics
Field-Level Testing: Dates
Date Fields: What to look for
• Oldest
• Most recent
• Span of valid dates
• Weekends
• Blanks
• Invalid non-blank dates
Testing Date Fields
• Run STATISTICS against all date fields
– Blanks/invalids/weekends
– Bounds
• Test related fields, e.g., PO_Date <=
Invoice_Date
• Test for completeness (24/7 data) with GAPS
command
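The date tests above can be approximated in Python as shown below. This is a sketch of the underlying checks rather than of the STATISTICS or GAPS commands; the sample dates, the ISO date format, and the PO/invoice pairs are assumptions for illustration.

```python
from datetime import date, datetime, timedelta

def parse(value, fmt="%Y-%m-%d"):
    """Return a date, or None when the text is not a valid date."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

# Hypothetical date column as extracted (text).
raw_dates = ["2015-04-01", "2015-04-02", "", "2015-04-04",
             "2015-02-30", "2015-04-07"]

blanks = [v for v in raw_dates if not v.strip()]
parsed = [(v, parse(v)) for v in raw_dates if v.strip()]
invalid = [v for v, d in parsed if d is None]
valid = sorted(d for _, d in parsed if d is not None)

# Bounds and weekend activity (Saturday = 5, Sunday = 6).
oldest, most_recent = valid[0], valid[-1]
weekends = [d for d in valid if d.weekday() >= 5]

# Completeness for data expected every day: list the missing days.
present = set(valid)
span = (most_recent - oldest).days
gaps = [oldest + timedelta(days=n) for n in range(span + 1)
        if oldest + timedelta(days=n) not in present]

# Related fields: a PO date should not fall after its invoice date.
pairs = [(date(2015, 4, 1), date(2015, 4, 3)),
         (date(2015, 4, 10), date(2015, 4, 8))]
out_of_order = [(po, inv) for po, inv in pairs if po > inv]

print(blanks, invalid, weekends, gaps, out_of_order)
```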
Blank Dates & Formatting
• If an entire date column appears blank, the date
format in the field definition is probably incorrect.
• Edit >> Table Layout to review and correct
format
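The same symptom can be demonstrated outside ACL/Arbutus: when the declared format does not match the data, no value parses. The Python sketch below tries a few candidate formats and reports the share of values each one reads; the sample values and candidate list are assumptions.

```python
from datetime import datetime

def parse_rate(values, fmt):
    """Fraction of values that parse as dates under the given format."""
    ok = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
            ok += 1
        except ValueError:
            pass
    return ok / len(values)

# Hypothetical raw date strings from the source file.
raw = ["04/01/2015", "06/30/2015", "05/15/2015"]
candidates = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

rates = {fmt: parse_rate(raw, fmt) for fmt in candidates}
for fmt, rate in rates.items():
    print(fmt, round(rate, 2))
```

A format that reads none of the values corresponds to the all-blank column; a format that reads only some of them, as day/month order does here, is a warning that dates may be silently misinterpreted.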
Formatting Date Fields
Dates: Scripted Solution
Field-Level Testing: Characters
Character Fields: What to look for
Item             Functionality
Blanks           ISBLANK(<key>)
Invalid entries  CLASSIFY ON <key>
                 CLASSIFY ON FORMAT(<key>)
Duplicates       DUPLICATES ON <key>
Character Fields: Formats
• Verify that format is valid
• May need to scrub
• PO numbers, customer IDs, phone numbers,
zip codes
• Use FORMAT() function in CLASSIFY to
display list of unique formats
CLASSIFY ON FORMAT(<field name>) TO "<output file>" OPEN
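The idea behind classifying on a format mask can be illustrated in Python. This is an approximation for readers without ACL or Arbutus, not the FORMAT() function itself, whose exact handling of edge cases may differ; the mask convention (x for lower-case letters, X for upper-case letters, 9 for digits, everything else unchanged) and the sample PO numbers are assumptions.

```python
from collections import Counter

def format_mask(value):
    """Replace each character with its class: 9 digit, X upper, x lower."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)  # blanks and special characters are kept as-is
    return "".join(out)

# Hypothetical PO numbers, including a few malformed ones.
po_numbers = ["PO-10023", "PO-10024", "po-10025", "PO 1002", ""]

format_counts = Counter(format_mask(v) for v in po_numbers)
for mask, count in format_counts.most_common():
    print(repr(mask), count)
```

The dominant mask is usually the valid format; the low-frequency masks are the entries to review or scrub.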
Output of CLASSIFY + Format()
• 1 record per format
• Shows frequency
x = lower-case alpha
X = upper-case alpha
9 = numeric
Blanks/special characters
Mitigating Integrity Risk
Key Items
• Know your data
• Obtain data independently (SQL?)
• Short chain from extraction to analysis
• Automated DIV
The Process
The New Process
Benefits
• Independence
• Confidence
• Shorter time
• Comprehensive DIV