Become a Big Data Quality Hero
T8 Concurrent Class
10/3/2013 11:15:00 AM
"Become a Big Data Quality Hero"
Presented by:
Jason Rauen
LexisNexis
Brought to you by:
340 Corporate Way, Suite 300, Orange Park, FL 32073
888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com
Jason Rauen
LexisNexis
Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk Solutions. With
more than fifteen years of experience, Jason has led the big data testing team from its
inception. He has presented big data scripting techniques at the HPCC Systems national Data
Summit. His background includes working at Microsoft, AT&T, and LexisNexis, and
instructing at Intel, Boeing, Executrain, and the Department of the Navy.
9/19/2013
Interesting Quotes…
“Quality isn’t measured by how many clients you obtain; it’s measured by how many clients you retain.”
“QA isn’t the bottom of the totem pole; it’s the dirt holding it up.”
Become a Big Data Quality Hero
A look inside QA for Big Data
Presented by 01001010 01100001 01110011 01101111 01101110 00100000 01010010 01100001 01110101 01100101 01101110 (Jason Rauen)
Overview
• Why Test and How it’s Different
– Issues
– Benefits
• Architecture and why you need to know
– HPCC Systems/Hadoop
– Know Your Data/Environment
• Strategies and Concepts
– What to look for
– Sample Gathering (AUB)
– Stats
– Profiling
Why Test and How it’s Different
Why Test Data:
• Traditional methods are not adequate: traditional sampling is scenario based and needs
improvement (not enough samples, human error, etc.)
• Tied into current environment
• Government regulatory compliance
• Auditing requirements
• Company-wide initiatives
Why Test and How it’s Different
Want to keep your customers?
Why Test and How it’s Different
• When?
o Testing - SDLC
o Routine Testing
o Frequency - Yearly/Monthly/Weekly/Daily/Hourly/On Demand
• What? Types of Testing
o New Project – Source to Target (Transform)
o Standard - Production Validation
o Emergency releases
• How?
o Using what you have available
o Freebies – Profiling tools, etc.
Why Test and How it’s Different
Issues:
• Lack of control
– Timing of builds
– Samples and location of samples
• 3rd Party Apps
– Lack of licenses, costs, training, and existing knowledge
• Extra hardware
• Upgrades
Why Test and How it’s Different
Benefits:
• Cost savings
• Better Coverage
– No Samples
– Increased Sampling
– Focused Samples
• Faster (Time is $)
• Quicker to diagnose issues
• Better Data Integrity
• Collaboration with other groups
Architecture and why you need to know
Typical Generic Architecture
[Diagram; surviving label: input DB]
Architecture and why you need to know
Data Fabrication Engines
• Hadoop HDFS and HPCC THOR
• Made of several nodes
• Where the ETL happens
• Where the Keys are made
Data Delivery Engines
• HPCC ROXIE, HBase, etc.
• Keys are moved to and referenced here
• Where the queries reside
Architecture and why you need to know
[Diagram labels: HDFS, Hadoop MapReduce, HBase]
Architecture and why you need to know
[Diagram labels: HDFS; Map, Shuffle, Reduce]
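The Map, Shuffle, and Reduce stages named on this slide can be illustrated with a small single-process sketch. This is not Hadoop code; it only mimics the three stages in plain Python to count records per key. The record layout and the second product code are hypothetical, with the field name borrowed from the later SQL/Pig/ECL slides.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["productcode"], 1


def shuffle_phase(pairs):
    """Shuffle: bring all values emitted for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample records, for illustration only.
records = [
    {"productcode": "R2D2C3PO"},
    {"productcode": "BB8"},
    {"productcode": "R2D2C3PO"},
]

counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data between them; knowing which stage produced a record helps when diagnosing where a count went wrong.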
Architecture and why you need to know
HPCC Systems
[Diagram labels: DISTRIBUTE/PROJECT/TRANSFORM, ROLLUP]
Strategies and Concepts
• What to look for
o Brand new, incomplete, or missing builds (Data Cops)
o Data progression: Today/Yesterday, Father key/Grandfather key
o Count of deltas in release/deploy
o Keys updated
o Missing keys/New keys
o Field validations, indexed and non-indexed
o Key layout issues
o Corruption: unprintable or invalid characters
o Duplicate records among new and existing records
o Data Fabrication Engine to Data Delivery Engine deploys/sync
o Queries with new data
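Two of the checks in this list, corruption (unprintable characters) and duplicate records, are easy to sketch. The following is a minimal Python illustration, assuming records arrive as dictionaries; the function names, field names, and sample rows are hypothetical and not taken from the talk.

```python
from collections import Counter


def find_unprintable(records, fields):
    """Return (row index, field name) for each value containing a non-printable character."""
    hits = []
    for index, record in enumerate(records):
        for field in fields:
            value = str(record.get(field, ""))
            # str.isprintable() is False for control characters such as NUL, tab, or newline.
            if any(not ch.isprintable() for ch in value):
                hits.append((index, field))
    return hits


def find_duplicates(records, key_fields):
    """Return each key-field combination that occurs more than once, with its count."""
    counts = Counter(tuple(record.get(f) for f in key_fields) for record in records)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows: row 1 has an embedded NUL, rows 0 and 2 are duplicates.
rows = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Bo\x00b"},
    {"id": "1", "name": "Alice"},
]

print(find_unprintable(rows, ["id", "name"]))
print(find_duplicates(rows, ["id", "name"]))
```

Running checks like these over every record, rather than over a hand-picked sample, is what makes the "better coverage" benefit possible.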
Strategies and Concepts
JOIN
• Sample gathering
• New Key for testing
• Deployment Validation - Data Fabrication
• Deployment Validation - Data Delivery
And get a free cookie…
Strategies and Concepts
AUB for JOIN
A = Left key (New)
B = Right key (Old)
Types of JOINS
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Minus or Left Only
Strategies and Concepts
AUB for JOIN
A = Left key (New)
B = Right key (Old)
[Venn diagram of A and B]
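The A and B regions of the Venn diagram map directly onto set operations. As a rough sketch, assuming each key can be reduced to a collection of key values (the real keys hold full records, so this only compares key values), the join types could be approximated in Python as follows. The function name and sample values are hypothetical.

```python
def compare_keys(new_key, old_key):
    """Split key values into the Venn regions for A (new key) and B (old key)."""
    a = set(new_key)
    b = set(old_key)
    return {
        "inner": a & b,        # in both A and B: unchanged key values
        "left_only": a - b,    # A minus B: values that are new in this build
        "right_only": b - a,   # B minus A: values missing from the new build
        "left_outer": a,       # everything in A, matched or not
        "right_outer": b,      # everything in B, matched or not
        "full_outer": a | b,   # A union B
    }


# Hypothetical key values for illustration.
new_build = ["k1", "k2", "k3"]
old_build = ["k2", "k3", "k4"]

regions = compare_keys(new_build, old_build)
print(sorted(regions["left_only"]))   # candidates for new-record samples
print(sorted(regions["right_only"]))  # candidates for missing-record checks
```

The left-only and right-only regions are usually the most useful for sample gathering, since they isolate exactly what changed between the old and new keys.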
Strategies and Concepts
Statistics: What you try to remember with this swimming
behind you.
Strategies and Concepts
Statistics:
• On data sets and keys
- Gives you a high level look at the release
- Ranges
- You’ll start to notice a trend line
• On Releases
- Done over time you’ll see the trend of new data sets and keys
- Done over time you’ll see the trend of changed or modified
data sets and keys
Strategies and Concepts
[Chart: values per release for release numbers 1 to 15, y-axis 0 to 400]
AVERAGE 175.4
CEILING 210.6
FLOOR 135.1
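A band like the average, ceiling, and floor on this chart can be computed from per-release counts. The slide does not say how its ceiling and floor were derived, so the sketch below assumes one common choice: the mean plus or minus k population standard deviations. The function names and the sample counts are hypothetical and are not the data behind the chart.

```python
from statistics import mean, pstdev


def release_band(counts, k=1.0):
    """Return (average, ceiling, floor), assuming a band of mean +/- k standard deviations."""
    average = mean(counts)
    spread = k * pstdev(counts)
    return average, average + spread, average - spread


def out_of_band(counts, k=1.0):
    """Return (release number, count) for releases outside the band; releases are numbered from 1."""
    _, ceiling, floor = release_band(counts, k)
    return [
        (number, count)
        for number, count in enumerate(counts, start=1)
        if count > ceiling or count < floor
    ]


# Hypothetical counts of changed keys per release.
changed_keys = [100, 100, 100, 100, 200]

average, ceiling, floor = release_band(changed_keys)
print(average, ceiling, floor)
print(out_of_band(changed_keys))
```

A release that falls outside the band is not necessarily wrong, but it is a prompt to ask why it differs from the trend.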
Strategies and Concepts
Data Profiling:
• Data Profiling Summary Report
• Data Profiling Field Detail Report
– http://www.hpccsystems.com/demos/data-profiling-demo
• Data Profiling Field Combination Report
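To give a feel for what a field-level profile contains, here is a minimal Python sketch. It is not the HPCC Systems profiling tool and does not reproduce its reports; the chosen metrics (fill rate, distinct count, minimum and maximum length), the function name, and the sample records are all assumptions for illustration.

```python
def profile_fields(records, fields):
    """Compute simple per-field statistics over a list of dictionary records."""
    total = len(records)
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        # Treat None and the empty string as "not filled".
        filled = [v for v in values if v not in (None, "")]
        lengths = [len(str(v)) for v in filled]
        report[field] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
            "min_len": min(lengths) if lengths else 0,
            "max_len": max(lengths) if lengths else 0,
        }
    return report


# Hypothetical sample records.
sample = [
    {"productcode": "R2D2C3PO", "state": "GA"},
    {"productcode": "BB8", "state": ""},
    {"productcode": "R2D2C3PO", "state": "FL"},
    {"productcode": "X1", "state": None},
]

for field, stats in profile_fields(sample, ["productcode", "state"]).items():
    print(field, stats)
```

Comparing such a profile between the previous and current build is one way to spot a field that suddenly stopped being populated or changed length.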
Strategies and Concepts
Data Profiling Summary Report
Strategies and Concepts
Data Profiling Field Detail Report
Strategies and Concepts
Data Profiling Field Combination Report
Strategies and Concepts
SQL
SELECT * FROM Products;
SELECT * FROM Products WHERE productcode = 'R2D2C3PO';
SELECT COUNT(*) FROM Products;
Pig
DUMP Products;
Products = FILTER Products BY productcode == 'R2D2C3PO';
DUMP Products;
Products = GROUP Products ALL;
Products = FOREACH Products GENERATE COUNT(Products);
DUMP Products;
ECL
Products;
Products(productcode = 'R2D2C3PO');
COUNT(Products);
Strategies and Concepts
SQL
SELECT * FROM Products ORDER BY productcode;
SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;
Pig
Products = ORDER Products BY productcode;
DUMP Products;
Products = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1;
DUMP Products;
ECL
SORT(Products, productcode);
JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);
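The SQL statements from these comparison slides can be tried without a cluster by using Python's built-in sqlite3 module. The sketch below runs the select, filter, count, and order-by queries against an in-memory table. The table contents are hypothetical, and the full outer join is left out because older SQLite versions do not support it.

```python
import sqlite3

# In-memory database with a hypothetical Products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (productcode TEXT, col1 TEXT)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [("R2D2C3PO", "a"), ("BB8", "b"), ("R2D2C3PO", "c")],
)

# SELECT * FROM Products;
all_rows = conn.execute("SELECT * FROM Products").fetchall()

# SELECT * FROM Products WHERE productcode = 'R2D2C3PO';
matches = conn.execute(
    "SELECT * FROM Products WHERE productcode = ?", ("R2D2C3PO",)
).fetchall()

# SELECT COUNT(*) FROM Products;
total = conn.execute("SELECT COUNT(*) FROM Products").fetchone()[0]

# SELECT * FROM Products ORDER BY productcode;
ordered = conn.execute("SELECT * FROM Products ORDER BY productcode").fetchall()

conn.close()
print(len(all_rows), len(matches), total, ordered[0])
```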
Summary
• Why Test and How it’s Different
• Architecture and why you need to know
• Strategies and Concepts
Questions?
Contact / Useful links
www.linkedin/in/jasonrauen
• HPCC Systems/ECL Links:
http://hpccsystems.com
http://hpccsystems.com/demos
• Hadoop/Pig Latin Links:
http://pig.apache.org
http://hadoop.apache.org
• SQL Links:
http://sql.org/
http://msdn.microsoft.com/en-US/sqlserver/default.aspx