Guerrilla Analytics - Introduction and Case Study
Transcript of Guerrilla Analytics - Introduction and Case Study
#GuerrillaAnalytics http://guerrilla-analytics.net 1
Guerrilla Analytics
Introduction and Case Study
Enda Ridge, PhD
Copyright Enda Ridge 2015
What we are told about Data Science
“the sexy job in the next 10 years will be statisticians”
“Data Scientist: The Sexiest Job of the 21st Century”
“Information is the oil of the 21st century, and analytics is the combustion engine.”
http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery
The Reality
Hi, we need an update on the insurance policy classification work. It’s going to the Head of Underwriting this afternoon.
Um. Which work? I think Jo did that but Jo’s on holidays.
I’ll check my mailbox and send you my spreadsheet from last week.
Err... the population changed with the extra system extract on Tuesday.
And we added a bunch of business rules to accommodate that...
so we can’t go back to the earlier numbers.
Those were the droids I was looking for ...
My Journey to Guerrilla Analytics
Mechanical Engineer
PhD Computer Science
Boutique Consultancy
Forensic Data Analytics
Senior Manager
Constraints
Constraints + Dynamic, Reproducible
Constraints + Dynamic, Reproducible + Tested
Constraints + Dynamic, Reproducible + Tested + Audit
Common format
Data → Analytics → Insight
Misconception
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22
Reality is Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Maintain Data Provenance
Maintaining Data Provenance mitigates disruptions
7 Principles of Guerrilla Analytics
1. Space is cheap, confusion is expensive
2. Prefer simple, visual project structures
3. Prefer automation with program code
4. Link data on the file system, analytics environment, and work products
5. Version control data and code
6. Consolidate team knowledge in builds
7. Prefer code that runs end to end
~100 practice tips
Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Guerrilla Analytics Case Study
Client: Retail Bank
Situation: Error in credit card customer mailing process. Failure to comply with regulations, potential fines.
Mission:
• Understand system landscape & get the right data
• Rebuild full customer history
• Identify system errors and start of non-compliance
• Quantify affected customers and cost to bank
Timeline: 6-8 weeks
System Landscape
Customer Contact
Card System 1 Card System 2
Collections Manual Intervention
Data Receipt
Guerrilla Analytics Environment
• Lost data
• Multiple copies of data
• Limited supporting information
• Local copies of data
• Renamed data
• ä~ delimited data
Data Receipt
Guerrilla Analytics Approach
• Have 1 data location
• Data unique identifiers
• Data log
• Supporting material near data
• Never modify the data
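The data receipt approach above could be sketched in Python roughly as follows. This is a minimal illustration, not the author's tooling: the function name receive_data, the data_log.csv file, its columns, and the SHA-256 checksum are all assumptions made for the example.

```python
import csv
import hashlib
import shutil
from datetime import date
from pathlib import Path


def receive_data(source: Path, data_root: Path, data_id: str, note: str) -> Path:
    """Copy a delivered file, unmodified, into data_root/<data_id>/ and log it."""
    target_dir = data_root / data_id
    # Refuse to reuse an identifier: every delivery gets its own folder.
    target_dir.mkdir(parents=True, exist_ok=False)
    target = target_dir / source.name  # keep the original file name
    shutil.copy2(source, target)

    # A checksum lets the team show later that the stored copy was never modified.
    checksum = hashlib.sha256(target.read_bytes()).hexdigest()

    log_path = data_root / "data_log.csv"  # hypothetical log location
    is_new_log = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new_log:
            writer.writerow(["data_id", "file_name", "received", "sha256", "note"])
        writer.writerow([data_id, source.name, date.today().isoformat(), checksum, note])
    return target
```

Calling receive_data(Path("FNU810A"), Path("Data"), "D040", "card system extract") would store the file under Data/D040/ and append one row to the log. Supporting material (emails, data dictionaries) could sit in the same D040 folder.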
Data Load
File System
Crazy-name spreadsheet 1
Crazy-name spreadsheet 2
Crazy-name spreadsheet 3
FNU810A
long_named_file_v0.2.1.pdf
Analytics Environment
Credit_Card_Samples
DBO.Accounts
Customer_Letters
Guerrilla Environment
• Renamed files
• Scattered, inconsistent locations
• Multiple versions of files
• Replacements of files
Data Load
Data
Crazy-name spreadsheet 1
Crazy-name spreadsheet 2
FNU810A
long_named_file_v0.2.1.pdf
Analytics Environment
D010.Crazy-name spreadsheet 1
D026.Crazy-name spreadsheet 2
D040.FNU810A
D051.long_named_file_v0.2.1.pdf
Guerrilla Analytics Approach
• One-to-one mapping from files to datasets
– Keep crazy names
• Minimize prep work
• Put the Data Identifier in the path
Data identifiers: D010, D026, D040, D051
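The one-to-one mapping from files to datasets could be automated with a small helper such as the sketch below. The folder layout Data/<identifier>/<file> and the pattern "D followed by digits" are assumptions for illustration. The slide only shows the resulting names.

```python
from pathlib import PurePosixPath


def dataset_name(file_path: str) -> str:
    """Build '<data id>.<file name>' from a path like 'Data/D010/<file>'."""
    path = PurePosixPath(file_path)
    data_id = path.parent.name  # the folder holding the file is the data identifier
    if not (data_id.startswith("D") and data_id[1:].isdigit()):
        raise ValueError(f"no data identifier in path: {file_path}")
    # Keep the file's own name untouched so the dataset traces back to the file.
    return f"{data_id}.{path.name}"


print(dataset_name("Data/D010/Crazy-name spreadsheet 1"))
print(dataset_name("Data/D051/long_named_file_v0.2.1.pdf"))
```

Because the name is derived rather than typed by hand, a dataset in the analytics environment always points back to exactly one file on the file system.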
Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Analytics: Guerrilla Analytics Environment
My Documents/Transactions
Accounts_Formatted.SQL
TransProf_FINAL.R
Trans_DO_NOT_USE.R
TransProf_v2.R
Sample_accounts.SQL
• Many code files/languages
• Variety of output types
• Data manipulation
– on file system
– in analytics environment
• Combinations of tools
• Many users
• Many iterations
Analytics: Guerrilla Analytics Approach
• One folder for all team work products
• Give every work product an identifier
• Keep a work product log
• Clear running order of files
• No dead/orphaned files
Work_Products
Work_products.xls
WP_024
• 010_Accounts_Cleaned.SQL
• 030_Transaction_Profiles.R
• 050_Sample_accounts.SQL
WP_96
WP_97
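One way to hand out work product identifiers and keep the log in step is sketched below. It is an illustration only: the helper names, the three-digit padding, and the CSV log (in place of the spreadsheet shown on the slide) are assumptions.

```python
import csv
from pathlib import Path


def new_work_product(root: Path, owner: str, description: str) -> Path:
    """Create the next WP_nnn folder under root and record it in the log."""
    root.mkdir(parents=True, exist_ok=True)
    existing = [int(p.name[3:]) for p in root.glob("WP_*") if p.name[3:].isdigit()]
    wp_id = f"WP_{max(existing, default=0) + 1:03d}"
    folder = root / wp_id
    folder.mkdir()

    log_path = root / "work_products.csv"  # assumption: CSV rather than .xls
    is_new_log = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new_log:
            writer.writerow(["wp_id", "owner", "description"])
        writer.writerow([wp_id, owner, description])
    return folder


def running_order(folder: Path) -> list[str]:
    """List a work product's scripts in the order set by their numeric prefixes."""
    scripts = [p.name for p in folder.iterdir() if p.is_file() and p.name[:3].isdigit()]
    return sorted(scripts)
```

With three-digit prefixes such as 010, 030, 050, a plain sort gives the running order, and the gaps leave room to slot in a new step later without renaming the others.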
Analytics: Guerrilla Analytics Approach
• Keep older versions in a subfolder
• Keep related information in a subfolder
WP_024
010_Accounts_Cleaned.SQL
030_Transaction_Profiles.R
050_Sample_accounts.SQL
supporting
archive
Analytics: Guerrilla Analytics Approach
File System
WP_024
010_Accounts_Cleaned.SQL
030_Transaction_Profiles.R
050_Sample_accounts.SQL
Analytics Environment
WP_024.ACCOUNTS_CLEANED
WP_024.TRANSACTION_PROFILES
WP_024.SAMPLE_ACCOUNTS
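The link between a script on the file system and its dataset in the analytics environment can be expressed as a naming rule. The sketch below reproduces the names shown above. The function name and the regular expression are assumptions, not part of the original slides.

```python
import re
from pathlib import PurePosixPath


def work_product_dataset(script_path: str) -> str:
    """Map 'WP_024/010_Accounts_Cleaned.SQL' to 'WP_024.ACCOUNTS_CLEANED'."""
    path = PurePosixPath(script_path)
    wp_id = path.parent.name  # the work product folder becomes the namespace
    # Drop the extension and the numeric running-order prefix, e.g. '010_'.
    stem = re.sub(r"^\d+_", "", path.stem)
    return f"{wp_id}.{stem.upper()}"


for script in ("WP_024/010_Accounts_Cleaned.SQL",
               "WP_024/030_Transaction_Profiles.R",
               "WP_024/050_Sample_accounts.SQL"):
    print(script, "->", work_product_dataset(script))
```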
Data Manipulation: Guerrilla Environment
Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type
A           15            30.00        1               50.00   Expense
A           15            30.00        2               25.00   Expense
A           15            30.00        3               -75.00  Payment
A           15            30.00        4               20.00   Expense
ID  Stmnt_ID  Min_Payment  Balance  Min_Paym_Made
A   15        30.00        20.00    No
... ... ...
Data Manipulation: Guerrilla Analytics Approach
Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type     RunningPayments  Min_Paym_Made
A           15            30.00        1               50.00   Expense  0.00             No
A           15            30.00        2               25.00   Expense  0.00             No
A           15            30.00        3               -75.00  Payment  75.00            Yes
A           15            30.00        4               20.00   Expense  75.00            Yes
Account_ID  Statement_ID  Min_Payment  Transaction_ID  Amount  Type
A           15            30.00        1               50.00   Expense
A           15            30.00        2               25.00   Expense
A           15            30.00        3               -75.00  Payment
A           15            30.00        4               20.00   Expense
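The tables above keep every original row and column and only append derived columns. A minimal Python sketch of that idea follows. The reading of RunningPayments as the cumulative payments per statement, and of Min_Paym_Made as "running payments have reached the minimum payment", is an assumption inferred from the numbers on the slide.

```python
def make_row(transaction_id, amount, kind):
    """Build one transaction row for account A, statement 15 (the slide's data)."""
    return {"Account_ID": "A", "Statement_ID": 15, "Min_Payment": 30.00,
            "Transaction_ID": transaction_id, "Amount": amount, "Type": kind}


transactions = [
    make_row(1, 50.00, "Expense"),
    make_row(2, 25.00, "Expense"),
    make_row(3, -75.00, "Payment"),
    make_row(4, 20.00, "Expense"),
]


def add_running_payments(rows):
    """Return copies of rows with RunningPayments and Min_Paym_Made appended."""
    totals = {}  # (Account_ID, Statement_ID) -> payments seen so far
    result = []
    order = lambda r: (r["Account_ID"], r["Statement_ID"], r["Transaction_ID"])
    for row in sorted(rows, key=order):
        key = (row["Account_ID"], row["Statement_ID"])
        if row["Type"] == "Payment":
            # Payments are stored as negative amounts in the slide's data.
            totals[key] = totals.get(key, 0.0) - row["Amount"]
        running = totals.get(key, 0.0)
        made = "Yes" if running >= row["Min_Payment"] else "No"
        # Original columns are copied unchanged; derived columns are added at the end.
        result.append({**row, "RunningPayments": running, "Min_Paym_Made": made})
    return result


for row in add_running_payments(transactions):
    print(row["Transaction_ID"], row["RunningPayments"], row["Min_Paym_Made"])
```

Since no rows are dropped or aggregated, a reviewer can check each derived value against the transaction it sits beside.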
Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Reporting – what is a report?
Reporting – Guerrilla Analytics Environment
Reporting – Guerrilla Analytics Environment
Select min/max of transaction_time
WP_030
• 010_Late payments.SQL
• 030_Late payments.py
WP_042
Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Why consolidate?
Raw
Duplicates
Customers Clean_Cust
Deduped New_dupes
Work Product
Why consolidate?
Raw
Duplicates
Customers Clean_Cust
Deduped New_dupes
Duplicates_02
Customers_02
Duplicates
Deduped Clean_cust New_dupes
Work Product
Guerrilla Analytics Approach: Builds
Deduped
Clean_cust
New_dupes
Duplicates_02
Duplicates
Customers_02
Dupes_latest
Cust_Latest
Raw Latest Clean Rules Interface
Version Controlled Code and Data
WP_030
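A build can be thought of as one piece of version-controlled code that runs end to end from the raw data to the datasets that work products read. The sketch below is a loose illustration of that idea only. What each step does (taking the latest delivery, cleaning names, a same-name duplicate rule) and the record fields id and name are hypothetical and are not taken from the case study.

```python
def latest(raw_versions):
    """Keep, per customer id, the record from the most recent data delivery."""
    newest = {}
    for version in sorted(raw_versions):  # e.g. delivery 1, then delivery 2
        for record in raw_versions[version]:
            newest[record["id"]] = record
    return list(newest.values())


def clean(records):
    """Normalise names so that duplicates become comparable."""
    return [{**r, "name": " ".join(r["name"].split()).upper()} for r in records]


def dedupe(records):
    """Hypothetical business rule: records with the same cleaned name are one customer."""
    seen = {}
    for record in records:
        seen.setdefault(record["name"], record)
    return list(seen.values())


# Fixed running order, kept under version control alongside the data.
BUILD_STEPS = [latest, clean, dedupe]


def run_build(raw_versions):
    """Run every step end to end; work products read only this result."""
    data = raw_versions
    for step in BUILD_STEPS:
        data = step(data)
    return data
```

When a new extract arrives, it is added as another delivery and the whole build is rerun, rather than creating tables such as Customers_02 and Duplicates_02 by hand.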
Summary
A Realistic Workflow
• Guerrilla Analytics Principles
• Guerrilla Analytics Practice Tips
Case Study
• Data receipt and load
• Analytics
• Reporting and work products
• Consolidation with Builds
Why Data Science is Difficult
• Disruptions, Constraints
• These break Data Provenance
Those were the droids I was looking for ...
Keep in Touch!
@Enda_Ridge
http://guerrilla-analytics.net
Or contact me for 50% discount