From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Jaime Fitzgerald, President, Fitzgerald Analytics, Inc. Alex Hasha, Chief Data Scientist, Bundle.com May 1, 2012
Architects of Fact-Based Decisions™
From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records © 2012 Fitzgerald Analytics, Inc. All Rights Reserved
Agenda for Today’s Talk
1. The Business Model
2. The Text Analytics Challenge
3. How We Overcame the Challenge
4. Key Takeaways
5. Q&A
Introduction
Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
• Responsible for: transforming data into value for clients, and creating meaningful careers for employees
• At a company that: helps clients convert Data to Dollars™ and brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
• Also working on: democratizing analytics by reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and government

Alex Hasha, Chief Data Scientist @ Bundle Corp (@AlexHasha)
• Responsible for: leading development of data products, and designing statistical methods and algorithms that transform data into insights for consumers
• At a company that: uses data to help consumers make better decisions with their money, bends valuable legacy data to new purposes, and is growing and hiring!
• Also working on: learning about and implementing best practices for managing complex data pipelines
The Local Search Business
Gaps in Local Search Offerings
• Paid advertisements are not trusted
• User reviews can be biased: selection bias, and they can be gamed
• Not personalized (to you)
Bundle’s Unique Contribution
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households
Example: Credit Card Statement Data
A Screen Shot From our Site
We Do This with Billions of Real Spending Records
Key issues with this data:
1. Credit card data lacks a merchant identifier.
2. So we rely on text analytics to associate transactions with merchants.
Building our “Version of the Truth” from 3 Sources: Localeze, Factual, and Our Transaction Data
Pros:
• Proprietary, differentiated “special sauce”
• High quality, clean / verified
• Crowd-sourced, up to the minute
Cons:
• Semi-structured, incomplete
• Lag / recency
• More variability in quality
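The talk does not spell out how the three sources are reconciled into one “version of the truth”; one common approach is a field-level priority merge, sketched below. The field names, values, and source priority order are all illustrative assumptions, not Bundle’s actual logic.

```python
# Assumed field-priority merge: for each attribute, take the value from
# the most trusted source that has it. Priorities here are illustrative.
sources = {
    "localeze": {"phone": "718-555-0100", "address": "123 Main St"},
    "factual": {"phone": "718-555-0199", "category": "Restaurant"},
    "transactions": {"category": "Dining", "last_seen": "2012-04-28"},
}
priority = ["localeze", "factual", "transactions"]

truth = {}
for name in reversed(priority):  # lowest priority first...
    truth.update(sources[name])  # ...so higher-priority sources overwrite
print(truth["phone"])     # 718-555-0100 (Localeze wins)
print(truth["category"])  # Restaurant (Factual beats transaction data)
```

This keeps the clean, verified sources authoritative for contact fields while still letting transaction data contribute attributes the listings lack.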
Data: Not Useful Until Refined.
Key Steps in “Refinement” (Transformation)
Inputs:
• Card transaction data
• Merchant listings (e.g., address, phone number, business type)
• Other data: Census, Bureau of Labor Statistics, user feedback
Transformations: linking, normalization, aggregation, clustering
Old data transformed in new ways, to create new features such as:
• Data-driven reviews from an array of customer segments
• “People who shop here also like…”
• The Bundle Loyalty Score
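The linking and aggregation steps can be sketched in miniature. The record fields, amounts, and the hard-coded link table below are illustrative assumptions (in the real pipeline, linking is the fuzzy-matching process described later, not a lookup table).

```python
from collections import defaultdict

# Toy transactions: (raw description, amount). In reality these come
# from billions of card records.
transactions = [
    ("ANTHONYS RESTAURANT #123 BRKLY NY", 42.50),
    ("ANTHONY'S RESTAURANT BROOKLYN", 18.00),
    ("JOES PIZZA NYC", 12.25),
]

# Linking: map each raw description to a canonical merchant
# (a hypothetical stand-in for the real fuzzy-matching step).
links = {
    "ANTHONYS RESTAURANT #123 BRKLY NY": "Anthony's Restaurant",
    "ANTHONY'S RESTAURANT BROOKLYN": "Anthony's Restaurant",
    "JOES PIZZA NYC": "Joe's Pizza",
}

# Aggregation: total spend and visit count per merchant, the raw
# material for features like reviews or a loyalty score.
totals = defaultdict(lambda: {"spend": 0.0, "visits": 0})
for desc, amount in transactions:
    merchant = links[desc]
    totals[merchant]["spend"] += amount
    totals[merchant]["visits"] += 1

print(totals["Anthony's Restaurant"])  # {'spend': 60.5, 'visits': 2}
```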
Before the Fun Stuff Happens…
Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list, via text matching between:
• Credit card transactions (billions, ~10^9), with highly variable text descriptions, noisy geographic info, and noisy merchant category info
• A comprehensive listing of US merchants (tens of millions, ~10^7)
Two main problems:
1. Accurate fuzzy matching is difficult.
2. The scale of the data is enormous: a naïve item-by-item search takes on the order of 10^16 expensive string comparisons. Too slow!
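The talk does not name a specific similarity measure; as a minimal illustration of fuzzy matching a noisy transaction string against merchant names, here is a sketch using Python’s standard-library difflib (the merchant names are made up).

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how alike two strings are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

merchants = ["Anthony's Restaurant", "Antonio's Pizzeria", "Joe's Diner"]
txn = "ANTHONYS RESTAURANT #123 BRKLY NY"

# Score the noisy transaction string against every merchant name
# and keep the best-scoring candidate.
best = max(merchants, key=lambda m: similarity(txn, m))
print(best)  # Anthony's Restaurant
```

Each such comparison is cheap on its own, but doing it for every (transaction, merchant) pair is exactly the 10^16-comparison problem above.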
A “Brute Force” Approach Would Never Work…
(Chart: processing time / workload vs. number of merchants in the comparison set, rising steeply from hundreds, to hundreds of thousands, to tens of millions as the scope widens from neighborhood, to city, to nation.)
1. Matching within hundreds of millions of merchants would require massive processing… Fortunately, we don’t need to match at this level.
2. By batching at the local area, we process orders of magnitude faster.
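The effect of local-area batching on the comparison count can be shown with simple arithmetic. The counts below are the rounded figures from the talk; the batch count of 10,000 neighborhoods is an assumption for illustration.

```python
# Illustrative arithmetic for geographic batching ("blocking").
transactions = 10**9  # billions of card transactions
merchants = 10**7     # tens of millions of US merchants

# Naive all-pairs comparison across the whole nation:
naive = transactions * merchants
print(f"{naive:.0e}")  # 1e+16 string comparisons

# Batch into 10,000 neighborhoods (an assumed batch count), so
# comparisons only happen within a batch:
batches = 10_000
batched = batches * (transactions // batches) * (merchants // batches)
print(f"{batched:.0e}")  # 1e+12, four orders of magnitude fewer
```

Batching alone buys four orders of magnitude here; combined with the deduplication and clustering steps on the next slide, the talk reports an overall 10^8 speedup.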
Solution to Scaling Problem
This is a “cascade of scale reductions”, parallelizing by location:
1. Start with credit card transactions (billions, ~10^9).
2. Batch transactions by geographic neighborhood.
3. Dedupe description strings.
4. Text clustering (not matching): consolidate strings belonging to the same merchant, yielding a preliminary merchant listing generated directly from transactions (tens of millions, ~10^7).
5. A secondary fuzzy matching process reconciles the preliminary listings with the merchant “source of truth”.
6. Produce the final merged transaction data set.
Keys to solving the scaling problem:
1. Scale reduction / parallelized text clustering
2. Free open-source software
Computational efficiency increased by a factor of 10^8! Eons -> Days -> Minutes
Data Preparation: Phase 1
Through a DAMA lens: matching (strings), deduping (×10), cleansing
Through a machine learning lens: unsupervised learning, text clustering, pattern discovery
Example: “Anthonys Restaurant #123 Brkly NY” -> “Anthony’s Restaurant”
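A minimal sketch of this dedupe-and-cleanse step, clustering raw description strings by a normalized key. The normalization rules (stripping store numbers, a tiny abbreviation table) are illustrative assumptions; a production pipeline would use far richer rules and fuzzy clustering rather than exact key matching.

```python
import re
from collections import defaultdict

def normalize(desc: str) -> str:
    """Collapse a noisy card-statement string to a canonical key."""
    s = desc.upper()
    s = re.sub(r"#\d+", "", s)     # strip store numbers like "#123"
    s = re.sub(r"[^A-Z ]", "", s)  # drop punctuation and digits
    # Assumed abbreviation table; a real one would be much larger.
    for abbr, full in {"BRKLY": "BROOKLYN", "NY": "NEW YORK"}.items():
        s = re.sub(rf"\b{abbr}\b", full, s)
    return " ".join(s.split())

descriptions = [
    "Anthonys Restaurant #123 Brkly NY",
    "ANTHONYS RESTAURANT BRKLY NY",
    "Anthonys Restaurant #456 Brkly NY",
]

# Cluster: strings sharing a normalized key belong to one merchant.
clusters = defaultdict(list)
for d in descriptions:
    clusters[normalize(d)].append(d)

print(len(clusters))  # 1: three raw strings collapse to one merchant
```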
Data Preparation: Phase 2
Through a DAMA lens: record linkage, data quality enhancement, deduping (+30%), more cleansing, data enrichment
Through a machine learning lens: information retrieval, supervised classifier
The process:
1. Search retrieves the top 10 possible matches.
2. A classifier is applied to each, returning a confidence score.
3. If confidence is high, the records are linked.
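The retrieve-then-classify loop above can be sketched as follows. The talk does not specify the classifier, so a plain string-similarity ratio stands in for it here, and the 0.8 confidence threshold is an assumed value.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    # Stand-in "classifier": a plain string-similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(record, master, top_k=10, threshold=0.8):
    """Retrieve top-k candidates, score each, link only if confident."""
    candidates = sorted(master, key=lambda m: score(record, m), reverse=True)[:top_k]
    best = candidates[0]
    return best if score(record, best) >= threshold else None

master = ["Anthony's Restaurant", "Antonio's Pizzeria", "Joe's Diner"]
print(link("Anthonys Restaurant", master))    # Anthony's Restaurant
print(link("Totally Unrelated LLC", master))  # None (low confidence)
```

Returning None on low confidence is the key design point: an unlinked record is recoverable later, while a wrongly linked one silently corrupts every downstream aggregate.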
Takeaways
1. Tame your data before perfecting your methods: efficiency enables experimentation, iteration, and improvement.
2. Design your process to minimize unnecessary complexity (e.g., parallel processing at scale, normalization, pre-filtering).
3. Tools: take advantage of powerful (and inexpensive) open-source tools that enable your process.