From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality...

From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Jaime Fitzgerald, President, Fitzgerald Analytics, Inc. Alex Hasha, Chief Data Scientist, Bundle.com May 1, 2012 Architects of Fact-Based Decisions™


Jaime G Fitzgerald, Fitzgerald Analytics Alex Hasha, Bundle.com (a joint venture between Citi, Microsoft Money, and Morningstar)


Page 1: From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year


Page 2

2 From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records © 2012 Fitzgerald Analytics, Inc. All Rights Reserved

Agenda for Today’s Talk

1. The Business Model

2. The Text Analytics Challenge

3. How We Overcame the Challenge

4. Key Takeaways

5. Q&A

Page 3

Introduction

Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)

Responsible for:
• Transforming data into value for clients
• Creating meaningful careers for employees

At a company that:
• Helps clients convert Data to Dollars™
• Brings a strategic perspective to improve ROI on investments in technology, data, people, and processes

Also working on:
• Working to Democratize Analytics by Reducing the “Barrier to Benefit” for non-profits, social entrepreneurs, and gov’t

Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)

Responsible for:
• Leading development of data products
• Designing statistical methods / algorithms that transform data into insights for consumers

At a company that:
• Uses data to help consumers make better decisions with their money
• Bends valuable legacy data to new purposes
• Is growing and hiring!

Also working on:
• Learning about and implementing best practices for managing complex data pipelines

Page 4

The Local Search Business

Page 5

Gaps in Local Search Offerings

Paid Advertisement Not Trusted

User-Reviews Can be Biased

Selection Bias

Can be Gamed

Not Personalized (to you)

Page 6

Bundle’s Unique Contribution

Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households

Example: Credit Card Statement Data

Page 7

A Screen Shot From our Site

Page 8

A Screen Shot From our Site

Page 9

A Screen Shot From our Site

Page 10

We Do This with Billions of Real Spending Records

Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households

Example: Credit Card Statement Data

Key Issues with this Data:

1. Credit card data lacks merchant identifier

2. So we rely on text analytics to associate transactions with merchants

Page 11

Building our “Version of the Truth” from 3 sources

Our Transaction Data
Pros: Proprietary, Differentiated, Special Sauce
Cons: Semi-Structured, Incomplete

Localeze
Pros: High Quality, Clean / Verified
Cons: Lag / Recency

Factual
Pros: Crowd Sourced, Up to the Minute
Cons: More variability in quality

Page 12

Data: Not Useful Until Refined.

Page 13

Key Steps in “Refinement” (Transformation)

Inputs:
• Card Transaction Data
• Merchant Listings (e.g., Address, Phone Number, Business Type)
• Other Data: Census, Bureau of Labor Statistics, User Feedback

Refinement steps:
• Linking
• Normalization
• Aggregation
• Clustering

Old Data Transformed in New Ways To Create New Features Such As…
• Data-Driven Reviews From an Array of Customer Segments
• People Who Shop Here Also Like…
• The Bundle Loyalty Score

Page 14

Before the Fun Stuff Happens…

Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list….

Credit Card Transactions (Billions – 10^9)

• Highly variable text descriptions

• Noisy geographic info

• Noisy merchant category info

Comprehensive Listing of US Merchants (Tens of Millions – 10^7)

Text Matching

Naïve item-by-item search takes O(10^16) expensive string comparisons: Too Slow!

Two main problems:

1. Accurate Fuzzy Matching is Difficult

2. Scale of Data is Enormous
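The 10^16 figure for the naive search is simply the product of the two record counts on this slide. The sketch below works through that arithmetic. The comparison rate is an assumption chosen only for illustration and does not come from the talk.

```python
# Back-of-the-envelope cost of naive item-by-item matching.
transactions = 10**9   # credit card transactions (from the slide)
merchants = 10**7      # master merchant listing (from the slide)

naive_comparisons = transactions * merchants

# ASSUMPTION: one million fuzzy string comparisons per second on one core.
comparisons_per_second = 10**6
seconds_per_year = 60 * 60 * 24 * 365

years = naive_comparisons / (comparisons_per_second * seconds_per_year)
print(f"{naive_comparisons:.0e} comparisons, roughly {years:.0f} core-years")
```

Under that assumed rate the naive approach costs centuries of single-core time, which is why the comparison set has to shrink before any fuzzy matching is attempted.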

Page 15

A “Brute Force” Approach Would Never Work…

[Chart: Processing Time / Workload vs. # of Merchants in Comparison Set. Neighborhood: Hundreds; City: Hundreds of Thousands; Nation: Tens of Millions]

1. Matching within Hundreds of Millions of Merchants would require massive processing… Fortunately we don’t need to match at this level.

2. Batching by local area makes the process orders of magnitude faster.

Page 16

Solution to Scaling Problem

This is a “Cascade of Scale Reductions”, Parallelizing by Location

1. Credit Card Transactions (Billions – 10^9)

2. Dedupe Description Strings

3. Batch Transactions by Geographic Neighborhood (batches 1, 2, … 10000)

4. Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant

5. Preliminary Merchant Listing Generated Directly from Transactions (Tens of Millions – 10^7)

6. Secondary Fuzzy Matching Process Reconciles Preliminary Listings with Merchant “Source of Truth”

7. Final Merged Transaction Data Set

Keys to solving the scaling problem:

1. Scale Reduction / Parallelized Text Clustering

2. Free Open Source Software

Computational Efficiency Increased by a Factor of 10^8!

Eons -> Days -> Minutes
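A minimal sketch of the first two scale reductions (dedupe the description strings, then batch them by neighborhood) may help show why the workload shrinks. The sample records, the use of a ZIP code as the neighborhood key, and the helper names are all hypothetical; the talk does not say how neighborhoods were defined.

```python
from collections import defaultdict

# Hypothetical sample records: (description string, neighborhood key).
transactions = [
    ("ANTHONYS RESTAURANT #123 BRKLY NY", "11201"),
    ("ANTHONYS RESTAURANT #123 BRKLY NY", "11201"),
    ("ANTHONY'S REST BROOKLYN", "11201"),
    ("JOES PIZZA NEW YORK NY", "10014"),
    ("JOES PIZZA NEW YORK NY", "10014"),
]

def batch_unique_descriptions(records):
    """Dedupe description strings, then group them by neighborhood."""
    batches = defaultdict(set)
    for description, neighborhood in records:
        batches[neighborhood].add(description)
    return batches

def pairwise_comparisons(n):
    """Number of unordered pairs among n strings."""
    return n * (n - 1) // 2

batches = batch_unique_descriptions(transactions)
unique_total = sum(len(strings) for strings in batches.values())

# Comparisons if every unique string is compared with every other one...
all_pairs = pairwise_comparisons(unique_total)
# ...versus comparisons made only inside each neighborhood batch.
blocked = sum(pairwise_comparisons(len(s)) for s in batches.values())
```

Because no comparison crosses a batch boundary, each batch can be handed to a separate worker, which is the "parallelizing by location" part of the cascade.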

Page 17

Data Preparation: Phase 1

DAMA Lens

Machine Learning Lens

• Matching (Strings)

• Unsupervised Learning

• Text Clustering

• Pattern Discovery

• Deduping X 10, Cleansing

Example: Anthonys Restaurant #123 Brkly NY -> Anthony’s Restaurant
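The Phase 1 idea of consolidating description strings that belong to the same merchant can be sketched as normalization followed by a simple greedy clustering. This is an illustration only: the talk does not name its clustering algorithm, and `difflib.SequenceMatcher`, the 0.8 threshold, and the sample strings are stand-ins chosen here. Unlike the slide's example, this sketch does not strip the location suffix.

```python
import re
from difflib import SequenceMatcher

def normalize(description):
    """Uppercase, then drop apostrophes, store numbers, and punctuation."""
    text = re.sub(r"['’]", "", description.upper())
    text = re.sub(r"#\d+", " ", text)        # store numbers such as "#123"
    text = re.sub(r"[^A-Z0-9 ]", " ", text)  # any remaining punctuation
    return " ".join(text.split())

def cluster(descriptions, threshold=0.8):
    """Greedy single-pass clustering: join the first cluster whose
    representative is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative, members)
    for raw in descriptions:
        key = normalize(raw)
        for representative, members in clusters:
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append(raw)
                break
        else:
            clusters.append((key, [raw]))
    return clusters

# Hypothetical strings from a single neighborhood batch.
batch = [
    "Anthonys Restaurant #123 Brkly NY",
    "ANTHONY'S RESTAURANT BRKLY NY",
    "Joes Pizza #7 Brkly NY",
]
result = cluster(batch)
```

Here the two Anthony's variants normalize to the same string and land in one cluster, while the pizza shop starts its own. Running this inside one neighborhood batch keeps the number of pairwise comparisons small.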

Page 18

Data Preparation: Phase 2

DAMA Lens

Machine Learning Lens

• Record Linkage

• Data Quality Enhancement

• Information Retrieval

• Supervised Classifier

• Deduping + 30%

• More Cleansing

• Data Enrichment

Search Retrieves Top 10 Possible Matches

Classifier applied to each, returns confidence score

If Confidence = High, Records are linked
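The three Phase 2 steps (retrieve candidates, score each one, link only on high confidence) can be sketched as follows. Everything specific here is an assumption: the tiny master listing, the `difflib` retrieval, the hand-set weights that stand in for the trained classifier, and the 0.85 threshold are illustrative and not taken from the talk.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical master listing: merchant name -> ZIP code.
MASTER = {
    "ANTHONYS RESTAURANT": "11201",
    "ANTHONYS PIZZERIA": "10014",
    "JOES PIZZA": "10014",
}

def retrieve_candidates(name, k=10):
    """Search step: up to k master names most similar to the query."""
    return get_close_matches(name, list(MASTER), n=k, cutoff=0.3)

def confidence(name, zip_code, candidate):
    """Stand-in for the supervised classifier: a hand-set weighting of
    name similarity and ZIP agreement, NOT a trained model."""
    name_score = SequenceMatcher(None, name, candidate).ratio()
    zip_score = 1.0 if MASTER[candidate] == zip_code else 0.0
    return 0.7 * name_score + 0.3 * zip_score

def link(name, zip_code, threshold=0.85):
    """Return the best candidate if its confidence is high, else None."""
    scored = [(confidence(name, zip_code, c), c)
              for c in retrieve_candidates(name)]
    if not scored:
        return None
    best_score, best = max(scored)
    return best if best_score >= threshold else None

match = link("ANTHONYS RESTAURNT", "11201")   # misspelled name, matching ZIP
no_match = link("CITY HARDWARE", "94110")     # nothing similar in the listing
```

Restricting the classifier to a short candidate list is what keeps this step affordable: the expensive scoring runs on at most ten records per query instead of the full listing.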

Page 19

Takeaways

1. Tame your data before perfecting your methods. Efficiency enables experimentation, iteration, improvement.

2. Design your process to minimize unnecessary complexity (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)

3. Tools: Take advantage of powerful (and inexpensive) open-source tools that enable your process...