Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)


9th October 2012
Michael Cutler @cotdp

tumra.com
@tumra

TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA

... the battle of accuracy vs latency

$ whoami

Michael Cutler (@cotdp)

● Previously at British Sky Broadcasting
  ○ Last 7 years in R&D
  ○ Created several patented systems & algorithms
  ○ Kicked off ‘Big Data’ initiative at Sky in 2008

● Co-founder, CTO @ TUMRA in March '12
  ○ Real-time big data science platform
  ○ Alpha-testing with selected clients

Agenda

● Background
● Real-Time vs Batch processing
● Accuracy vs Latency
● Use Cases
  ○ eCommerce
  ○ Financial Services
  ○ Media
● Questions

Background

Big Data is "in vogue", but what does it mean:

● Distributed processing
● Massively scalable
● Commodity

Apache Hadoop is the "kernel" of the Big Data OS:

● Distributed Filesystem (HDFS)
● Parallel Processing (Map/Reduce, YARN)

Background (cont'd)

Solving problems with Big Data is hard:

● Tools are all low-level (Pig, Hive etc.)
● Skills are hard to find

What is "Data Science":

● Understanding data & solving problems
● Applies the following skills:
  ○ Statistical Analysis
  ○ Machine Learning
  ○ Communicating Results

Real-Time vs Batch processing

Credit: http://bit.ly/Q71u4W

Batch - Hoppers, Bins, Buckets

Real-Time - Flows & Streams

Credit: http://bit.ly/NOslqf

Real-Time vs Batch processing

Similarities to the Industrial Revolution:

● From handicraft to Batch & Real-Time
● Complexity increases

Need for "Real-Time":

● Wherever the variation can change faster than you can retrain models

● When you can't pre-compute everything ahead of time

Accuracy vs Latency


Netflix Prize winning entry :-

● Ensemble of 100s of models
● Massively compute-intensive solution
● Marginally better than much simpler models

IBM won the KDD Cup 2009 (Orange) :-

● IBM Watson team won by sheer brute force
● Used a "one of everything" approach, generating hundreds of models

Accuracy vs Latency (cont'd)

Mathematical navel-gazing:

● Often the factor we're optimising for isn't the thing we measure improvement in:
  ○ User ratings vs. customer longevity/value
  ○ Overfitting outliers vs. missing clear Fraud

Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.

However, that doesn't mean...

Accuracy vs Latency (cont'd)

It's a trade-off:

● Sometimes "best guess" is good enough,
● Other times we can wait for the accuracy,
● And of course, occasionally we want both!

Key objective:

● Most appropriate solution for the use-case
● Hybrid solutions: part batch, part real-time


Use Case - eCommerce

Objective - Increase profits

How:
● Match potential customers to the right products
● Personalise user experience on web & email
● Customer lifecycle management

Method:
● Ensemble of real-time models
● Collect lots of implicit feedback data

Use Case - eCommerce (cont'd)

Detail:
● Clustering - behavior, demographics
● Simple predictors - keywords to products
● Bayesian Bandit - blend the output

Requirements:
● Predictions in < 50 ms
● Online learning models
● Occasional batch updates are OK
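
A minimal sketch of the "Bayesian Bandit - blend the output" idea, assuming it means Thompson sampling over the individual models: each model is an arm with a Beta posterior over its click-through rate, and every request is routed to the model whose sampled rate is highest. The class name, the arm names and the click rates in `simulate` are illustrative assumptions, not details from the talk.

```python
import random


class BayesianBandit:
    """Thompson sampling over a fixed set of models (arms).

    Hypothetical helper: each arm keeps a Beta(successes, failures)
    posterior over its click-through rate.
    """

    def __init__(self, arm_names, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior: no opinion before any feedback.
        self.successes = {name: 1 for name in arm_names}
        self.failures = {name: 1 for name in arm_names}

    def choose(self):
        """Sample a plausible rate for every arm and pick the largest."""
        samples = {
            name: self._rng.betavariate(self.successes[name], self.failures[name])
            for name in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, name, clicked):
        """Online update from one piece of implicit feedback."""
        if clicked:
            self.successes[name] += 1
        else:
            self.failures[name] += 1

    def expected_rate(self, name):
        """Posterior mean of the arm's click-through rate."""
        a, b = self.successes[name], self.failures[name]
        return a / (a + b)


def simulate(true_rates, rounds=5000, seed=42):
    """Replay simulated clicks; true_rates maps arm name -> assumed rate."""
    rng = random.Random(seed)
    bandit = BayesianBandit(true_rates, seed=seed)
    pulls = {name: 0 for name in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, rng.random() < true_rates[arm])
    return bandit, pulls


if __name__ == "__main__":
    # Made-up click rates for two of the predictors named on the slide.
    bandit, pulls = simulate({"clusters": 0.03, "keywords": 0.08})
    print(pulls)
```

Both `choose` and `update` are constant-time per arm, which is what makes this kind of blending plausible inside a < 50 ms budget; the counts could still be reset or re-seeded by an occasional batch job.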

When eCommerce #FAILs

I've only ever bought Cat food...

... wait there's more, no Cat food

Even Amazon can #FAIL


Use Case - Financial Services

Objective - Reduce Fraud

How:
● Compute patterns/predictors for individuals
● Cluster individuals and recompute for clusters
● Compute baselines across all data

Method:
● Hybrid and Hierarchical Clustering models
● Simple predictors for individuals, clusters & baseline


Detail:
● CHEAT!!! ... Cluster to nearest centroid
  ○ will degrade over time (Hunchback Clusters)
● Use simple metrics to alert (stddev)

Requirements:
● Ability to alert/intervene in near real-time (< 1 second)
● Adapt to rapid changes (within baseline & clusters)
● Periodic batch processing to recompute clusters
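
The "cheat" above can be sketched roughly as follows, under some assumptions: centroids come from a periodic batch job and are held fixed, each event is assigned to its nearest centroid, and a running mean and standard deviation per cluster (Welford's online algorithm, chosen here for illustration) flags amounts that sit too many standard deviations from the cluster mean. `FraudMonitor`, the 3-sigma threshold and the `min_samples` warm-up are hypothetical choices, not values from the talk.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


class RunningStats:
    """Welford's online mean/variance: O(1) time and memory per update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        """Sample standard deviation (0.0 until two values are seen)."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, threshold=3.0, min_samples=10):
        """True when x is more than `threshold` stddevs from the mean."""
        if self.n < min_samples or self.stddev == 0.0:
            return False  # not enough history to judge
        return abs(x - self.mean) > threshold * self.stddev


class FraudMonitor:
    """Hypothetical real-time scorer on top of batch-computed centroids."""

    def __init__(self, centroids):
        self.centroids = centroids  # recomputed by the periodic batch job
        self.cluster_stats = [RunningStats() for _ in centroids]

    def observe(self, features, amount):
        """Score the event against its cluster, then learn from it."""
        cluster = nearest_centroid(features, self.centroids)
        stats = self.cluster_stats[cluster]
        alert = stats.is_anomalous(amount)
        # Assumption: flagged events still update the baseline, so the
        # model adapts quickly at the cost of being nudged by outliers.
        stats.update(amount)
        return cluster, alert
```

Because the centroids never move between batch runs, assignments slowly drift away from the true cluster structure, which is presumably the degradation the slide calls "Hunchback Clusters"; the periodic batch recompute is what corrects it.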


Use Case - Media

Objective - Generating Metadata

Why:
● Drive second screen applications
● Create new streams of information for resale

How:
● Video / Audio analysis
● Closed caption or subtitle text processing
● Knowledgebase :- People, Places, Products & Things

Use Case - Media (cont'd)

Method:
● Natural Language Processing
  ○ Named Entity Recognition
  ○ Topic Extraction & Disambiguation
● Graph databases & algorithms

Requirements:
● Responses in < 1 second
● Ability to learn new 'Things'
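
As a rough illustration of tagging subtitle text against a knowledgebase that can learn new 'Things' at run time, here is a deliberately simple dictionary (gazetteer) lookup with longest-match-first scanning. This is a stand-in technique, not statistical Named Entity Recognition and not necessarily what the talk's system does; the `Knowledgebase` class, its methods and the sample entities are assumptions made for the example.

```python
import re

_TOKEN = re.compile(r"[\w']+")


def _normalise(text):
    """Lowercase and tokenise so lookups ignore case and punctuation."""
    return " ".join(_TOKEN.findall(text.lower()))


class Knowledgebase:
    """Hypothetical in-memory store of People, Places, Products & Things."""

    def __init__(self):
        self._entities = {}  # normalised surface form -> (name, type)
        self._max_words = 1  # longest known surface form, in words

    def learn(self, name, entity_type, aliases=()):
        """Add a new entity (or alias) without retraining anything."""
        for form in (name, *aliases):
            key = _normalise(form)
            self._entities[key] = (name, entity_type)
            self._max_words = max(self._max_words, len(key.split()))

    def tag(self, text):
        """Return (name, type) pairs in order, preferring longer matches."""
        words = _TOKEN.findall(text.lower())
        found = []
        i = 0
        while i < len(words):
            # Try the longest candidate phrase first, then shrink.
            for n in range(min(self._max_words, len(words) - i), 0, -1):
                key = " ".join(words[i:i + n])
                if key in self._entities:
                    found.append(self._entities[key])
                    i += n
                    break
            else:
                i += 1  # no entity starts at this word
        return found


if __name__ == "__main__":
    kb = Knowledgebase()
    kb.learn("Apache Hadoop", "Product", aliases=["Hadoop"])
    kb.learn("Oxford", "Place")
    print(kb.tag("Hadoop was discussed at length in Oxford."))
```

Lookups are hash-table hits, so latency stays far below the 1 second budget, and `learn` takes effect immediately. Disambiguation (the same surface form naming several entities) is not handled here; that is where the graph algorithms mentioned above would presumably come in.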

Example of 12,000 entities from our Knowledgebase...

Summary


Key points:
● Clear move towards distributed algorithms
● Low latency often matters more than accuracy
● Trade-offs are dependent on the use-cases

Further reading:
● Apache Mahout - http://mahout.apache.org/
● Storm Project - http://storm-project.net/
● Data Science London - http://datasciencelondon.org/
● Machine Learning Meetup - http://bit.ly/w8V8f6

Almost finished!

Introducing TUMRA Labs

API access to some of our real-time models:

● Probabilistic Demographics

Coming Soon:
● Language detection
● Sentiment analysis
● Metadata Generation

Free to sign up and easy to get started!

http://labs.tumra.com/

Questions?

Work: tumra.com (@tumra)

Personal: cotdp.com (@cotdp)
