Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)
Real-Time Machine Learning at Industrial scale
9th October 2012
Michael Cutler @cotdp
tumra.com @tumra
TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
... the battle of accuracy vs latency
$ whoami
Michael Cutler (@cotdp)
● Previously at British Sky Broadcasting
  ○ Last 7 years in R&D
  ○ Created several patented systems & algorithms
  ○ Kicked off 'Big Data' initiative at Sky in 2008
● Co-founder & CTO @ TUMRA in March '12
  ○ Real-time big data science platform
  ○ Alpha-testing with selected clients
Agenda
● Background
● Real-Time vs Batch processing
● Accuracy vs Latency
● Use Cases
  ○ eCommerce
  ○ Financial Services
  ○ Media
● Questions
Background
Big Data is "in vogue", but what does it mean:
● Distributed processing
● Massively scalable
● Commodity hardware
Apache Hadoop is the "kernel" of the Big Data OS:
● Distributed Filesystem (HDFS)
● Parallel Processing (Map/Reduce, YARN)
Background (cont'd)
Solving problems with Big Data is hard:
● Tools are all low-level (Pig, Hive etc.)
● Skills are hard to find
What is "Data Science":
● Understanding data & solving problems
● Applies the following skills:
  ○ Statistical Analysis
  ○ Machine Learning
  ○ Communicating Results
Real-Time vs Batch processing
Credit: http://bit.ly/Q71u4W
Batch - Hoppers, Bins, Buckets
Real-Time - Flows & Streams
Credit: http://bit.ly/NOslqf
Real-Time vs Batch processing
Similarities to the Industrial Revolution:
● From handicraft to Batch & Real-Time
● Complexity increases
Need for "Real-Time":
● Wherever the variation can change faster than you can retrain models
● When you can't pre-compute everything ahead of time
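A minimal sketch of the first point, assuming a simple online logistic regression updated by stochastic gradient descent. The class name, learning rate, and single-feature setup are illustrative assumptions, not something taken from the talk. The model learns from each event as it arrives, so it can follow a shift in the data without waiting for a batch retrain:

```python
import math


class OnlineLogisticRegression:
    """Logistic regression trained one event at a time (hypothetical example)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Probability that the label is 1 for feature vector x."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        """Take a single SGD step on one (features, label) event."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
```

Calling `learn_one` on every incoming event keeps the model current; if the relationship between features and labels reverses, later events pull the weights the other way.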
Accuracy vs Latency
Netflix Prize winning entry:
● Ensemble of hundreds of models
● Massively compute-intensive solution
● Marginally better than much simpler models
IBM won the KDD Cup 2009 (Orange):
● IBM Watson team won by sheer brute force
● Used a "one of everything" approach, generating hundreds of models
Accuracy vs Latency (cont'd)
Mathematical navel-gazing:
● Often the factor we're optimising for isn't the thing we measure improvement in:
  ○ User ratings vs. customer longevity/value
  ○ Overfitting outliers vs. missing clear fraud
Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.
However, that doesn't mean...
Accuracy vs Latency (cont'd)
It's a trade-off:
● Sometimes "best guess" is good enough
● Other times we can wait for the accuracy
● And of course, occasionally we want both!
Key objective:
● Most appropriate solution for the use-case
● Hybrid solutions: part batch, part real-time
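One way to picture a hybrid solution, as a rough sketch only: a periodic batch job rebuilds an accurate view from the full history, while a real-time layer tracks events seen since the last batch run, and queries merge the two. The `HybridCounter` class and its method names are hypothetical, and the sketch ignores events that arrive while the batch job is running:

```python
from collections import Counter


class HybridCounter:
    """Merges a batch-computed view with a real-time delta (hypothetical example)."""

    def __init__(self):
        self.batch_view = Counter()      # rebuilt periodically from full history
        self.realtime_delta = Counter()  # events seen since the last batch run

    def record(self, key):
        """Real-time path: cheap update on every incoming event."""
        self.realtime_delta[key] += 1

    def query(self, key):
        """Answer immediately by combining both layers."""
        return self.batch_view[key] + self.realtime_delta[key]

    def run_batch(self, full_history):
        """Batch path: recompute from scratch, then reset the delta."""
        self.batch_view = Counter(full_history)
        self.realtime_delta.clear()
```

The same shape applies when the batch layer is an expensive model and the real-time layer is a cheap correction on top of it.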
Use Case - eCommerce
Objective - Increase profits
How:
● Match potential customers to the right products
● Personalise user experience on web & email
● Customer lifecycle management
Method:
● Ensemble of real-time models
● Collect lots of implicit feedback data
Use Case - eCommerce (cont'd)
Detail:
● Clustering - behavior, demographics
● Simple predictors - keywords to products
● Bayesian Bandit - blend the output
Requirements:
● Predictions in < 50 ms
● Online learning models
● Occasional batch updates are OK
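The "Bayesian Bandit" step could look something like the sketch below, assuming Thompson sampling with a Beta-Bernoulli model over click/no-click feedback. The talk does not say which bandit variant was used; the class, the arm names, and the uniform Beta(1, 1) prior are all assumptions made for illustration. Each arm stands for one underlying model whose recommendation may be shown:

```python
import random


class BayesianBandit:
    """Thompson sampling over a set of named models (hypothetical example)."""

    def __init__(self, arm_names, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every arm: one pseudo-success, one pseudo-failure.
        self.successes = {name: 1 for name in arm_names}
        self.failures = {name: 1 for name in arm_names}

    def choose(self):
        """Sample a plausible click rate per arm and pick the highest."""
        samples = {
            name: self.rng.betavariate(self.successes[name], self.failures[name])
            for name in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, name, clicked):
        """Online learning: fold one piece of implicit feedback into the posterior."""
        if clicked:
            self.successes[name] += 1
        else:
            self.failures[name] += 1
```

Both `choose` and `update` are constant-time per arm, which suits a tight prediction budget, and arms that perform poorly are shown less often as evidence accumulates.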
When eCommerce #FAILs
I've only ever bought Cat food...
... wait there's more, no Cat food
Even Amazon can #FAIL
Use Case - Financial Services
Objective - Reduce Fraud
How:
● Compute patterns/predictors for individuals
● Cluster individuals and recompute for clusters
● Compute baselines across all data
Method:
● Hybrid and Hierarchical Clustering models
● Simple predictors for individuals, clusters & baseline
Use Case - Financial Services (cont'd)
Detail:
● CHEAT!!! ... Cluster to nearest centroid
  ○ will degrade over time (Hunchback Clusters)
● Use simple metrics to alert (stddev)
Requirements:
● Ability to alert/intervene near real-time < 1 second
● Adapt to rapid changes (within baseline & clusters)
● Periodic batch processing to recompute clusters
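A minimal sketch of the two real-time pieces, under stated assumptions: new points are assigned to the nearest existing centroid instead of re-clustering, and an alert fires when a value sits more than a few standard deviations from a running mean. The function names, the 3-sigma default, and the use of Welford's online algorithm for the running statistics are illustrative choices, not details from the talk:

```python
import math


def nearest_centroid(point, centroids):
    """Return the name of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


class RunningStats:
    """Mean and standard deviation updated one value at a time (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def is_anomalous(value, stats, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if stats.count < 2 or stats.stddev == 0.0:
        return False
    return abs(value - stats.mean) > threshold * stats.stddev
```

One `RunningStats` instance could be kept per individual, per cluster, and for the global baseline. The centroids themselves would come from the periodic batch job, since nearest-centroid assignment drifts as the data changes.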
Use Case - Media
Objective - Generating Metadata
Why:
● Drive second screen applications
● Create new streams of information for resale
How:
● Video / Audio analysis
● Closed caption or subtitle text processing
● Knowledgebase: People, Places, Products & Things
Use Case - Media (cont'd)
Method:
● Natural Language Processing
  ○ Named Entity Recognition
  ○ Topic Extraction & Disambiguation
● Graph databases & algorithms
Requirements:
● Responses in < 1 second
● Ability to learn new 'Things'
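As a rough illustration of entity recognition against a knowledgebase, the sketch below does a longest-match dictionary lookup over subtitle text and lets new entities be added at runtime. It is a deliberately simple stand-in: the `Knowledgebase` class, its methods, and the sample entities are hypothetical, and it performs no disambiguation or statistical NER.

```python
import re


class Knowledgebase:
    """Dictionary-based entity tagger with runtime learning (hypothetical example)."""

    def __init__(self):
        self.entities = {}  # lowercased token tuple -> (display name, type)
        self.max_len = 1    # longest entity name, in tokens

    def learn(self, name, entity_type):
        """Add a new 'Thing' so later text can be tagged with it."""
        tokens = tuple(re.findall(r"\w+", name.lower()))
        self.entities[tokens] = (name, entity_type)
        self.max_len = max(self.max_len, len(tokens))

    def tag(self, text):
        """Return known entities found in text, preferring the longest match."""
        tokens = re.findall(r"\w+", text.lower())
        found = []
        i = 0
        while i < len(tokens):
            for n in range(min(self.max_len, len(tokens) - i), 0, -1):
                match = self.entities.get(tuple(tokens[i:i + n]))
                if match:
                    found.append(match)
                    i += n
                    break
            else:
                i += 1  # no entity starts at this token
        return found
```

Because lookups are plain dictionary hits, tagging a line of subtitle text stays well inside a sub-second budget, and `learn` makes a new entity available to the very next call.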
Example of 12,000 entities from our Knowledgebase...
Summary
Key points:
● Clear move towards distributed algorithms
● Low latency often matters more than marginal gains in accuracy
● Trade-offs are dependent on the use-cases
Further reading:
● Apache Mahout - http://mahout.apache.org/
● Storm Project - http://storm-project.net/
● Data Science London - http://datasciencelondon.org/
● Machine Learning Meetup - http://bit.ly/w8V8f6
Almost finished!
Introducing TUMRA Labs
API access to some of our real-time models:
● Probabilistic Demographics
Coming Soon:
● Language detection
● Sentiment analysis
● Metadata Generation
Free to sign up and easy to get started!
http://labs.tumra.com/
Questions?
Work: tumra.com, @tumra
Personal: cotdp.com, @cotdp