Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Lessons Learned from building real-life Machine Learning Systems
Xavier Amatriain (@xamat) www.quora.com/profile/Xavier-Amatriain
3/29/16
A bit about Quora
Our Mission
“To share and grow
the world’s knowledge”
• Millions of questions & answers
• Millions of users
• Thousands of topics
• ...
What we care about
• Quality
• Relevance
• Demand
Lessons Learned
More Data vs. Better Models
More data or better models?
Really?
Anand Rajaraman: VC, Founder, Stanford Professor
More data or better models?
Sometimes, it’s not about more data
More data or better models?
Norvig: “Google does not have better algorithms, only more data”
Many features/low-bias models
More data or better models?
Sometimes, it’s not about more data
Sometimes you do need a (more) complex model
Better models and features that “don’t work”
● E.g. You have a linear model and have been selecting and optimizing features for that model
■ More complex model with the same features -> improvement not likely
■ More expressive features with the same model -> improvement not likely
● More complex features may require a more complex model
● A more complex model may not show improvements with a feature set that is too simple
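A minimal sketch of this interplay, assuming scikit-learn and a generic synthetic dataset: the same feature set is fed to a linear model and to a more complex boosted model. If the features were selected and tuned for the linear model, the complex model may show little gain.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Same features, two models of different complexity. If the features were
# optimized for the linear model, the complex model may not improve on it.
for name, model in [("linear", LogisticRegression(max_iter=1000)),
                    ("boosted", GradientBoostingClassifier())]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, round(score, 4))
```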
Model selection is also about hyperparameter optimization
Hyperparameter optimization
● Automate hyperparameter optimization by choosing the right metric
○ But, is it as simple as choosing the max?
● Bayesian Optimization (Gaussian Processes) better than grid search
○ See spearmint, hyperopt, AutoML, MOE...
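A short sketch of automated hyperparameter search with hyperopt (which uses tree-structured Parzen estimators rather than the Gaussian Processes of spearmint/MOE, but follows the same sequential model-based idea). The objective function here is a stand-in; in practice it would train and validate a real model.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK

# Stand-in objective: in a real setting, train a model with `params`
# and return a validation loss (lower is better).
def objective(params):
    loss = (params["lr"] - 0.01) ** 2 + 0.1 * params["depth"]
    return {"loss": loss, "status": STATUS_OK}

space = {
    "lr": hp.loguniform("lr", np.log(1e-4), np.log(1e-1)),
    "depth": hp.quniform("depth", 2, 10, 1),
}

# TPE sequentially proposes promising configurations instead of
# exhaustively sweeping a grid.
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)
```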
Supervised vs. (plus) Unsupervised Learning
Supervised/Unsupervised Learning
● Unsupervised learning as dimensionality reduction
● Unsupervised learning as feature engineering
● The “magic” behind combining
unsupervised/supervised learning
○ E.g. 1: clustering + kNN
○ E.g. 2: Matrix Factorization
■ MF can be interpreted as:
● Unsupervised:
○ Dimensionality reduction à la PCA
○ Clustering (e.g. NMF)
● Supervised:
○ Labeled targets ~ regression
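A minimal sketch of the clustering + kNN combination above, assuming scikit-learn: an unsupervised step (k-means) re-describes every point by its distances to learned centroids, and a supervised kNN classifier is trained on those engineered features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised step: learn centroids, then use distances to them as features.
km = KMeans(n_clusters=10, random_state=0).fit(X_tr)
Z_tr, Z_te = km.transform(X_tr), km.transform(X_te)

# Supervised step: kNN on the cluster-distance features.
knn = KNeighborsClassifier().fit(Z_tr, y_tr)
print(knn.score(Z_te, y_te))
```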
Supervised/Unsupervised Learning
● One of the “tricks” in Deep Learning is how it
combines unsupervised/supervised learning
○ E.g. Stacked Autoencoders
○ E.g. training of convolutional nets
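A sketch of the autoencoder idea, assuming tf.keras and hypothetical arrays X_unlabeled / X_labeled / y_labeled (a single hidden layer for brevity rather than a truly stacked network): an unsupervised reconstruction phase learns an encoding, which is then reused as the front end of a supervised classifier.

```python
import tensorflow as tf

# Unsupervised phase: the autoencoder learns to reconstruct its input.
inp = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(128, activation="relu")(inp)
recon = tf.keras.layers.Dense(784, activation="sigmoid")(code)
autoencoder = tf.keras.Model(inp, recon)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10)  # no labels needed

# Supervised phase: reuse the trained encoder, add a classification head,
# and fine-tune on the (typically smaller) labeled set.
logits = tf.keras.layers.Dense(10, activation="softmax")(code)
classifier = tf.keras.Model(inp, logits)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# classifier.fit(X_labeled, y_labeled, epochs=10)
```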
Everything is an ensemble
Ensembles
● Netflix Prize was won by an ensemble
○ Initially BellKor was using GBDTs
○ BigChaos introduced ANN-based ensemble
● Most practical applications of ML run an ensemble
○ Why wouldn’t you?
○ At least as good as the best of your methods
○ Can add completely different approaches (e.g. CF and content-based)
○ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
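A minimal sketch, assuming scikit-learn: a soft-voting ensemble that averages the predicted probabilities of very different models, which tends to perform at least as well as its best member.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)

# Soft voting averages predicted probabilities across diverse models.
ens = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    voting="soft",
).fit(X, y)
```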
Ensembles & Feature Engineering
● Ensembles are the way to turn any model into a feature!
● E.g. Don’t know if the way to go is to use Factorization
Machines, Tensor Factorization, or RNNs?
○ Treat each model as a “feature”
○ Feed them into an ensemble
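A sketch of the model-as-feature idea with a simple stacking setup, assuming scikit-learn; the two base models here stand in for the Factorization Machines / Tensor Factorization / RNN candidates named on the slide. Out-of-fold predictions become feature columns for an ensemble-layer logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, random_state=0)

# Out-of-fold predictions of each candidate model become "features"...
base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# ...which the ensemble-layer model (here LR) learns to weigh.
meta_model = LogisticRegression().fit(meta_features, y)
```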
The Master Algorithm?
It definitely is the ensemble!
The pains & gains of Feature Engineering
Feature Engineering
● Main properties of a well-behaved ML feature
○ Reusable
○ Transformable
○ Interpretable
○ Reliable
● Reusability: You should be able to reuse features in different
models, applications, and teams
● Transformability: Besides directly reusing a feature, it should be easy to use a transformation of it (e.g. log(f), max(f), ∑ f_t over a time window…)
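A small sketch of such transformations, assuming pandas and a hypothetical per-user feature log (the column names are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical raw feature log: one row per (user, day) with a feature f.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "f":    [3.0, 0.0, 7.0, 1.0, 4.0],
})

df["log_f"] = np.log1p(df["f"])  # log(f)
df["f_3d_sum"] = (df.groupby("user")["f"]  # ∑ f_t over a 3-step window
                    .transform(lambda s: s.rolling(3, min_periods=1).sum()))
```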
Feature Engineering
● Main properties of a well-behaved ML feature
○ Reusable
○ Transformable
○ Interpretable
○ Reliable
● Interpretability: In order to do any of the previous, you
need to be able to understand the meaning of features and
interpret their values.
● Reliability: It should be easy to monitor and detect bugs/issues
in features
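A sketch of what cheap feature reliability monitoring might look like; the helper below is hypothetical and only checks null rate and value range, which already catches many pipeline bugs.

```python
import pandas as pd

def check_feature(values: pd.Series, lo: float, hi: float,
                  max_null_frac: float = 0.01) -> dict:
    """Cheap reliability checks: alert on null rate and out-of-range values."""
    null_frac = values.isna().mean()
    out_of_range = ((values < lo) | (values > hi)).mean()
    return {"null_frac": null_frac,
            "out_of_range_frac": out_of_range,
            "ok": null_frac <= max_null_frac and out_of_range == 0}
```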
Feature Engineering Example - Quora Answer Ranking
What is a good Quora answer?
• truthful
• reusable
• provides explanation
• well formatted
• ...
Feature Engineering Example - Quora Answer Ranking
How are those dimensions translated
into features?
• Features that relate to the answer
quality itself
• Interaction features
(upvotes/downvotes, clicks,
comments…)
• User features (e.g. expertise in topic)
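A hypothetical illustration of how those three feature groups could be extracted for a single answer; the field names are invented for the example and are not Quora's actual schema.

```python
def answer_features(answer: dict, author: dict) -> dict:
    votes = answer["upvotes"] + answer["downvotes"]
    return {
        # features about the answer quality itself
        "length": len(answer["text"]),
        "has_paragraphs": int("\n\n" in answer["text"]),
        # interaction features
        "upvote_ratio": answer["upvotes"] / max(1, votes),
        "clicks": answer["clicks"],
        # user (author) features
        "author_topic_expertise": author["answers_in_topic"],
    }
```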
Implicit signals beat explicit ones
(almost always)
Implicit vs. Explicit
● Many have acknowledged
that implicit feedback is more useful
● Is implicit feedback really always
more useful?
● If so, why?
● Implicit data is (usually):
○ More dense, and available for all users
○ More representative of user behavior vs. user reflection
○ More related to final objective function
○ Better correlated with AB test results
● E.g. Rating vs watching
Implicit vs. Explicit
● However
○ It is not always the case that
direct implicit feedback correlates
well with long-term retention
○ E.g. clickbait
● Solution:
○ Combine different forms of
implicit + explicit to better represent
long-term goal
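One hypothetical way to combine the two kinds of feedback into a single training label; the action weights and the explicit-signal weight below are assumptions for illustration, not values from the talk.

```python
# Assumed per-action weights for implicit signals.
ACTION_WEIGHTS = {"click": 0.2, "read_30s": 0.5, "upvote": 1.0, "share": 1.5}

def long_term_label(event: dict) -> float:
    """Blend implicit actions with an explicit signal (e.g. a survey rating)."""
    implicit = sum(ACTION_WEIGHTS.get(a, 0.0) for a in event["actions"])
    explicit = event.get("survey_rating", 0.0)
    # Pure clicks alone score low, so clickbait is down-weighted relative
    # to content that also earns upvotes/shares or good explicit ratings.
    return implicit + 0.5 * explicit
```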
Be thoughtful about your Training Data
Defining training/testing data
● Training a simple binary classifier for good/bad answer
○ Defining positive and negative labels -> non-trivial task
○ Is this a positive or a negative?
■ funny uninformative answer with many upvotes
■ short uninformative answer by a well-known expert in the field
■ very long informative answer that nobody reads/upvotes
■ informative answer with grammar/spelling mistakes
■ ...
Other training data issues: Time traveling
● Time traveling: usage of features that originated after the event you are trying to predict○ E.g. Your upvoting an answer is a pretty good prediction
of you reading that answer, especially because most upvotes happen AFTER you read the answer
○ Tricky when you have many related features○ Whenever I see an offline experiment with huge wins, I
ask: “Is there time traveling?”
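A minimal guard against time traveling, assuming pandas and a hypothetical feature log with a timestamp column: when building the feature vector for an event, keep only feature rows logged strictly before that event.

```python
import pandas as pd

def features_before(feature_log: pd.DataFrame,
                    label_time: pd.Timestamp) -> pd.DataFrame:
    """Keep only feature events that happened BEFORE the label event."""
    return feature_log[feature_log["timestamp"] < label_time]
```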
Your Model will learn what you teach it to learn
Training a model
● Model will learn according to:
○ Training data (e.g. implicit and explicit)
○ Target function (e.g. probability of user reading an answer)
○ Metric (e.g. precision vs. recall)
● Example 1 (made up):
○ Optimize probability of a user going to the cinema to
watch a movie and rate it “highly” by using purchase history
and previous ratings. Use NDCG of the ranking as final
metric using only movies rated 4 or higher as positives.
Example 2 - Quora’s feed
● Training data = implicit + explicit
● Target function: Value of showing a story to a user ~ weighted sum of actions: v = ∑_a v_a · 1{y_a = 1}
○ Predict probabilities for each action, then compute the expected value: v_pred = E[V | x] = ∑_a v_a · p(a | x)
● Metric: any ranking metric
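The expected-value computation above in code, with assumed per-action values and made-up predicted probabilities for a single (story, user) pair x.

```python
import numpy as np

# Per-action values v_a (assumed weights) and predicted p(a | x)
# for one story x; e.g. actions = (upvote, click, share).
v = np.array([1.0, 0.5, 2.0])     # hypothetical v_a
p = np.array([0.10, 0.40, 0.02])  # model outputs p(a | x)

v_pred = float(v @ p)             # E[V | x] = sum_a v_a * p(a | x)
```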
Offline testing
● Measure model performance, using (IR) metrics
● Offline performance = an indication for deciding on follow-up A/B tests
● A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
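As an example of an offline (IR) ranking metric, NDCG for a single query, assuming scikit-learn (ndcg_score is available in versions ≥ 0.22); the relevance grades and scores are made up.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query: graded relevance of four items vs. the model's ranking scores.
true_relevance = np.array([[3, 0, 2, 1]])
model_scores = np.array([[0.9, 0.1, 0.8, 0.3]])
print(ndcg_score(true_relevance, model_scores))
```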
Learn to deal with Presentation Bias
2D Navigational modeling
[diagram: a 2D grid of feed positions; items near the top are more likely to be seen, items further down less likely]
The curse of presentation bias
● User can only click on what you decide to show
● But, what you decide to show is the result of what your model predicted is good
● Simply treating things you show as negatives is not likely to work
● Better options:
○ Correcting for the probability a user will click on a position -> attention models
○ Explore/exploit approaches such as MAB
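A minimal explore/exploit sketch in the bandit spirit mentioned above (epsilon-greedy, one of the simplest MAB policies); the estimated click rates are hypothetical model outputs.

```python
import random

def epsilon_greedy(estimated_ctr, epsilon=0.1):
    """With prob. epsilon show a random item (explore); otherwise show the
    item the model currently ranks best (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_ctr))
    return max(range(len(estimated_ctr)), key=estimated_ctr.__getitem__)

# Usage: pick a slot among four candidates with estimated click rates.
print(epsilon_greedy([0.12, 0.30, 0.05, 0.22]))
```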
You don’t need to distribute your ML algorithm
Distributing ML
● Most of what people do in practice can fit into a multi-core machine
○ Smart data sampling
○ Offline schemes
○ Efficient parallel code
● Dangers of “easy” distributed approaches such
as Hadoop/Spark
● Do you care about costs? How about latencies?
Distributing ML
● Example of optimizing computations to fit them into
one machine
○ Spark implementation: 6 hours, 15 machines
○ Developer time: 4 days
○ C++ implementation: 10 minutes, 1 machine
● Most practical applications of Big Data can fit into
a (multicore) implementation
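A sketch of the single-machine pattern, assuming joblib and pandas; train_fold is a hypothetical stand-in for your own training code, and the sampling fraction is an assumption.

```python
from joblib import Parallel, delayed

def train_fold(data, k):
    """Hypothetical: fit a model on fold k of `data` and return its score."""
    ...

# Smart data sampling + all cores of one machine, instead of a cluster:
# sample = full_df.sample(frac=0.1, random_state=0)
# scores = Parallel(n_jobs=-1)(delayed(train_fold)(sample, k)
#                              for k in range(10))
```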
The untold story of Data Science vs. ML Engineering
Data Scientists and ML Engineers
● We all know the definition of a Data Scientist
● Where do Data Scientists fit in an organization?
○ Many companies struggling with this
● Valuable to have strong DS who can bring value
from the data
● Strong DS with solid engineering skills are unicorns and finding them is not scalable
○ DS need engineers to bring things to production
○ Engineers have too much on their plate to be willing to “productionize” cool DS projects
The data-driven ML innovation funnel
Data Research -> ML Exploration / Product Design -> AB Testing
Data Scientists and ML Engineers
● Solution:
○ (1) Define different parts of the innovation funnel
■ Part 1. Data research & hypothesis
building -> Data Science
■ Part 2. ML solution building &
implementation -> ML Engineering
■ Part 3. Online experimentation, AB Testing analysis -> Data Science
○ (2) Broaden the definition of ML Engineers
to include from coding experts with high-level
ML knowledge to ML experts with good
software skills
Data Research (Data Science)
ML Solution (ML Engineering)
AB Testing (Data Science)
Conclusions
● In data, size is not all that matters
● Understand dependencies between data, models & systems
● Choose the right metric & optimize what matters
● Be thoughtful about
○ your ML infrastructure/tools
○ organizing your teams
Questions?