Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Lessons Learned from building real-life Machine Learning Systems
Xavier Amatriain (@xamat) www.quora.com/profile/Xavier-Amatriain
3/29/16
A bit about Quora
Our Mission
“To share and grow
the world’s knowledge”
• Millions of questions & answers
• Millions of users
• Thousands of topics
• ...
What we care about
• Quality
• Relevance
• Demand
Lessons Learned
More Data vs. Better Models
More data or better models?
Really?
Anand Rajaraman: VC, Founder, Stanford Professor
More data or better models?
Sometimes, it’s not about more data
More data or better models?
Norvig: “Google does not have better algorithms, only more data”
Many features/low-bias models
More data or better models?
Sometimes, it’s not about more data
Sometimes you do need a (more) complex model
Better models and features that “don’t work”
● E.g. You have a linear model and have been selecting and optimizing features for that model
■ More complex model with the same features -> improvement not likely
■ More expressive features with the same model -> improvement not likely
● More complex features may require a more complex model
● A more complex model may not show improvements with a feature set that is too simple
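A minimal sketch of this interplay, assuming scikit-learn and a generic synthetic dataset: the same feature set is fed to a linear model and to a more complex boosted model. If the features were selected and tuned for the linear model, the complex model may show little gain.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Same features, two models of different complexity. If the features were
# optimized for the linear model, the complex model may not improve on it.
for name, model in [("linear", LogisticRegression(max_iter=1000)),
                    ("boosted", GradientBoostingClassifier())]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, round(score, 4))
```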
Model selection is also about hyperparameter optimization
Hyperparameter optimization
● Automate hyperparameter optimization by choosing the right metric
○ But, is it as simple as choosing the max?
● Bayesian Optimization (Gaussian Processes) better than grid search
○ See spearmint, hyperopt, AutoML, MOE...
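A short sketch of automated hyperparameter search with hyperopt (which uses tree-structured Parzen estimators rather than the Gaussian Processes of spearmint/MOE, but follows the same sequential model-based idea). The objective function here is a stand-in; in practice it would train and validate a real model.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK

# Stand-in objective: in a real setting, train a model with `params`
# and return a validation loss (lower is better).
def objective(params):
    loss = (params["lr"] - 0.01) ** 2 + 0.1 * params["depth"]
    return {"loss": loss, "status": STATUS_OK}

space = {
    "lr": hp.loguniform("lr", np.log(1e-4), np.log(1e-1)),
    "depth": hp.quniform("depth", 2, 10, 1),
}

# TPE sequentially proposes promising configurations instead of
# exhaustively sweeping a grid.
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)
```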
Supervised vs. (plus) Unsupervised Learning
Supervised/Unsupervised Learning
● Unsupervised learning as dimensionality reduction
● Unsupervised learning as feature engineering
● The “magic” behind combining
unsupervised/supervised learning
○ E.g. 1: clustering + kNN
○ E.g. 2: Matrix Factorization
■ MF can be interpreted as:
● Unsupervised:
○ Dimensionality reduction à la PCA
○ Clustering (e.g. NMF)
● Supervised:
○ Labeled targets ~ regression
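A minimal sketch of the clustering + kNN combination above, assuming scikit-learn: an unsupervised step (k-means) re-describes every point by its distances to learned centroids, and a supervised kNN classifier is trained on those engineered features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised step: learn centroids, then use distances to them as features.
km = KMeans(n_clusters=10, random_state=0).fit(X_tr)
Z_tr, Z_te = km.transform(X_tr), km.transform(X_te)

# Supervised step: kNN on the cluster-distance features.
knn = KNeighborsClassifier().fit(Z_tr, y_tr)
print(knn.score(Z_te, y_te))
```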
Supervised/Unsupervised Learning
● One of the “tricks” in Deep Learning is how it
combines unsupervised/supervised learning
○ E.g. Stacked Autoencoders
○ E.g. training of convolutional nets
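A sketch of the autoencoder idea, assuming tf.keras and hypothetical arrays X_unlabeled / X_labeled / y_labeled (a single hidden layer for brevity rather than a truly stacked network): an unsupervised reconstruction phase learns an encoding, which is then reused as the front end of a supervised classifier.

```python
import tensorflow as tf

# Unsupervised phase: the autoencoder learns to reconstruct its input.
inp = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(128, activation="relu")(inp)
recon = tf.keras.layers.Dense(784, activation="sigmoid")(code)
autoencoder = tf.keras.Model(inp, recon)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10)  # no labels needed

# Supervised phase: reuse the trained encoder, add a classification head,
# and fine-tune on the (typically smaller) labeled set.
logits = tf.keras.layers.Dense(10, activation="softmax")(code)
classifier = tf.keras.Model(inp, logits)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# classifier.fit(X_labeled, y_labeled, epochs=10)
```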
Everything is an ensemble
Ensembles
● Netflix Prize was won by an ensemble
○ Initially BellKor was using GBDTs
○ BigChaos introduced ANN-based ensemble
● Most practical applications of ML run an ensemble
○ Why wouldn’t you?
○ At least as good as the best of your methods
○ Can add completely different approaches (e.g. CF and content-based)
○ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
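A minimal sketch, assuming scikit-learn: a soft-voting ensemble that averages the predicted probabilities of very different models, which tends to perform at least as well as its best member.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)

# Soft voting averages predicted probabilities across diverse models.
ens = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    voting="soft",
).fit(X, y)
```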
Ensembles & Feature Engineering
● Ensembles are the way to turn any model into a feature!
● E.g. Don’t know if the way to go is to use Factorization
Machines, Tensor Factorization, or RNNs?
○ Treat each model as a “feature”
○ Feed them into an ensemble
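A sketch of the model-as-feature idea with a simple stacking setup, assuming scikit-learn; the two base models here stand in for the Factorization Machines / Tensor Factorization / RNN candidates named on the slide. Out-of-fold predictions become feature columns for an ensemble-layer logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, random_state=0)

# Out-of-fold predictions of each candidate model become "features"...
base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# ...which the ensemble-layer model (here LR) learns to weigh.
meta_model = LogisticRegression().fit(meta_features, y)
```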
The Master Algorithm?
It definitely is the ensemble!
The pains & gains of Feature Engineering
Feature Engineering
● Main properties of a well-behaved ML feature
○ Reusable
○ Transformable
○ Interpretable
○ Reliable
● Reusability: You should be able to reuse features in different
models, applications, and teams
● Transformability: Besides directly reusing a feature, it should be easy to use a transformation of it (e.g. log(f), max(f), ∑ f_t over a time window…)
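A small sketch of such transformations, assuming pandas and a hypothetical per-user feature log (the column names are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical raw feature log: one row per (user, day) with a feature f.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "f":    [3.0, 0.0, 7.0, 1.0, 4.0],
})

df["log_f"] = np.log1p(df["f"])  # log(f)
df["f_3d_sum"] = (df.groupby("user")["f"]  # ∑ f_t over a 3-step window
                    .transform(lambda s: s.rolling(3, min_periods=1).sum()))
```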
Feature Engineering
● Main properties of a well-behaved ML feature
○ Reusable
○ Transformable
○ Interpretable
○ Reliable
● Interpretability: In order to do any of the previous, you
need to be able to understand the meaning of features and
interpret their values.
● Reliability: It should be easy to monitor and detect bugs/issues
in features
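A sketch of what cheap feature reliability monitoring might look like; the helper below is hypothetical and only checks null rate and value range, which already catches many pipeline bugs.

```python
import pandas as pd

def check_feature(values: pd.Series, lo: float, hi: float,
                  max_null_frac: float = 0.01) -> dict:
    """Cheap reliability checks: alert on null rate and out-of-range values."""
    null_frac = values.isna().mean()
    out_of_range = ((values < lo) | (values > hi)).mean()
    return {"null_frac": null_frac,
            "out_of_range_frac": out_of_range,
            "ok": null_frac <= max_null_frac and out_of_range == 0}
```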
Feature Engineering Example - Quora Answer Ranking
What is a good Quora answer?
• truthful
• reusable
• provides explanation
• well formatted
• ...
Feature Engineering Example - Quora Answer Ranking
How are those dimensions translated
into features?
• Features that relate to the answer
quality itself
• Interaction features
(upvotes/downvotes, clicks,
comments…)
• User features (e.g. expertise in topic)
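A hypothetical illustration of how those three feature groups could be extracted for a single answer; the field names are invented for the example and are not Quora's actual schema.

```python
def answer_features(answer: dict, author: dict) -> dict:
    votes = answer["upvotes"] + answer["downvotes"]
    return {
        # features about the answer quality itself
        "length": len(answer["text"]),
        "has_paragraphs": int("\n\n" in answer["text"]),
        # interaction features
        "upvote_ratio": answer["upvotes"] / max(1, votes),
        "clicks": answer["clicks"],
        # user (author) features
        "author_topic_expertise": author["answers_in_topic"],
    }
```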
Implicit signals beat explicit ones
(almost always)
Implicit vs. Explicit
● Many have acknowledged
that implicit feedback is more useful
● Is implicit feedback really always
more useful?
● If so, why?
● Implicit data is (usually):
○ More dense, and available for all users
○ More representative of user behavior vs. user reflection
○ More related to final objective function
○ Better correlated with AB test results
● E.g. Rating vs watching
Implicit vs. Explicit
● However
○ It is not always the case that
direct implicit feedback correlates
well with long-term retention
○ E.g. clickbait
● Solution:
○ Combine different forms of
implicit + explicit to better represent
long-term goal
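One hypothetical way to combine the two kinds of feedback into a single training label; the action weights and the explicit-signal weight below are assumptions for illustration, not values from the talk.

```python
# Assumed per-action weights for implicit signals.
ACTION_WEIGHTS = {"click": 0.2, "read_30s": 0.5, "upvote": 1.0, "share": 1.5}

def long_term_label(event: dict) -> float:
    """Blend implicit actions with an explicit signal (e.g. a survey rating)."""
    implicit = sum(ACTION_WEIGHTS.get(a, 0.0) for a in event["actions"])
    explicit = event.get("survey_rating", 0.0)
    # Pure clicks alone score low, so clickbait is down-weighted relative
    # to content that also earns upvotes/shares or good explicit ratings.
    return implicit + 0.5 * explicit
```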
Be thoughtful about your Training Data
Defining training/testing data
● Training a simple binary classifier for good/bad answer
○ Defining positive and negative labels -> non-trivial task
○ Is this a positive or a negative?
■ funny uninformative answer with many upvotes
■ short uninformative answer by a well-known expert in the field
■ very long informative answer that nobody reads/upvotes
■ informative answer with grammar/spelling mistakes
■ ...
Other training data issues: Time traveling
● Time traveling: usage of features that originated after the event you are trying to predict○ E.g. Your upvoting an answer is a pretty good prediction
of you reading that answer, especially because most upvotes happen AFTER you read the answer
○ Tricky when you have many related features○ Whenever I see an offline experiment with huge wins, I
ask: “Is there time traveling?”
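A minimal guard against time traveling, assuming pandas and a hypothetical feature log with a timestamp column: when building the feature vector for an event, keep only feature rows logged strictly before that event.

```python
import pandas as pd

def features_before(feature_log: pd.DataFrame,
                    label_time: pd.Timestamp) -> pd.DataFrame:
    """Keep only feature events that happened BEFORE the label event."""
    return feature_log[feature_log["timestamp"] < label_time]
```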
Your Model will learn what you teach it to learn
Training a model
● Model will learn according to:
○ Training data (e.g. implicit and explicit)
○ Target function (e.g. probability of user reading an answer)
○ Metric (e.g. precision vs. recall)
● Example 1 (made up):
○ Optimize probability of a user going to the cinema to
watch a movie and rate it “highly” by using purchase history
and previous ratings. Use NDCG of the ranking as final
metric using only movies rated 4 or higher as positives.
Example 2 - Quora’s feed
● Training data = implicit + explicit
● Target function: Value of showing a story to a user ~ weighted sum of actions: v = ∑_a v_a · 1{y_a = 1}
○ Predict probabilities for each action, then compute the expected value: v_pred = E[V | x] = ∑_a v_a · p(a | x)
● Metric: any ranking metric
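The expected-value computation above in code, with assumed per-action values and made-up predicted probabilities for a single (story, user) pair x.

```python
import numpy as np

# Per-action values v_a (assumed weights) and predicted p(a | x)
# for one story x; e.g. actions = (upvote, click, share).
v = np.array([1.0, 0.5, 2.0])     # hypothetical v_a
p = np.array([0.10, 0.40, 0.02])  # model outputs p(a | x)

v_pred = float(v @ p)             # E[V | x] = sum_a v_a * p(a | x)
```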
Offline testing
● Measure model performance, using (IR) metrics
● Offline performance = an indication for deciding on follow-up A/B tests
● A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
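As an example of an offline (IR) ranking metric, NDCG for a single query, assuming scikit-learn (ndcg_score is available in versions ≥ 0.22); the relevance grades and scores are made up.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query: graded relevance of four items vs. the model's ranking scores.
true_relevance = np.array([[3, 0, 2, 1]])
model_scores = np.array([[0.9, 0.1, 0.8, 0.3]])
print(ndcg_score(true_relevance, model_scores))
```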
Learn to deal with Presentation Bias
2D Navigational modeling
[diagram: a 2D grid of feed positions; items near the top are more likely to be seen, items further down less likely]
The curse of presentation bias
● User can only click on what you decide to show
● But, what you decide to show is the result of what your model predicted is good
● Simply treating things you show as negatives is not likely to work
● Better options:
○ Correcting for the probability a user will click on a position -> attention models
○ Explore/exploit approaches such as MAB
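A minimal explore/exploit sketch in the bandit spirit mentioned above (epsilon-greedy, one of the simplest MAB policies); the estimated click rates are hypothetical model outputs.

```python
import random

def epsilon_greedy(estimated_ctr, epsilon=0.1):
    """With prob. epsilon show a random item (explore); otherwise show the
    item the model currently ranks best (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_ctr))
    return max(range(len(estimated_ctr)), key=estimated_ctr.__getitem__)

# Usage: pick a slot among four candidates with estimated click rates.
print(epsilon_greedy([0.12, 0.30, 0.05, 0.22]))
```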
You don’t need to distribute your ML algorithm
Distributing ML
● Most of what people do in practice can fit into a multi-core machine
○ Smart data sampling
○ Offline schemes
○ Efficient parallel code
● Dangers of “easy” distributed approaches such
as Hadoop/Spark
● Do you care about costs? How about latencies?
Distributing ML
● Example of optimizing computations to fit them into
one machine
○ Spark implementation: 6 hours, 15 machines
○ Developer time: 4 days
○ C++ implementation: 10 minutes, 1 machine
● Most practical applications of Big Data can fit into
a (multicore) implementation
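A sketch of the single-machine pattern, assuming joblib and pandas; train_fold is a hypothetical stand-in for your own training code, and the sampling fraction is an assumption.

```python
from joblib import Parallel, delayed

def train_fold(data, k):
    """Hypothetical: fit a model on fold k of `data` and return its score."""
    ...

# Smart data sampling + all cores of one machine, instead of a cluster:
# sample = full_df.sample(frac=0.1, random_state=0)
# scores = Parallel(n_jobs=-1)(delayed(train_fold)(sample, k)
#                              for k in range(10))
```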
The untold story of Data Science vs. ML Engineering
Data Scientists and ML Engineers
● We all know the definition of a Data Scientist
● Where do Data Scientists fit in an organization?
○ Many companies struggling with this
● Valuable to have strong DS who can bring value
from the data
● Strong DS with solid engineering skills are unicorns and finding them is not scalable
○ DS need engineers to bring things to production
○ Engineers have too much on their plate to be willing to “productionize” cool DS projects
The data-driven ML innovation funnel
Data Research -> ML Exploration / Product Design -> AB Testing
Data Scientists and ML Engineers
● Solution:
○ (1) Define different parts of the innovation funnel
■ Part 1. Data research & hypothesis
building -> Data Science
■ Part 2. ML solution building &
implementation -> ML Engineering
■ Part 3. Online experimentation, AB Testing analysis -> Data Science
○ (2) Broaden the definition of ML Engineers
to include from coding experts with high-level
ML knowledge to ML experts with good
software skills
Data Research (Data Science)
ML Solution (ML Engineering)
AB Testing (Data Science)
Conclusions
● In data, size is not all that matters
● Understand dependencies between data, models & systems
● Choose the right metric & optimize what matters
● Be thoughtful about
○ your ML infrastructure/tools
○ organizing your teams
Questions?