Towards Increasing Predictability of Machine Learning Research

Artem Vorozhtsov

Yandex LLC

System for displaying ads on Yandex’s search result pages and partners’ websites

Ad Targeting Group

Automation of Machine Learning Research

Research with profit

Introduction

R&D best practices

— Modularity

— Computational Measurability

— Transparency and Sharing

— Automation

— Modularity

Units, reuse, abstraction

R&D best practices

— Computational Measurability

Metrics Driven Development

R&D best practices

— Transparency and Sharing

Collaboration and reproducibility

R&D best practices

— Automation

…

R&D best practices

— Modularity: units, reuse, abstractions

— Computational Measurability: MDD

— Transparency and Sharing: collaboration

— Automation: …

Happy life principles

— Kindness

— Wholeheartedness

— Love

— Discipline

— Self-development

This is a list of global things, not local (everyday) rules

clipart from Scrappindoodles

Plan

— Where does automation stop?

— Story of automating

— Everyday rules

— Questions

Automation is not

Automation is the use of machines, control systems and information technologies to reduce the need for human work and to optimize productivity in the production of goods and services.

Automation is the use of information technologies to optimize productivity and to increase predictability in research, development and other projects.

Complex KPIs

Where does automation stop?

KPI stands for Key Performance Indicator:

— Money, Clicks on Ads

— Comparison with rivals (number of segments where we are better)

— Number of Nobel Prizes

— Users & Government Loyalty

— Log-likelihood of prediction


— Strategy Thinking, Complex KPIs

— Research


… where real research starts


[Diagram labels: Intuition, Research, Creativity, Science, Tools, Complex Maths, Automated pipelines, PDEs, Metrics, Validators, SVM, PCA]

Recipe

1. Imagine how simple and agile research work could be.

2. Believe it is possible, automate as much as you can, and find the place for research.

Task: ad click probability prediction (binary classification problem)

KPI: Profit, Clicks, Conversions, Log-likelihood

Story of automation

[Diagram labels: Classifier (matrixnet), filters, reducers, metrics, GnuPlot, simulators, MapReduce STORAGE]

clipart from http://www.stoneys.ch

Story of automation

[Diagram labels: Classifier (matrixnet), GnuPlot, filters, simulators, MapReduce STORAGE]

clipart from http://clipartov.net

Story of automation

[Diagram labels: Idea, ML Infrastructure, Classifier (TMVA, …), GnuPlot, filters, simulators, MapReduce STORAGE, Report]

Pipeline (no automation)

— Prepare the raw data set for ML

— Apply filters (cuts) and mappers

— Calculate features

— Assign weights

— Split into train and test sets

— Train the classifier on the training set

— Look at the learning curve and check for overfitting

— Apply the resulting classifier model to the test set

— Calculate metrics and compare with the current best
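The manual steps above can be sketched end to end in a few dozen lines. This is only an illustration under stated assumptions: the data is synthetic, a tiny logistic regression stands in for the real classifier (matrixnet is not used here), and every name (`run_pipeline`, `train_logistic`, the `age` and `position` fields) is hypothetical.

```python
import math
import random


def predict(coef, x):
    """Click probability from a logistic model; the last coefficient is the bias."""
    z = coef[-1] + sum(c * xj for c, xj in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(labels, probs, weights):
    """Weighted average negative log-likelihood of binary labels."""
    eps = 1e-12
    loss = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)
        loss -= w * (math.log(p) if y else math.log(1.0 - p))
    return loss / sum(weights)


def train_logistic(rows, labels, weights, epochs=200, lr=0.5):
    """Batch gradient descent; a stand-in for the production classifier."""
    coef = [0.0] * (len(rows[0]) + 1)
    total = sum(weights)
    for _ in range(epochs):
        grad = [0.0] * len(coef)
        for x, y, w in zip(rows, labels, weights):
            err = predict(coef, x) - y
            for j, xj in enumerate(x):
                grad[j] += w * err * xj
            grad[-1] += w * err
        coef = [c - lr * g / total for c, g in zip(coef, grad)]
    return coef


def pick(seq, ids):
    return [seq[i] for i in ids]


def run_pipeline(seed=0):
    rng = random.Random(seed)
    # Prepare a raw data set (synthetic: lower ad positions get fewer clicks).
    raw = []
    for _ in range(2000):
        position = rng.randint(1, 5)
        p_click = 1.0 / (1.0 + math.exp(0.6 * position - 0.5))
        raw.append({"age": rng.randint(5, 70), "position": position,
                    "click": int(rng.random() < p_click)})
    # Apply filters (cuts).
    kept = [r for r in raw if r["age"] > 10]
    # Calculate features and assign (here uniform) weights.
    features = [[r["position"] / 5.0, r["age"] / 70.0] for r in kept]
    labels = [r["click"] for r in kept]
    weights = [1.0] * len(kept)
    # Split into train and test sets.
    ids = list(range(len(kept)))
    rng.shuffle(ids)
    cut = int(0.8 * len(ids))
    train, test = ids[:cut], ids[cut:]
    # Train on the training set.
    coef = train_logistic(pick(features, train), pick(labels, train),
                          pick(weights, train))
    # Apply the model to both sets; a large learn/test gap hints at overfitting.
    report = {}
    for name, part in (("learn", train), ("test", test)):
        probs = [predict(coef, x) for x in pick(features, part)]
        report[name] = log_loss(pick(labels, part), probs, pick(weights, part))
    # Compare with a "current best" stand-in: a constant click-rate predictor.
    rate = sum(pick(labels, train)) / len(train)
    report["baseline"] = log_loss(pick(labels, test), [rate] * len(test),
                                  pick(weights, test))
    return report


report = run_pipeline()
for name, value in report.items():
    print(f"{name:<8} log_loss = {value:.5f}")
```

Every new idea (a filter, a feature, a weighting scheme) means editing one of these steps and rerunning all of the others by hand, which is the cost the automated version removes.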

Story of automation

Pipeline (no automation)

— Prepare the raw data set for ML

— Apply filters (cuts) and mappers (add a new filter)

— Calculate features (add a new feature)

— Assign weights (new idea for weighting)

— Split into train and test sets

— Train the classifier on the training set (new training options)

— Look at the learning curve and check for overfitting

— Apply the resulting classifier model to the test set

— Calculate metrics and compare with the current best

Story of automation

— Create and commit YAML file

— Read the report

Story of automation

Engine: 'matrixnet'  # options: VW, TMVA (TODO!)
Mappers: |
    [
        Join('PLACE FOR NEW FEATURES'),
        Grep('r.Age > 10 and PLACE FOR GREP IDEA'),
        Mapper('r.Weight = PLACE FOR WEIGHT IDEA'),
        yabs.matrixnet.factor.DefaultFactors(),
    ]
MailTo: [email protected]: 'PLACE FOR NEW OPTIONS'
Tables: 'EFHFactors:last_14_days'
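One way to read the `Mappers` list in this file is as a chain of small record transformers applied in order. The sketch below is hypothetical: the names `Grep` and `Mapper` come from the slide, but their implementation here, the `run_mappers` helper, the `RegionID` condition and the weighting rule are all assumptions made up for illustration, and expression strings are evaluated with `eval`/`exec`, which is only acceptable for trusted configuration files.

```python
from types import SimpleNamespace


class Grep:
    """Keep only the records for which the expression over `r` is true."""

    def __init__(self, expression):
        self.code = compile(expression, "<grep>", "eval")

    def __call__(self, records):
        for r in records:
            if eval(self.code, {"__builtins__": {}}, {"r": r}):
                yield r


class Mapper:
    """Run an assignment statement over `r` for every record."""

    def __init__(self, statement):
        self.code = compile(statement, "<mapper>", "exec")

    def __call__(self, records):
        for r in records:
            exec(self.code, {"__builtins__": {}}, {"r": r})
            yield r


def run_mappers(records, mappers):
    """Chain the mappers lazily, then materialise the result."""
    for mapper in mappers:
        records = mapper(records)
    return list(records)


# A dict standing in for the parsed experiment file.
config = {
    "Engine": "matrixnet",
    "Mappers": [
        Grep("r.Age > 10 and r.RegionID == 213"),      # the "grep idea"
        Mapper("r.Weight = 2.0 if r.Click else 1.0"),  # the "weight idea"
    ],
}

records = [
    SimpleNamespace(Age=8, RegionID=213, Click=1),
    SimpleNamespace(Age=30, RegionID=213, Click=1),
    SimpleNamespace(Age=45, RegionID=2, Click=0),
    SimpleNamespace(Age=22, RegionID=213, Click=0),
]

result = run_mappers(records, config["Mappers"])
for r in result:
    print(r.Age, r.RegionID, r.Click, r.Weight)
```

With this shape, trying a new filter or weighting idea is a one-line change in the configuration rather than a code change in the pipeline.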

Pipeline (with automation)

Story of automation





[Diagram labels: YAML-file, ML Infrastructure, GnuPlot, filters, simulators, MapReduce STORAGE, Report]

Story of automation

metric   | learn   | test    | test cur.
---------------------------------------
ll_p     | 0.38171 | 0.36074 | 0.14527
ll_r     | 0.38966 | 0.37151 | 0.33247
f1_p     | 0.44869 | 0.44430 | 0.43266
fom_p    | 0.91526 | 0.90580 | 0.88528
kl_p     | 0.31143 | 0.29581 | 0.13186
log_loss | 0.39965 | 0.40354 | 0.44178
mcc_p    | 0.30788 | 0.30159 | 0.28512
q10_p    | 2.6632  | 2.5994  | 2.5261
q2_p     | 1.6315  | 1.6212  | 1.5886
q_p      | 1.6244  | 1.6089  | 1.5777
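A report of this shape can be produced by evaluating the same set of metric functions on each data set. The sketch below computes three standard metrics (log loss, F1 and the Matthews correlation coefficient) and prints them side by side. It is an assumption that the rows `log_loss`, `f1_p` and `mcc_p` in the report correspond to these definitions; the other rows are not reproduced, and the function names and toy data are hypothetical.

```python
import math


def log_loss(labels, probs):
    """Average negative log-likelihood of binary labels."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(labels)


def confusion(labels, probs, threshold=0.5):
    """Counts of true/false positives and negatives at a fixed threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, probs):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(labels, probs):
    tp, fp, fn, _ = confusion(labels, probs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(labels, probs):
    tp, fp, fn, tn = confusion(labels, probs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


METRICS = {"log_loss": log_loss, "f1": f1_score, "mcc": mcc}


def build_report(datasets):
    """One row per metric, one column per data set."""
    names = list(datasets)
    lines = ["metric   | " + " | ".join(f"{n:>8}" for n in names)]
    for metric, fn in METRICS.items():
        values = [fn(*datasets[n]) for n in names]
        lines.append(f"{metric:<8} | " + " | ".join(f"{v:8.5f}" for v in values))
    return "\n".join(lines)


# Toy (labels, predicted probabilities) pairs, invented for illustration.
datasets = {
    "learn": ([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]),
    "test": ([1, 1, 0, 0], [0.7, 0.6, 0.3, 0.8]),
}
print(build_report(datasets))
```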

Report

Story of automation

[Diagram labels: Idea, YAML-file, ML Infrastructure, GnuPlot, filters, simulators, MapReduce STORAGE, Report (llp), Experiment (1%), Report (Money, Clicks), Deploy new model, Production, Report (Money, Clicks)]

Challenges (scientific)

— Multi-armed bandit problem

• Banner is black box with estimated CTR

• Historical data is used for prediction

— Default model bias

• Training set is generated by default model

— Move from KPIs to metrics and cost functions

• Business Strategy (approx) metrics

— Balancing between different cost functions

• Clicks, Money, Conversions, CPA
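To make the first challenge concrete, here is a minimal epsilon-greedy simulation of the multi-armed bandit setting: each banner is a black box with a hidden click-through rate, and the only information available is the history of shows and clicks. Epsilon-greedy is chosen here purely as a simple textbook strategy; the talk does not say which strategy is used in production, and the banner names and CTR values are invented.

```python
import random


def epsilon_greedy(true_ctr, rounds=5000, epsilon=0.1, seed=0):
    """Simulate banner selection; returns (shows per banner, estimated CTRs)."""
    rng = random.Random(seed)
    banners = list(true_ctr)
    shows = {b: 0 for b in banners}
    clicks = {b: 0 for b in banners}

    def estimate(banner):
        # Unseen banners get an optimistic estimate so they are tried early.
        return clicks[banner] / shows[banner] if shows[banner] else 1.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            banner = rng.choice(banners)        # explore
        else:
            banner = max(banners, key=estimate)  # exploit historical data
        shows[banner] += 1
        if rng.random() < true_ctr[banner]:      # the hidden "black box"
            clicks[banner] += 1
    return shows, {b: estimate(b) for b in banners}


# Hypothetical banners with hidden click-through rates.
shows, estimates = epsilon_greedy({"A": 0.02, "B": 0.05, "C": 0.10})
for banner in shows:
    print(banner, shows[banner], round(estimates[banner], 4))
```

The same simulation also hints at the default model bias listed above: the click history is collected mostly for banners the current policy already prefers, so estimates for the others rest on far fewer shows.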

Challenge (automation): Graphical Pipelines Framework

[Diagram boxes: Simulation data, Experimental data, map, train, Cut by threshold, Show mass distribution, Filter background, Estimate mixture parameters, classify, map, Run]
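A pipeline drawn as boxes and arrows is a directed acyclic graph, and running it amounts to executing the boxes in topological order. The sketch below uses the standard-library `graphlib` module (Python 3.9+) for that. The box names are borrowed from the diagram, but the arrows between them, the toy step functions and the `run_graph` helper are assumptions for illustration only.

```python
from graphlib import TopologicalSorter


def run_graph(steps, dependencies, inputs):
    """Run each step once all of its dependencies have produced a result."""
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:  # source nodes are given as inputs
            continue
        args = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](*args)
    return results


def train(simulation):
    """Toy 'training': a threshold halfway between the class means."""
    signal = [value for value, label in simulation if label == 1]
    background = [value for value, label in simulation if label == 0]
    return (sum(signal) / len(signal) + sum(background) / len(background)) / 2


def cut_by_threshold(experimental, threshold):
    return [value for value in experimental if value > threshold]


def show_mass_distribution(selected):
    return f"{len(selected)} events selected: {selected}"


steps = {
    "train": train,
    "cut by threshold": cut_by_threshold,
    "show mass distribution": show_mass_distribution,
}

# Each box lists the boxes whose output it consumes (hypothetical arrows).
dependencies = {
    "train": ["simulation data"],
    "cut by threshold": ["experimental data", "train"],
    "show mass distribution": ["cut by threshold"],
}

inputs = {
    "simulation data": [(1.0, 0), (1.2, 0), (3.0, 1), (3.4, 1)],
    "experimental data": [0.9, 2.0, 2.5, 3.3],
}

results = run_graph(steps, dependencies, inputs)
print(results["show mass distribution"])
```

A graphical front end would only need to edit the `dependencies` mapping; the execution logic stays the same.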

Automation for me is:

— Tools

What is Automation?


— Tools (in TMVA)

What is Automation?

Normalization

Rectangular Cuts

SVM

Boosted Trees

Gaussianisation

PCA

PDE

Decorrelation

Genetic Algorithms


— Tools

• Macro language (high-level language) for expressing ideas

What is Automation?

[Diagram boxes: Simulation data, Experimental data, map, train, Filter by threshold, Show mass distribution, Filter background, Estimate mixture parameters, classify, map]


— Tools



— Infrastructure

• Connecting with arrows

• Whole pipeline coverage

What is Automation?


— Tools



— Infrastructure



— Specialization

• Collaboration and delegation

What is Automation?


—…

— Specialization


What is Automation?

[Diagram labels: classifier, train set, model, parameters, Parameters]

What is Automation?

[Diagram labels: Comp. Complexity, Model, Proper, Defective, Cost Function, Learning rate, Tree depth, Regularization, Features Types, Number of trees]


— Tools

• Macro language (high-level language) for expressing ideas

— Infrastructure



— Specialization


What is Automation?

(1) Copy and paste data

— Add new boxes to automated pipeline

— Automate transport between all boxes

— Do not use strange software

Everyday rules: anti-patterns

(2) Execute data pipeline steps manually in a cycle.

— Define new command for this pipeline

— Use standard formats for data streams

— Define needed ‘mappers’ and ‘reducers’ for data stream and use them
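As a rough illustration of the remedies for anti-pattern (2), the sketch below treats the data as a stream of JSON lines and composes small, reusable mappers and a reducer over it. JSON lines as the "standard format", the helper names and the sample log records are all assumptions; the talk does not specify them.

```python
import json


def parse_json_lines(lines):
    """Standard stream format (assumed here): one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def grep(records, predicate):
    """Mapper that drops records failing the predicate."""
    return (r for r in records if predicate(r))


def map_records(records, fn):
    """Mapper that transforms every record."""
    return (fn(r) for r in records)


def reduce_by_key(records, key, fold, initial):
    """Reducer that folds all records sharing a key into one value."""
    state = {}
    for r in records:
        k = key(r)
        state[k] = fold(state.get(k, initial), r)
    return state


# Invented sample of an ad log.
log_lines = [
    '{"region": 213, "age": 30, "click": 1}',
    '{"region": 213, "age": 8, "click": 0}',
    '{"region": 2, "age": 41, "click": 0}',
    '{"region": 213, "age": 25, "click": 0}',
    '{"region": 2, "age": 19, "click": 1}',
]

records = parse_json_lines(log_lines)
records = grep(records, lambda r: r["age"] > 10)
records = map_records(records, lambda r: {"region": r["region"], "click": r["click"]})
stats = reduce_by_key(
    records,
    key=lambda r: r["region"],
    fold=lambda acc, r: (acc[0] + 1, acc[1] + r["click"]),  # (shows, clicks)
    initial=(0, 0),
)
ctr = {region: clicks / shows for region, (shows, clicks) in stats.items()}
print(ctr)
```

Once the steps are generators over a common format, wrapping the whole chain in a single command is straightforward, and nothing has to be rerun by hand in a cycle.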


(3) Your code is >3 times longer than its natural-language description

— Start working on new tools (macro languages, DSL)


(4) It takes >1 man-hour to recalculate the final graph of your research

— Automate the whole pipeline


(5) You write a line of code that has no chance of being executed >10,000 times





Code (>10,000 times)

    from numpy import argsort, corrcoef, linalg, mean, transpose

    def pca(data, reduce_dims=0, corr=True, normalise=False,
            subtract_mean=True):
        data_mean = None
        if subtract_mean:
            data_mean = mean(data, axis=0)
            data -= data_mean
        transposed = transpose(data)
        cov_matrix = corrcoef(transposed)
        # Compute eigenvalues and sort into descending order
        eigen_vals, eigen_vecs = linalg.eig(cov_matrix)
        indices = argsort(eigen_vals)
        indices = indices[::-1]
        eigen_vecs = eigen_vecs[:, indices]
        eigen_vals = eigen_vals[indices]

Interactive Data Analysis (once)

    data = filter(data, "RegionID = 213")
    data1, data2 = split_random(data)
    data2ext = decorrelate(data1, data2, fields=["age", "income", …])
    report = check_features(data2ext)
    show_report(report)



Choose one action at a time, (A) or (B):

A. Interactive data analysis using high-level tools

B. Coding: extending or improving the tools library or infrastructure. Delegate it?

There are no other options.


(6) Your colleagues think that you are doing something useless

— Stop doing questionable things


(7) You have a dream, and it hasn’t come true yet

— Tell Yandex about your dream


Artem Vorozhtsov

Head of Ads Targeting Group

[email protected]

Thank you!
