Holistic approach to machine learning

Post on 15-Apr-2017

@SrcMinistry @MariuszGil

Holistic approach to Machine Learning

Data processing

We are developers

We love to…

Write code

Write tests

Use DDD/OOP/AOP/SOLID/GRASP/XYZ

What for?

Write code

Make money

Make users happy

Solve problems

Solve problems by writing code, to make users happy and make money

Mapping all problems to DDD/OOP/AOP/SOLID/GRASP/XYZ

Test first

Understand the problem first

Domain knowledge

Ask expert

Real problems

Data classification

Bot detection

Minimize risk of error

+ value estimator

+ chance of sale

+ $ optimization

Tens of thousands of historical transactions

Tens of data components

Hundreds of data components

IF-unsolvable

Machine Learning

The theory

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

Tom M. Mitchell

Task

Typical ML techniques

Classification

Regression

Clustering

Dimensionality reduction

Association learning

[Figure: three scatter plots of data points; axes: feature 1 and feature 2]

Experience

Typical ML paradigms

Supervised learning

Unsupervised learning

Reinforcement learning

Accuracy
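In Mitchell's terms, accuracy is one common choice for the performance measure P of a classifier: the fraction of predictions that match the true labels. A minimal sketch in plain Scala follows; the object and method names are illustrative and do not come from the talk.

```scala
object AccuracySketch {
  /** Fraction of predictions that equal the corresponding true label. */
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.length == actual.length, "inputs must have the same length")
    require(predicted.nonEmpty, "inputs must not be empty")
    // Count the positions where the prediction matches the label.
    val correct = predicted.zip(actual).count { case (p, a) => p == a }
    correct.toDouble / predicted.length
  }
}
```

For example, `AccuracySketch.accuracy(Seq(1, 0, 1, 1), Seq(1, 0, 0, 1))` counts three matches out of four predictions.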

The practice

data + algo = result

+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
+-------+--------+------+---------+---------+-------+
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+

Learning Data

Algorithm Learning

Classifier Model

Real Data

Classification
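The flow on these slides (learning data goes into a learning algorithm, which yields a model that is then applied to real data) can be sketched with one of the simplest possible learners. The example below uses a 1-nearest-neighbour price estimate over the columns of the car table; the choice of algorithm and all names (`Car`, `train`) are assumptions made for illustration, not the approach used in the talk.

```scala
object NearestNeighbourSketch {
  // One row of the learning data; fields mirror numeric columns of the slide's table.
  final case class Car(year: Int, mileage: Int, service: Int, price: Int)

  // "Learning" for 1-nearest-neighbour only memorises the rows. The returned
  // model is a function from (year, mileage, service) to an estimated price.
  def train(learningData: Seq[Car]): (Int, Int, Int) => Int = {
    require(learningData.nonEmpty, "learning data must not be empty")
    (year, mileage, service) => {
      // Squared Euclidean distance between a known row and the queried car.
      def squaredDistance(c: Car): Double = {
        val dy = (c.year - year).toDouble
        val dm = (c.mileage - mileage).toDouble
        val ds = (c.service - service).toDouble
        dy * dy + dm * dm + ds * ds
      }
      // The estimate is the price of the closest known row.
      learningData.minBy(squaredDistance).price
    }
  }
}
```

Because mileage is numerically far larger than the other columns, it dominates the distance here, which is one reason the pipeline includes a feature scaling step.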

Failure recipe

+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
+-------+--------+------+---------+---------+-------+
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+

+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+

+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 45000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+

+-------+--------+-----+------+---------+---------+--------+-------+
| brand | model  | gen | year | mileage | service | repair | price |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 45000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+-----+------+---------+---------+--------+-------+

+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| brand | model  | gen | year | mileage | service | repair | igla | crying German | price |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 0    | 0             | 67000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 1    | 1             | 30000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 0    | 0             | 45000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 1    | 0             | 30000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+

Understand your data first

Exploratory analysis

http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg
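One of the failure-recipe tables contains two listings with identical features but different prices, a sign that a column is missing. A minimal exploratory check for such rows might look like the sketch below; `Listing` and `conflictingGroups` are hypothetical names, and the check itself is an illustration rather than something shown in the talk.

```scala
object ConflictingRowsSketch {
  // One table row; fields mirror the columns of the slide's table.
  final case class Listing(brand: String, model: String, year: Int, mileage: Int,
                           service: Int, repair: Int, price: Int)

  // Group rows by every column except the price (the price is blanked to 0 in
  // the grouping key) and keep only the groups in which identical features
  // map to more than one distinct price.
  def conflictingGroups(rows: Seq[Listing]): Map[Listing, Set[Int]] =
    rows
      .groupBy(_.copy(price = 0))
      .map { case (key, group) => key -> group.map(_.price).toSet }
      .filter { case (_, prices) => prices.size > 1 }
}
```

A non-empty result suggests that the features do not yet explain the target, as with the generation column that the slides add next.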

ML pipeline

[Figure: ML pipeline diagram]

Stages: Raw Data Collection, Pre-processing, Sampling, Training Dataset, Algorithm Training, Optimization, Post-processing, Final Model

Pre-processing steps named in the diagram: Missing Data, Feature Extraction, Feature Selection, Feature Scaling, Dimensionality Reduction

Other labels in the diagram: Data, Data Split, Test Dataset, Cross Validation, Performance Metrics, Model Selection, Final Model Evaluation, Classification
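Feature scaling, one of the pre-processing steps named in the pipeline, brings columns with very different ranges (for example year and mileage) onto a comparable scale. The sketch below shows min-max scaling of a single column in plain Scala; it is one common variant chosen for illustration, and the names are not from the talk.

```scala
object FeatureScalingSketch {
  // Rescale one feature column to the range [0, 1] (min-max scaling).
  def minMaxScale(values: Seq[Double]): Seq[Double] = {
    require(values.nonEmpty, "column must not be empty")
    val lo = values.min
    val hi = values.max
    if (hi == lo) values.map(_ => 0.0) // constant column: nothing to rescale
    else values.map(v => (v - lo) / (hi - lo))
  }
}
```

Applied to a mileage column, the smallest value maps to 0.0 and the largest to 1.0. In practice the minimum and maximum should be taken from the training data only and then reused for the test data.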

Classification algorithms

Linear Classification: Logistic Regression, Linear Discriminant Analysis, PLS Discriminant Analysis

Non-Linear Classification: Mixture Discriminant Analysis, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Neural Networks, Flexible Discriminant Analysis, Support Vector Machines, k-Nearest Neighbor, Naive Bayes

Decision Trees for Classification: Classification and Regression Trees, C4.5, PART, Bagging CART, Random Forest, Gradient Boosted Machines, Boosted C5.0

Regression algorithms

Linear Regression: Ordinary Least Squares Regression, Stepwise Linear Regression, Principal Component Regression, Partial Least Squares Regression

Non-Linear Regression / Penalized Regression: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), ElasticNet, Multivariate Adaptive Regression Splines, Support Vector Machines, k-Nearest Neighbor, Neural Network

Decision Trees for Regression: Classification and Regression Trees, Conditional Decision Tree, Rule System, Bagging CART, Random Forest, Gradient Boosted Machine, Cubist

The algorithm is only one element in the ML chain

Everything may be important for ML

Testing

Test datasets

60% 20% 20%
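The slide gives only the percentages; reading them as training, validation and test portions is an assumption. Under that assumption, a minimal sketch of a seeded 60/20/20 split in plain Scala could look like this (the object and method names are illustrative):

```scala
import scala.util.Random

object DataSplitSketch {
  // Shuffle with a fixed seed so the split is reproducible, then cut the
  // shuffled data into 60% / 20% / 20% parts.
  def split[A](data: Seq[A], seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    // Integer arithmetic avoids floating-point rounding at the cut points.
    val firstCut = data.length * 6 / 10
    val secondCut = data.length * 8 / 10
    (shuffled.slice(0, firstCut),
     shuffled.slice(firstCut, secondCut),
     shuffled.drop(secondCut))
  }
}
```

Usage: `val (training, validation, test) = DataSplitSketch.split(rows, seed = 11L)`, where `rows` stands for whatever collection holds your data. The model is fitted on the first part, tuned on the second, and the third is touched only for the final evaluation.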

Andrew Ng's rule of ML

Does it do well on the training data?

No: better features / better parameters

Yes: does it do well on the test data?

No: more data

Yes: done!

by Andrew Ng
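The flowchart can be read as a small decision procedure: if the model does poorly on the training data, improve features or parameters; if it does well there but poorly on the test data, get more data; otherwise stop. This reading of the arrows is an assumption based on the labels, and `goodEnough` in the sketch below is a hypothetical threshold, not something the slide defines.

```scala
object NextStepSketch {
  sealed trait Step
  case object ImproveFeaturesOrParameters extends Step
  case object GetMoreData extends Step
  case object Done extends Step

  // Scores are any "higher is better" measure, such as accuracy.
  // `goodEnough` is a hypothetical threshold chosen by the team.
  def nextStep(trainScore: Double, testScore: Double, goodEnough: Double): Step =
    if (trainScore < goodEnough) ImproveFeaturesOrParameters // poor on training data
    else if (testScore < goodEnough) GetMoreData             // poor on test data
    else Done
}
```

The order of the checks matters: the test score is consulted only once the training score is acceptable.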

Calculate, measure, apply later

The code

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// `sc` is an existing SparkContext, for example the one provided by spark-shell.

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model.
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")
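The example reports the area under the ROC curve. To make that number less opaque: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Scala, without Spark. It is an illustration of the metric's meaning, not a description of how `BinaryClassificationMetrics` computes it internally, and it assumes labels are encoded as 1.0 and 0.0.

```scala
object AucSketch {
  // Input mirrors the (score, label) pairs of the Spark example.
  // Assumes positive label 1.0 and negative label 0.0.
  def areaUnderRoc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val positives = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val negatives = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(positives.nonEmpty && negatives.nonEmpty, "need examples of both classes")
    // Compare every positive score with every negative score;
    // a win counts 1, a tie counts one half.
    val wins = for (p <- positives; n <- negatives) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (positives.length.toLong * negatives.length)
  }
}
```

The pairwise comparison is quadratic in the number of examples, so it suits small samples and sanity checks rather than full datasets.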

The art of asking the right questions about the right data

@SrcMinistry

Thanks!

@MariuszGil