Holistic approach to machine learning

Transcript of Holistic approach to machine learning

Page 1: Holistic approach to machine learning

@SrcMinistry @MariuszGil

Holistic approach to Machine Learning

Data processing

Page 2: Holistic approach to machine learning

@SrcMinistry

Page 3: Holistic approach to machine learning

We are developers

Page 4: Holistic approach to machine learning

We love to…

Page 5: Holistic approach to machine learning

Write code

Page 6: Holistic approach to machine learning

Write tests

Page 7: Holistic approach to machine learning

Use DDD/OOP/AOP/SOLID/GRASP/XYZ

Page 8: Holistic approach to machine learning

What for?

Page 9: Holistic approach to machine learning

Write code

Page 10: Holistic approach to machine learning

Make money

Page 11: Holistic approach to machine learning

Make users happy

Page 12: Holistic approach to machine learning

Solve problems

Page 13: Holistic approach to machine learning

Solve problems by writing code, to make users happy and make money

Page 14: Holistic approach to machine learning

Solve problems by writing code, to make users happy and make money

Solve problems

Page 15: Holistic approach to machine learning

Solve problems by writing code, to make users happy and make money

Solve

Page 16: Holistic approach to machine learning

Solve problems by writing code, to make users happy and make money

problems

Page 17: Holistic approach to machine learning

Mapping all problems to DDD/OOP/AOP/SOLID/GRASP/XYZ

Page 18: Holistic approach to machine learning

Test first

Page 19: Holistic approach to machine learning

Understand the problem first

Page 20: Holistic approach to machine learning

Domain knowledge

Page 21: Holistic approach to machine learning

Ask expert

Page 22: Holistic approach to machine learning

Real problems

Page 23: Holistic approach to machine learning

Data classification

Page 24: Holistic approach to machine learning

Bot detection

Page 25: Holistic approach to machine learning

Minimize risk of error

Page 26: Holistic approach to machine learning
Page 27: Holistic approach to machine learning

+ value estimator

Page 28: Holistic approach to machine learning

+ chance of sale

Page 29: Holistic approach to machine learning

+ $ optimization

Page 30: Holistic approach to machine learning

Tens of thousands of historical transactions

Page 31: Holistic approach to machine learning

Tens of data components

Page 32: Holistic approach to machine learning

Hundreds of data components

Page 33: Holistic approach to machine learning
Page 34: Holistic approach to machine learning

IF-Unsolvable

Page 35: Holistic approach to machine learning

Machine Learning

Page 36: Holistic approach to machine learning

The theory

Page 37: Holistic approach to machine learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E

Tom M. Mitchell

Page 38: Holistic approach to machine learning

Task

Page 39: Holistic approach to machine learning

Typical ML techniques

Classification
Regression
Clustering
Dimensionality reduction
Association learning

Page 40: Holistic approach to machine learning

[Figure: scatter plot of data points; axes: feature 1 and feature 2]

Page 41: Holistic approach to machine learning

[Figure: scatter plot of data points; axes: feature 1 and feature 2]

Page 42: Holistic approach to machine learning

[Figure: scatter plot of data points; axes: feature 1 and feature 2]

Page 43: Holistic approach to machine learning

Experience

Page 44: Holistic approach to machine learning

Typical ML paradigms

Supervised learning
Unsupervised learning
Reinforcement learning

Page 45: Holistic approach to machine learning

Accuracy
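Accuracy is one common choice for the performance measure P in Mitchell's definition. A minimal sketch of how it could be computed for integer class labels follows; the object and function names are hypothetical and not part of the talk.

```scala
object AccuracySketch {
  // Fraction of predictions that match the true labels.
  // Hypothetical helper for illustration only.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.length == actual.length && actual.nonEmpty,
      "need two non-empty sequences of equal length")
    val correct = predicted.zip(actual).count { case (p, a) => p == a }
    correct.toDouble / actual.length
  }
}
```

For example, three correct predictions out of four give an accuracy of 0.75.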

Page 46: Holistic approach to machine learning

The practice

Page 47: Holistic approach to machine learning
Page 48: Holistic approach to machine learning

data + algo = result

Page 49: Holistic approach to machine learning

+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
+-------+--------+------+---------+---------+-------+
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+

Page 50: Holistic approach to machine learning

Learning Data

Algorithm Learning

Classifier Model

Real Data

Classification

Page 51: Holistic approach to machine learning

Failure recipe

Page 52: Holistic approach to machine learning

+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
+-------+--------+------+---------+---------+-------+
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+

Page 53: Holistic approach to machine learning

+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+

Page 54: Holistic approach to machine learning

+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 45000 |
+-------+--------+------+---------+---------+--------+-------+
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
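In the table above, the second and third rows share every feature value but differ in price. One reading of this slide is that such rows signal a missing feature. A minimal sketch of how conflicting rows might be detected follows; the `Car` case class and the `conflictingGroups` helper are hypothetical names, not code from the talk.

```scala
object ConflictSketch {
  // One row of the used-car table; field names follow the column headers.
  final case class Car(brand: String, model: String, year: Int,
                       mileage: Int, service: Int, repair: Int, price: Int)

  // All columns except the target (price).
  type Features = (String, String, Int, Int, Int, Int)

  // Groups rows by their feature values and keeps only the groups
  // that contain more than one distinct price.
  def conflictingGroups(cars: Seq[Car]): Map[Features, Set[Int]] =
    cars
      .groupBy(c => (c.brand, c.model, c.year, c.mileage, c.service, c.repair))
      .map { case (features, rows) => features -> rows.map(_.price).toSet }
      .filter { case (_, prices) => prices.size > 1 }

  // The four rows shown on the slide.
  val slideRows: Seq[Car] = Seq(
    Car("ford", "mondeo", 2005, 123000, 9000, 900, 67000),
    Car("ford", "mondeo", 2005, 175000, 900, 9000, 30000),
    Car("ford", "mondeo", 2005, 175000, 900, 9000, 45000),
    Car("ford", "focus", 2010, 45000, 3700, 3000, 30000)
  )
}
```

Calling `ConflictSketch.conflictingGroups(ConflictSketch.slideRows)` returns a single group: the 2005 Mondeo with 175000 mileage, priced at both 30000 and 45000.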

Page 55: Holistic approach to machine learning

+-------+--------+-----+------+---------+---------+--------+-------+
| brand | model  | gen | year | mileage | service | repair | price |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 67000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 30000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 45000 |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+-----+------+---------+---------+--------+-------+

Page 56: Holistic approach to machine learning

+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| brand | model  | gen | year | mileage | service | repair | igla | crying German | price |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 0    | 0             | 67000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 1    | 1             | 30000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 0    | 0             | 45000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 1    | 0             | 30000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+

Page 57: Holistic approach to machine learning

Understand your data first

Page 58: Holistic approach to machine learning

Exploratory analysis
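A first exploratory pass often starts with simple per-column summaries. The sketch below is one possible starting point, assuming a numeric column held in memory; `Summary` and `summarize` are hypothetical names.

```scala
object ExploreSketch {
  // Basic descriptive statistics for one numeric column.
  final case class Summary(count: Int, min: Double, max: Double, mean: Double)

  def summarize(values: Seq[Double]): Summary = {
    require(values.nonEmpty, "cannot summarize an empty column")
    Summary(values.length, values.min, values.max, values.sum / values.length)
  }
}
```

Applied to the mileage values of a four-row sample (123000, 175000, 175000, 45000), this reports a minimum of 45000, a maximum of 175000 and a mean of 129500.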

Page 59: Holistic approach to machine learning

http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg

Page 60: Holistic approach to machine learning

ML pipeline

Page 61: Holistic approach to machine learning

[Diagram: ML pipeline. Labels: Raw Data Collection, Pre-processing, Sampling, Data Split, Training Dataset, Test Dataset, Feature Selection, Feature Scaling, Dimensionality Reduction, Missing Data, Feature Extraction, Algorithm Training, Cross Validation, Performance Metrics, Model Selection, Optimization, Post-processing, Final Model Evaluation, Final model, Data, Classification]

Page 62: Holistic approach to machine learning

[Diagram: ML pipeline. Labels: Raw Data Collection, Pre-processing, Sampling, Data Split, Training Dataset, Test Dataset, Feature Selection, Feature Scaling, Dimensionality Reduction, Missing Data, Feature Extraction, Algorithm Training, Cross Validation, Performance Metrics, Model Selection, Optimization, Post-processing, Final Model Evaluation, Final model, Data, Classification]
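As an illustration of one pre-processing step named in the pipeline, Feature Scaling, here is a minimal min-max scaling sketch. It assumes a single numeric column held in memory; `minMaxScale` is a hypothetical helper, and the talk does not say which scaling method was used.

```scala
object ScalingSketch {
  // Rescales values linearly into [0, 1].
  // A constant column (max == min) is mapped to all zeros.
  def minMaxScale(values: Seq[Double]): Seq[Double] = {
    require(values.nonEmpty, "cannot scale an empty column")
    val lo = values.min
    val hi = values.max
    if (hi == lo) values.map(_ => 0.0)
    else values.map(v => (v - lo) / (hi - lo))
  }
}
```

In a real pipeline the minimum and maximum would normally be taken from the training dataset only and then reused for the test dataset, so that no information leaks from test to training.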

Page 63: Holistic approach to machine learning

Classification algorithms

Linear Classification
Logistic Regression
Linear Discriminant Analysis
PLS Discriminant Analysis

Non-Linear Classification
Mixture Discriminant Analysis
Quadratic Discriminant Analysis
Regularized Discriminant Analysis
Neural Networks
Flexible Discriminant Analysis
Support Vector Machines
k-Nearest Neighbor
Naive Bayes

Decision Trees for Classification
Classification and Regression Trees
C4.5
PART
Bagging CART
Random Forest
Gradient Boosted Machines
Boosted C5.0
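To make one entry of this list concrete, here is a minimal k-Nearest Neighbor sketch for points with two features, in plain Scala. It is an illustration under simple assumptions (squared Euclidean distance, majority vote, ties resolved arbitrarily); the names are hypothetical and it is not the implementation used in the talk.

```scala
object KnnSketch {
  // A point described by two numeric features.
  type Point = (Double, Double)

  // Returns the majority label among the k training points closest to the query.
  def classify(train: Seq[(Point, String)], query: Point, k: Int): String = {
    require(train.nonEmpty && k >= 1, "need training data and k >= 1")

    // Squared Euclidean distance; the square root is not needed for ranking.
    def dist2(a: Point, b: Point): Double = {
      val dx = a._1 - b._1
      val dy = a._2 - b._2
      dx * dx + dy * dy
    }

    train
      .sortBy { case (p, _) => dist2(p, query) }
      .take(k)
      .groupBy { case (_, label) => label }
      .maxBy { case (_, members) => members.size }
      ._1
  }
}
```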

Page 64: Holistic approach to machine learning

Regression algorithms

Linear Regression
Ordinary Least Squares Regression
Stepwise Linear Regression
Principal Component Regression
Partial Least Squares Regression

Non-Linear Regression / Penalized Regression
Ridge Regression
Least Absolute Shrinkage and Selection Operator (LASSO)
ElasticNet
Multivariate Adaptive Regression Splines
Support Vector Machines
k-Nearest Neighbor
Neural Network

Decision Trees for Regression
Classification and Regression Trees
Conditional Decision Tree
Rule System
Bagging CART
Random Forest
Gradient Boosted Machine
Cubist
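As a small illustration of the first entry, Ordinary Least Squares, the sketch below fits a straight line to one input variable using the closed-form solution. It is a simplified single-feature case with hypothetical names; real problems with many features would use a library.

```scala
object OlsSketch {
  // Fits y = intercept + slope * x by ordinary least squares.
  // Returns (intercept, slope).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.length >= 2,
      "need at least two (x, y) pairs")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of cross-deviations and sum of squared deviations of x.
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "x values must not all be equal")
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }
}
```

For points that lie exactly on y = 1 + 2x, such as (1, 3), (2, 5) and (3, 7), the fit recovers an intercept of 1 and a slope of 2.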

Page 65: Holistic approach to machine learning

The algorithm is only one element in the ML chain

Page 66: Holistic approach to machine learning

Everything may be important for ML

Page 67: Holistic approach to machine learning
Page 68: Holistic approach to machine learning

Testing

Page 69: Holistic approach to machine learning

Test datasets

Page 70: Holistic approach to machine learning

60% 20% 20%
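The slide gives only the proportions. A common reading, assumed here, is 60% training, 20% validation and 20% test data. A minimal sketch of such a split in plain Scala follows; `SplitSketch.split` is a hypothetical helper.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffles the data with a fixed seed and cuts it into
  // 60% / 20% / 20% parts (assumed: training, validation, test).
  def split[A](data: Seq[A], seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    // Integer arithmetic avoids floating-point rounding at the cut points.
    val trainEnd = data.length * 6 / 10
    val validEnd = data.length * 8 / 10
    (shuffled.slice(0, trainEnd),
     shuffled.slice(trainEnd, validEnd),
     shuffled.drop(validEnd))
  }
}
```

A fixed seed keeps the split reproducible, so that results from different experiments are measured on the same test rows.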

Page 71: Holistic approach to machine learning

Andrew Ng's rule of ML

Page 72: Holistic approach to machine learning

Does it do well on the training data?
No: better features / better parameters
Yes: go to the next question

Does it do well on the test data?
No: more data
Yes: done!

by Andrew Ng
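The decision flow on this slide can be sketched as a small function. The score values and the `goodEnough` threshold are assumptions added for illustration; the slide itself only asks whether the model does well.

```scala
object NgRuleSketch {
  sealed trait Advice
  case object BetterFeaturesOrParameters extends Advice
  case object MoreData extends Advice
  case object Done extends Advice

  // trainScore and testScore are any "higher is better" metric, e.g. accuracy.
  // goodEnough is a hypothetical threshold chosen by the team.
  def nextStep(trainScore: Double, testScore: Double, goodEnough: Double): Advice =
    if (trainScore < goodEnough) BetterFeaturesOrParameters // poor on training data
    else if (testScore < goodEnough) MoreData               // good on training, poor on test
    else Done                                               // good on both
}
```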

Page 73: Holistic approach to machine learning

Calculate, measure, apply later

Page 74: Holistic approach to machine learning

The code

Page 75: Holistic approach to machine learning

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// `sc` is an existing SparkContext, for example the one provided by spark-shell.

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold so that predict returns raw scores.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model.
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")

Page 76: Holistic approach to machine learning

The art of asking the right questions about the right data

Page 77: Holistic approach to machine learning

@SrcMinistry

Thanks!

@MariuszGil

Page 78: Holistic approach to machine learning