Holistic approach to machine learning
@SrcMinistry @MariuszGil
Data processing
We are developers
We love to…
Write code
Write tests
Use DDD/OOP/AOP/SOLID/GRASP/XYZ
What for?
Write code
Make money
Make users happy
Solve problems
Solve problems by writing code, to make users happy and make money
Mapping all problems to DDD/OOP/AOP/SOLID/GRASP/XYZ
Test first
Understand the problem first
Domain knowledge
Ask expert
Real problems
Data classification
Bot detection
Minimize risk of error
+ value estimator
+ chance of sale
+ $ optimization
Tens of thousands of historical transactions
Tens of data components
Hundreds of data components
IF-unsolvable
Machine Learning
The theory
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
Tom M. Mitchell
Task
Typical ML techniques
Classification
Regression
Clustering
Dimensionality reduction
Association learning
[Figure: three scatter plots of data points; axes: feature 1, feature 2]
Experience
Typical ML paradigms
Supervised learning
Unsupervised learning
Reinforcement learning
Accuracy
The practice
data + algo = result
+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+
…
Learning Data -> Learning Algorithm -> Classifier Model
Real Data -> Classifier Model -> Classification
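The flow on this slide (learning data goes into a learning algorithm, which yields a model that then classifies real data) can be sketched in a few lines of plain Scala. This is only an illustration, not the method used in the talk: the 1-nearest-neighbour rule, the names `Example`, `NearestNeighbourModel` and `train`, and the bot-detection numbers are all assumptions made up for the example.

```scala
object LearnThenPredict {
  // A labelled example: numeric features plus the known answer.
  final case class Example(features: Vector[Double], label: String)

  // The "model" produced by learning; here it simply remembers the examples
  // and answers with the label of the closest one (1-nearest neighbour).
  final case class NearestNeighbourModel(examples: Vector[Example]) {
    def predict(features: Vector[Double]): String =
      examples.minBy(e => squaredDistance(e.features, features)).label
  }

  def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // The "learning algorithm": learning data in, model out.
  def train(learningData: Vector[Example]): NearestNeighbourModel = {
    require(learningData.nonEmpty, "need at least one example")
    NearestNeighbourModel(learningData)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical learning data: (requests per minute, pages per session).
    val learningData = Vector(
      Example(Vector(2.0, 5.0), "human"),
      Example(Vector(3.0, 8.0), "human"),
      Example(Vector(90.0, 1.0), "bot"),
      Example(Vector(120.0, 2.0), "bot")
    )
    val model = train(learningData)

    // "Real data" the model has not seen before.
    println(model.predict(Vector(100.0, 1.0))) // prints bot
    println(model.predict(Vector(4.0, 6.0)))   // prints human
  }
}
```

The point of the sketch is the shape of the API: `train` is called once on historical data, `predict` is called many times on new data.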
Failure recipe
+-------+--------+------+---------+---------+-------+
| brand | model  | year | mileage | service | price |
+-------+--------+------+---------+---------+-------+
| ford  | mondeo | 2005 | 123000  | 9900    | 67000 |
| ford  | mondeo | 2005 | 175000  | 9900    | 30000 |
| ford  | focus  | 2010 | 45000   | 6700    | 30000 |
+-------+--------+------+---------+---------+-------+
…
+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
…
+-------+--------+------+---------+---------+--------+-------+
| brand | model  | year | mileage | service | repair | price |
+-------+--------+------+---------+---------+--------+-------+
| ford  | mondeo | 2005 | 123000  | 9000    | 900    | 67000 |
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 30000 |
| ford  | mondeo | 2005 | 175000  | 900     | 9000   | 45000 |
| ford  | focus  | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+------+---------+---------+--------+-------+
…
+-------+--------+-----+------+---------+---------+--------+-------+
| brand | model  | gen | year | mileage | service | repair | price |
+-------+--------+-----+------+---------+---------+--------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 67000 |
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 30000 |
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 45000 |
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 30000 |
+-------+--------+-----+------+---------+---------+--------+-------+
…
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| brand | model  | gen | year | mileage | service | repair | igla | crying German | price |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
| ford  | mondeo | 4   | 2005 | 123000  | 9000    | 900    | 0    | 0             | 67000 |
| ford  | mondeo | 3   | 2005 | 175000  | 900     | 9000   | 1    | 1             | 30000 |
| ford  | mondeo | 4   | 2005 | 175000  | 900     | 9000   | 0    | 0             | 45000 |
| ford  | focus  | 4   | 2010 | 45000   | 3700    | 3000   | 1    | 0             | 30000 |
+-------+--------+-----+------+---------+---------+--------+------+---------------+-------+
…
Understand your data first
Exploratory analysis
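A first exploratory pass can be as simple as summarising a column and looking for rows that contradict each other. The sketch below uses the four example rows from the car table (before the "gen" column is added); the `Car` case class and the helper names `summary` and `conflictingRows` are hypothetical, chosen only for this illustration.

```scala
object ExploreCars {
  final case class Car(brand: String, model: String, year: Int,
                       mileage: Int, service: Int, repair: Int, price: Int)

  // Rows copied from the example table in the slides.
  val cars: Vector[Car] = Vector(
    Car("ford", "mondeo", 2005, 123000, 9000, 900, 67000),
    Car("ford", "mondeo", 2005, 175000, 900, 9000, 30000),
    Car("ford", "mondeo", 2005, 175000, 900, 9000, 45000),
    Car("ford", "focus", 2010, 45000, 3700, 3000, 30000)
  )

  // Minimum, mean and maximum of a numeric column.
  def summary(values: Seq[Int]): (Int, Double, Int) = {
    require(values.nonEmpty, "column is empty")
    (values.min, values.sum.toDouble / values.size, values.max)
  }

  // Rows whose features are identical but whose prices differ.
  // The key is the row with its price blanked out.
  def conflictingRows(rows: Seq[Car]): Map[Car, Set[Int]] =
    rows.groupBy(_.copy(price = 0))
      .map { case (key, group) => key -> group.map(_.price).toSet }
      .filter { case (_, prices) => prices.size > 1 }

  def main(args: Array[String]): Unit = {
    println(summary(cars.map(_.price)))
    conflictingRows(cars).foreach { case (key, prices) =>
      println(s"${key.brand} ${key.model} ${key.year}, ${key.mileage} km: prices $prices")
    }
  }
}
```

Two Mondeos with identical features and prices of 30000 and 45000 are exactly the kind of finding that suggests a feature is missing, which is what the added "gen" column addresses in the slides.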
http://blogs.adobe.com/digitalmarketing/wp-content/uploads/2013/08/aq2.jpg
ML pipeline
Raw Data Collection
Pre-processing: Missing Data, Feature Extraction
Sampling: Split into Training Dataset and Test Dataset
Pre-processing: Feature Selection, Feature Scaling, Dimensionality Reduction
Algorithm Training: Model Selection, Cross Validation, Performance Metrics, Optimization
Post-processing: Final Model Evaluation
Final model: Data, Pre-processing, Classification
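Two of the pre-processing steps named in the pipeline, handling missing data and feature scaling, are easy to show in isolation. The sketch below is one possible approach under stated assumptions: mean imputation and min-max scaling are picked here for illustration (the slides do not say which variants were used), and the names `imputeMean` and `minMaxScale` are hypothetical.

```scala
object PreProcess {
  // Missing data: replace each missing value (None) with the mean
  // of the values that were actually observed in that column.
  def imputeMean(column: Vector[Option[Double]]): Vector[Double] = {
    val known = column.flatten
    require(known.nonEmpty, "column has no observed values")
    val mean = known.sum / known.size
    column.map(_.getOrElse(mean))
  }

  // Feature scaling: map a column linearly onto [0, 1].
  // A constant column carries no information and is mapped to 0.0.
  def minMaxScale(column: Vector[Double]): Vector[Double] = {
    require(column.nonEmpty, "column is empty")
    val lo = column.min
    val hi = column.max
    if (hi == lo) column.map(_ => 0.0)
    else column.map(v => (v - lo) / (hi - lo))
  }

  def main(args: Array[String]): Unit = {
    // Mileage values from the example table, with one made-up gap.
    val mileage = Vector(Some(123000.0), None, Some(175000.0), Some(45000.0))
    val filled = imputeMean(mileage)
    println(filled)
    println(minMaxScale(filled))
  }
}
```

In a real pipeline the mean, minimum and maximum would be computed on the training dataset only and then reused for the test dataset, so that no information leaks from test to training.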
Classification algorithms
Linear Classification: Logistic Regression, Linear Discriminant Analysis, PLS Discriminant Analysis
Non-Linear Classification: Mixture Discriminant Analysis, Quadratic Discriminant Analysis, Regularized Discriminant Analysis, Neural Networks, Flexible Discriminant Analysis, Support Vector Machines, k-Nearest Neighbor, Naive Bayes
Decision Trees for Classification: Classification and Regression Trees, C4.5, PART, Bagging CART, Random Forest, Gradient Boosted Machines, Boosted C5.0
Regression algorithms
Linear Regression: Ordinary Least Squares Regression, Stepwise Linear Regression, Principal Component Regression, Partial Least Squares Regression
Non-Linear Regression / Penalized Regression: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), ElasticNet, Multivariate Adaptive Regression Splines, Support Vector Machines, k-Nearest Neighbor, Neural Network
Decision Trees for Regression: Classification and Regression Trees, Conditional Decision Tree, Rule System, Bagging CART, Random Forest, Gradient Boosted Machine, Cubist
The algorithm is only one element in the ML chain
Everything may be important for ML
Testing
Test datasets
60% 20% 20%
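A three-way split in these proportions can be sketched in plain Scala as follows. The slide does not name the three parts, so treating them as training, validation and test sets is an assumption, and `Split` and `split` are hypothetical names used only for this illustration.

```scala
import scala.util.Random

object SplitData {
  final case class Split[A](training: Vector[A],
                            validation: Vector[A],
                            test: Vector[A])

  // Shuffle with a fixed seed (so the split is reproducible),
  // then cut the shuffled data at 60% and 80%.
  def split[A](data: Vector[A], seed: Long): Split[A] = {
    val shuffled = new Random(seed).shuffle(data)
    // Integer arithmetic avoids floating-point rounding at the cut points.
    val trainingEnd = data.size * 6 / 10
    val validationEnd = data.size * 8 / 10
    Split(
      shuffled.slice(0, trainingEnd),
      shuffled.slice(trainingEnd, validationEnd),
      shuffled.slice(validationEnd, data.size)
    )
  }

  def main(args: Array[String]): Unit = {
    val parts = split((1 to 10).toVector, seed = 11L)
    println(s"training:   ${parts.training}")
    println(s"validation: ${parts.validation}")
    println(s"test:       ${parts.test}")
  }
}
```

Every record ends up in exactly one part, which is what keeps the later measurements honest: parameters are tuned against the validation part and the test part is touched only once, at the end.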
Andrew Ng's rule of ML
Does it do well on the training data?
No: Better features / Better parameters
Yes: Does it do well on the test data?
No: More data
Yes: Done!
by Andrew Ng
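The flowchart can be written down as a small decision function. This is only a sketch: "does well" is reduced here to a single accuracy threshold, `goodEnough`, which is an assumption of this example and not part of the original rule, and the names `Advice` and `advise` are hypothetical.

```scala
object NextStep {
  sealed trait Advice
  case object BetterFeaturesOrParameters extends Advice
  case object MoreData extends Advice
  case object Done extends Advice

  // Check the training data first; only when the model does well
  // there is it worth looking at the test data.
  def advise(trainingAccuracy: Double,
             testAccuracy: Double,
             goodEnough: Double): Advice =
    if (trainingAccuracy < goodEnough) BetterFeaturesOrParameters
    else if (testAccuracy < goodEnough) MoreData
    else Done

  def main(args: Array[String]): Unit = {
    println(advise(trainingAccuracy = 0.70, testAccuracy = 0.65, goodEnough = 0.90))
    println(advise(trainingAccuracy = 0.97, testAccuracy = 0.75, goodEnough = 0.90))
    println(advise(trainingAccuracy = 0.97, testAccuracy = 0.93, goodEnough = 0.90))
  }
}
```

The order of the checks matters: a model that already fails on its own training data will not be rescued by more data, so features and parameters come first.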
Calculate, measure, apply later
The code
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")
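To make the "Area under ROC" number less of a black box: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as one half. The plain-Scala sketch below computes it that way on a handful of made-up (score, label) pairs. It is an illustration of the metric, not a description of how MLlib computes it internally, and the name `auc` is hypothetical.

```scala
object AreaUnderRoc {
  // scoreAndLabels: (raw score, label), where label 1.0 marks a positive
  // example and 0.0 a negative one, as in the Spark example.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val positives = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
    val negatives = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
    require(positives.nonEmpty && negatives.nonEmpty, "need both classes")

    // Compare every positive with every negative:
    // 1.0 if ranked correctly, 0.5 on a tie, 0.0 otherwise.
    val wins = for {
      p <- positives
      n <- negatives
    } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0

    wins.sum / (positives.size.toLong * negatives.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up scores: three of the four positive/negative pairs are ordered correctly.
    val scoreAndLabels = Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.3, 0.0))
    println("Area under ROC = " + auc(scoreAndLabels)) // prints Area under ROC = 0.75
  }
}
```

A value of 0.5 means the scores rank positives no better than chance and 1.0 means every positive is ranked above every negative. The pairwise comparison is quadratic in the number of examples, so it suits small illustrations only.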
The art of asking the right questions about the right data
@SrcMinistry
Thanks!
@MariuszGil