Good Enough Analytics

Post on 26-Jan-2015

Presented @ Bigdata Singapore Meetup. Good Enough Analytics is a methodology I am working on to achieve decent analytical results at a reasonable cost. Warning: For the consumption of Data Nerds Only. For 99% of normal humans, these slides are snooze inducing =P.

Transcript of Good Enough Analytics

Good Enough Analytics by Kai Xin

The Good Enough Stuff: Analytical Tools

Analytical Tools are like spoons

Usefulness

Point of stupidity

What is stupid today, might not be stupid tomorrow

Good Enough Analytics: Big data analytics using cost-efficient tools

The Good Enough Stuff: Ensembles of good enough models

Point of stupidity: The perfect model

[Figure: a single large grid of digits]

A “perfect” model is too complex, too costly to build, too hard to maintain, and too inflexible to change.

“There are known knowns; there are things we know that we know.

There are known unknowns; there are things that we now know we don't know.

But there are also unknown unknowns; there are things we do not know we don't know.”

By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist

Why the perfect model is stupid

“In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models”

Good Enough Analytics: Ensembles

[Figure: the grid of digits split into three smaller blocks, joined by "+"]

[Figures from scholarpedia.org; refer to References]

The Serious Stuff…beyond theorycraft

Simple Ensembles – GLM Bootstrap aggregating (bagging)

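# Fit 1000 linear models, each on a 90% sample drawn with replacement, in parallel
predictions <- foreach(1:1000, .combine = cbind) %dopar% {
  training_positions <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
  # %in% turns the sampled positions into a logical mask, so repeated draws count once
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos, ])
  rxPredict(glmMod, test, type = "response")
}
# Average the 1000 prediction columns into one prediction per test row
result <- rowMeans(predictions)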
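The loop above needs RevoScaleR (rxLinMod, rxPredict) and a registered foreach backend. As a rough illustration of the same bagging idea, here is a minimal sketch in base R only. It assumes synthetic data, uses plain lm() in place of rxLinMod(), and runs sequentially; the helper name bag_lm and the column names are hypothetical.

```r
# Bagging sketch in base R: fit lm() on bootstrap samples, then average the predictions.
# Assumptions: synthetic data, lm() instead of rxLinMod(), no parallel backend.
set.seed(42)
n <- 200
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- 3 + 2 * train$x1 - train$x2 + rnorm(n, sd = 0.5)
test <- data.frame(x1 = rnorm(50), x2 = rnorm(50))

bag_lm <- function(formula, train, test, n_models = 100, frac = 0.9) {
  preds <- sapply(seq_len(n_models), function(i) {
    # Sample row indices with replacement; indexing by them keeps duplicate rows.
    idx <- sample(nrow(train), size = floor(nrow(train) * frac), replace = TRUE)
    fit <- lm(formula, data = train[idx, ])
    predict(fit, newdata = test)
  })
  rowMeans(preds)  # one averaged prediction per test row
}

result <- bag_lm(y ~ x1 + x2, train, test)
```

Indexing with the sampled positions directly (rather than a logical mask) keeps repeated rows in each sample, which is what a bootstrap sample usually means.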

Simple Ensembles – Gradient Boosting Machines

gbmMod <- gbm(eqn, train, n.trees = 10000, shrinkage = 0.002, distribution = "gaussian",
              interaction.depth = 7, bag.fraction = 0.9, n.minobsinnode = 50)

Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier.
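The gbm call above uses gradient boosting, a variant in which each new tree is fitted to the errors left by the trees before it, rather than to a resampled dataset. A minimal sketch of that loop in base R follows, assuming squared error, a single feature, and one-split "stumps" instead of full trees. All function names (fit_stump, predict_stump, boost) are hypothetical and this is not how gbm is implemented internally.

```r
# Gradient boosting sketch for squared error with one-split stumps on one feature.
set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)

fit_stump <- function(x, r) {
  # Try a grid of thresholds and keep the split with the lowest squared error.
  thresholds <- quantile(x, probs = seq(0.05, 0.95, by = 0.05), names = FALSE)
  best <- NULL
  for (thr in thresholds) {
    left <- mean(r[x <= thr])
    right <- mean(r[x > thr])
    sse <- sum((r - ifelse(x <= thr, left, right))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(thr = thr, left = left, right = right, sse = sse)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$thr, s$left, s$right)

boost <- function(x, y, n_trees = 200, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))            # start from the mean
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    stumps[[m]] <- fit_stump(x, y - pred)    # fit the current residuals
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
  }
  list(stumps = stumps, shrinkage = shrinkage, fitted = pred)
}

model <- boost(x, y)
mse_start <- mean((y - mean(y))^2)
mse_end <- mean((y - model$fitted)^2)
```

The shrinkage argument plays the same role as shrinkage in the gbm call: smaller values take smaller steps and usually need more trees.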

Simple Ensembles - Random Forest

rf <- foreach(ntree = rep(333, 3), .combine = combine, .packages = 'randomForest') %dopar%
  randomForest(train[, 3:length(train)], train$Act, ntree = ntree, do.trace = 1000,
               mtry = round(colNumber / 3), replace = FALSE, nodesize = 5, na.action = na.omit)

Ensemble of Ensembles

1. Mean(RF+GBM+BagGLM)

2. Median(RF+GBM+BagGLM)

3. 0.4*RF+0.4*GBM+0.2*BagGLM
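The three blending rules can be sketched in a few lines of base R. The prediction vectors below are made-up numbers standing in for the output of the random forest, GBM and bagged GLM models; only the 0.4/0.4/0.2 weights come from the slide.

```r
# Hypothetical predictions from three models for the same three test rows.
rf_pred  <- c(0.10, 0.40, 0.80)
gbm_pred <- c(0.20, 0.50, 0.70)
bag_pred <- c(0.60, 0.30, 0.90)
preds <- cbind(RF = rf_pred, GBM = gbm_pred, BagGLM = bag_pred)

blend_mean   <- rowMeans(preds)            # 1. simple average
blend_median <- apply(preds, 1, median)    # 2. median, less sensitive to one outlying model

weights <- c(RF = 0.4, GBM = 0.4, BagGLM = 0.2)
blend_weighted <- as.vector(preds %*% weights)   # 3. weighted average
```

In practice the weights would be picked on a hold-out set rather than by hand.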

Ensembles – Why it matters

Improve accuracy: Ensembles tend to yield better results than their constituent models when there is significant diversity among the models.

Developing multiple simple models is faster than attempting to develop the perfect model.

More resistant to overfitting: less reliant on any single model.

Concurrent development: Different models can be run and developed on different instances/machines by different data scientists.

Ensembles – point of stupidity

Netflix prize 1 million dollar winner: Ensemble of 107 models for 10% improvement. Too complicated, costly and inflexible to change.

Actual deployment: Ensemble of 2 models for 8.43% improvement.

Moral of story: Good Enough Ensemble is good enough.

Good Enough Analytics: Big data analytics using cost-efficient tools and good enough ensemble of models

The Good Enough Stuff: Data Optimization

Data cleaning vs Data optimization

Data cleaning: Important, but I assume you know it

Data optimization: Done AFTER data cleaning

Kaggle Medical Drug Competition

15 sets of data

Each data set: 1,000 to 2,000 attributes, 500 to 20,000 rows

Qn: Identify rogue drugs

Point of stupidity: Trying to run analysis on all attributes

Drug  Rogue %  Company  Color  Component 1  Component 2…2000
A     0.0400   XYZ      Red    200          30
B     0.0002   XYZ      Green  920          50
C     0.8000   XYZ      Blue   30           1000
D     ?        XYZ      Red    340          800

Not all attributes are born equal

Company: no variance

Color: irrelevant

Component 1…2000: too many attributes

Remove no variance / near zero variance attributes

Drug  Rogue %  Company
A     0.0400   XYZ
B     0.0002   XYZ
C     0.8000   XYZ
D     ?        XYZ

Company <- this attribute does not help in differentiating between the drugs

R code:
library(caret)
healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
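To make the filter less of a black box, here is a simplified base R sketch of the two statistics that caret's nearZeroVar is documented to use: the frequency ratio of the most common value to the second most common, and the percentage of unique values. The function near_zero_var is a hypothetical stand-in, not caret's implementation, and the tiny data frame reuses the example table from the slides.

```r
# Simplified near-zero-variance filter (an approximation of the caret rule).
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flag <- vapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) <= 1) return(TRUE)            # zero variance: one distinct value
    freq_ratio <- counts[[1]] / counts[[2]]          # most common vs second most common
    pct_unique <- 100 * length(counts) / length(col)
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  names(df)[flag]
}

drugs <- data.frame(
  Drug = c("A", "B", "C", "D"),
  Company = c("XYZ", "XYZ", "XYZ", "XYZ"),
  Color = c("Red", "Green", "Blue", "Red"),
  Component1 = c(200, 920, 30, 340),
  stringsAsFactors = FALSE
)

drop_cols <- near_zero_var(drugs)
drugs_reduced <- drugs[, setdiff(names(drugs), drop_cols), drop = FALSE]
```

On this table only Company is flagged, because every row has the same value.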

Remove not important attributes

Drug  Rogue %  Color
A     0.0400   Red
B     0.0002   Green
C     0.8000   Blue
D     ?        Red

Color <- this attribute has no relevance to % rogue drug

R code for Random Forest:
importanceScore <- importance(myMod)

R code for GBM:
importanceScore <- summary(myMod, n.trees = ntree)

Attribute reduction using Principal Component Analysis

Drug  Rogue %  Component 1  Component 2…2000
A     0.0400   200          30
B     0.0002   920          50
C     0.8000   30           1000
D     ?        340          800

Component 1…2000 <- too many attributes take very long to analyse

R code:
pc <- prcomp(train[, 2:length(train)], tol = 0.12)

Andrew Ng: Always try analysis without PCA first.

[Figure: points (X) plotted against Attribute 1 and Attribute 2, then projected onto a single line, the principal component]

The 1D red line and points are now representative of the 2D graph.

Source: Andrew Ng, Machine Learning Course (refer to References)
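The 2D-to-1D picture can be reproduced with prcomp on synthetic data. This is a small sketch, assuming two made-up, strongly correlated attributes (attr1, attr2). It also shows what the tol argument used earlier does: components whose standard deviation is at most tol times that of the first component are dropped.

```r
# Two correlated attributes: most of the variation lies along one direction.
set.seed(7)
attr1 <- rnorm(100)
attr2 <- 2 * attr1 + rnorm(100, sd = 0.1)
points2d <- data.frame(attr1, attr2)

pc <- prcomp(points2d, center = TRUE, scale. = FALSE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)   # share of variance per component

# 1D representation: coordinates along the first principal component.
points1d <- pc$x[, 1]

# Map the 1D coordinates back into 2D to see how little was lost.
centers <- matrix(pc$center, nrow = nrow(points2d), ncol = 2, byrow = TRUE)
approx2d <- outer(points1d, pc$rotation[, 1]) + centers

# With tol, prcomp drops the weak second component by itself.
pc_tol <- prcomp(points2d, tol = 0.12)
```

Here the first component carries nearly all the variance, so the 1D points are a reasonable stand-in for the 2D data. On real data it is worth checking var_explained before discarding components.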

Data Optimization – Why it matters

Performance Improvement (importance,nearZeroVar)

Cut down attributes that are useless or not “good enough”. More accurate and complex models can be built on the attributes that matter.

Cost Savings (PCA)

Less data needs to be processed, faster turnover for models and results.

Good Enough Analytics: Big data analytics using cost-efficient tools and good enough ensemble of models based on optimized data

The Good Enough Stuff: Scaling on cloud

Why use Cloud

How often do you really need a multimillion-dollar machine on standby 24/7 to churn data?

Do you really need real-time analytics, or is an hourly/daily/weekly/monthly report good enough?

Cloud – Why it matters

Excellent bang for the buck: <$5/hr to rent a million dollars' worth of power. No need to purchase/maintain hardware. Scale on demand.

Great for Ensemble Modeling: You can start multiple instances, each instance running one simple model, and ensemble them.

But beware of data security and privacy laws: Not suitable for all kinds of data/applications. For example, Amazon Web Services is HIPAA compliant but Rackspace is not.

Prepare data for the cloud

Name   Age  Income  Postal
Peter  23   $2,000  400573
Sally  11   $0      520028
Paul   70   $500    521201
Mark   30   $8,000  247392

Name   Age  Age Group  Income  Income Range   Postal  Postal Area
Peter  23   Youth      $2,000  $1,000-$3,000  400***  Eunos
Sally  11   Child      $0      $0             520***  Simei
Paul   70   Senior     $500    $1-$1,000      521***  Tampines
Mark   30   Adult      $8,000  >$5,000        247***  Tanglin

Name: Remove identity

Age Group: Use general category

Income Range: Use range category

Postal: Masking

Postal Area: Rollup

Reference: Dr. Yap Ghim Eng (A*Star)
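The five techniques can be sketched in base R on the example table. The age and income cut-offs below, the extra "$3,000-$5,000" bucket, and the prefix-to-area lookup are assumptions chosen only so that the output matches the slide; they are not taken from the original talk.

```r
people <- data.frame(
  Name = c("Peter", "Sally", "Paul", "Mark"),
  Age = c(23, 11, 70, 30),
  Income = c(2000, 0, 500, 8000),
  Postal = c("400573", "520028", "521201", "247392"),
  stringsAsFactors = FALSE
)

# Use general category: bucket ages (hypothetical cut-offs).
age_group <- cut(people$Age, breaks = c(0, 12, 25, 60, Inf),
                 labels = c("Child", "Youth", "Adult", "Senior"), right = FALSE)

# Use range category: bucket incomes (hypothetical cut-offs).
income_range <- cut(people$Income, breaks = c(-Inf, 0, 1000, 3000, 5000, Inf),
                    labels = c("$0", "$1-$1,000", "$1,000-$3,000",
                               "$3,000-$5,000", ">$5,000"))

# Masking: keep only the first three digits of the postal code.
prefix <- substr(people$Postal, 1, 3)
postal_masked <- paste0(prefix, "***")

# Rollup: map the prefix to an area (hypothetical lookup table).
area_lookup <- c("400" = "Eunos", "520" = "Simei", "521" = "Tampines", "247" = "Tanglin")
postal_area <- unname(area_lookup[prefix])

# Remove identity: the Name, exact Age, exact Income and full Postal are left out.
anon <- data.frame(AgeGroup = age_group, IncomeRange = income_range,
                   Postal = postal_masked, PostalArea = postal_area,
                   stringsAsFactors = FALSE)
```

Only the anon data frame would be uploaded; the original table stays on premises.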

Good Enough Analytics: Big data analytics using cost-efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services

The Good Enough Stuff…that we have no time for

Amazon Web Service

Basic code to set up Amazon instance for analytics

sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
sudo yum install -y lapack-devel blas-devel

wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xf R-2.15.2.tar.gz
cd R-2.15.2
./configure --with-x=no
sudo make
PATH=$PATH:~/R-2.15.2/bin/
cd ..

wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
tar -xzf numpy-1.6.2.tar.gz
cd numpy-1.6.2
sudo python setup.py install
cd ..

wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
tar -xzf scipy-0.11.0.tar.gz
cd scipy-0.11.0
sudo python setup.py install
cd ..

wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
tar -xzf nose-1.1.2.tar.gz
cd nose-1.1.2
sudo python setup.py install

After sudo-ing and running R, type:
install.packages('gbm')
install.packages('randomForest')

To leave R or Python jobs running while you are not logged on: "nohup R CMD BATCH myfile.r &"

Amazon EC2 Spot Instance

Cluster Compute Eight Extra Large: 60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet. $0.27 per hour.

High-Memory Quadruple Extra Large Instance: 68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform. $0.14 per hour.

Weakness of Spot Instance

Bidding system. If your bid < spot instance price, the instance will be terminated.

Solutions:

1) Put master on a normal cloud instance and slaves on spot instances

2) Heartbeat + Queue with Checkpoint
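The checkpoint part of the second solution can be sketched in base R. This is only an illustration under simple assumptions: tasks are a named list, results are saved with saveRDS after every finished task, and a restarted job skips whatever is already in the checkpoint file. The function run_with_checkpoint and the toy worker are hypothetical. The heartbeat and queue pieces are not shown.

```r
# Run tasks one by one, saving results after each so a terminated job can resume.
run_with_checkpoint <- function(tasks, worker, checkpoint_file) {
  results <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else list()
  for (id in names(tasks)) {
    if (!is.null(results[[id]])) next       # finished before the interruption
    results[[id]] <- worker(tasks[[id]])
    saveRDS(results, checkpoint_file)       # persist progress after every task
  }
  results
}

tasks <- list(a = 1:10, b = 11:20, c = 21:30)
checkpoint_file <- tempfile(fileext = ".rds")

calls <- 0
worker <- function(x) {
  calls <<- calls + 1    # count how often real work is done
  sum(x)
}

# Simulate a run that is terminated after the first task.
partial <- run_with_checkpoint(tasks[1], worker, checkpoint_file)

# "Restart": task a is read from the checkpoint, only b and c are computed.
full <- run_with_checkpoint(tasks, worker, checkpoint_file)
```

On a spot instance the checkpoint file would need to live on storage that survives termination, not on the instance's local disk.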

The Good Enough Stuff…that we have no time for

PCA with KNN

Principal Component Analysis - With K-Nearest Neighbor

library(FNN)
train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)

pc <- prcomp(train[, 2:length(train)], tol = 0.12)
mydata <- data.frame(label = train[, "label"], pc$x)
labels <- mydata[, 1]
mydata2 <- mydata[, -1]
test.p <- predict(pc, newdata = test)

results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm = "cover_tree")]
write(results, file = "knn_PCA.csv", ncolumns = 1)

The Good Enough Stuff…that we have no time for

Data Chunking

Data Chunking – Revolution R

Loosely based on NoSQL

Uses a format called XDF

The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows.

For more details, visit the RevR website

Data Chunking – Why it matters

# Chunk 6.5GB worth of data onto HDD in XDF
rxImport(inData = trainFile, outFile = "trainingData.xdf")

# revR created methods like rxGlm to run huge Poisson regression directly on the XDF file
myPos <- rxGlm(amount2 ~ Mailed + Donated + RR, data = "trainingData.xdf", family = poisson())

* This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory
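The chunking idea itself does not depend on XDF. A minimal sketch in base R, assuming a plain CSV file and a statistic that can be accumulated chunk by chunk (here a mean): only chunk_rows lines are held in memory at a time. The function chunked_mean and the small generated file are hypothetical stand-ins for a file that is too large to load.

```r
# Write a small CSV to stand in for a file that does not fit in memory.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(amount = 1:1000, donated = rep(c(0, 1), 500)),
          csv_path, row.names = FALSE)

chunked_mean <- function(path, column, chunk_rows = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once and strip the quotes write.csv adds.
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)   # next chunk of raw lines
    if (length(lines) == 0) break             # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

avg <- chunked_mean(csv_path, "amount")
```

Sums and counts combine across chunks trivially. Fitting a regression this way needs an algorithm designed for it, which is the part RevoScaleR's rx functions provide.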

Data Chunking – Speeding it up using SSD instead of normal HDD

RAM: Fast but expensive

SSD: ~4x faster than normal HDD when chunking

The Good Enough Stuff…that we have no time for

Multicore

Multicore Processing – Revolution R

library(foreach)
library(doSNOW)
cluster <- makeCluster(3, type = "SOCK")
registerDoSNOW(cluster)
setMKLthreads(1)

predictions <- foreach(1:1000, .combine = cbind) %dopar% {
  training_positions <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos, ])
  rxPredict(glmMod, test, type = "response")
}
result <- rowMeans(predictions)

Multicore Processing – Why it matters

License Cost (usually charged per CPU)

1 CPU with 4 cores = 1 single-user license

Distributed: 4 CPUs with 1 core each = 4 licenses or a group license

Performance Improvement

~2x performance for 3 cores vs 1 core
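For readers without Revolution R or doSNOW, a similar multicore run can be sketched with the parallel package that ships with R. The example assumes synthetic data, lm() as the simple model, and 2 worker processes; fit_one is a hypothetical helper, and the speed-up on a toy problem like this will not match the figure quoted above.

```r
library(parallel)

set.seed(123)
n <- 200
train <- data.frame(x = rnorm(n))
train$y <- 1 + 2 * train$x + rnorm(n, sd = 0.3)
test <- data.frame(x = seq(-1, 1, length.out = 20))

# One bagging iteration: bootstrap sample, fit, predict on the test rows.
fit_one <- function(i, train, test) {
  idx <- sample(nrow(train), size = floor(nrow(train) * 0.9), replace = TRUE)
  predict(lm(y ~ x, data = train[idx, ]), newdata = test)
}

cluster <- makeCluster(2)              # 2 local worker processes
clusterSetRNGStream(cluster, 123)      # reproducible random streams on the workers
preds <- parSapply(cluster, 1:50, fit_one, train = train, test = test)
stopCluster(cluster)

result <- rowMeans(preds)              # average over the 50 models
```

Each worker is a separate R process, so train and test are copied to every worker. With large data that copy can cost more than the parallelism saves.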

Visualization

Good Enough References

Random Forest
• Obtaining knowledge from a random forest
• Suggestions for speeding up Random Forests
• Random Forest with classes that are very unbalanced

GBM
• Define boosting
• Generalized Boosted Models: A guide to the gbm package
• What are some useful guidelines for GBM parameters?
• R gbm logistic regression
• How to win the KDD Cup Challenge with R and gbm

Ensembles
• Ensemble learning introduction
• Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
• Resources for learning how to implement ensemble methods
• Ensemble methods
• Intro to ensemble learning in R
• Predictive analytics & decision tree

Good Enough Analytics: Big data analytics using cost-efficient tools and good enough ensemble of models based on optimized data, scaled on cloud services

Qns? Email me @ thiakx@gmail.com | LinkedIn Profile | Kaggle Profile

Asia?

Photo Credits

• Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
• Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
• Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
• Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
• Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
• Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
• Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
• Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
• Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
• Slide 19-21: www.scholarpedia.org
• Slide 23/25: www.wikipedia.org
• Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
• Slide 63: www.kaggle.com