Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Transcript of Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Page 1: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

CONFIDENTIAL

Ken Sanford and Fonda Ingram
July 25, 2016

Open Source or Closed Source

Page 2: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Trending Now?

2015 SAS vs. R Survey Results – Burtch Works

Page 3: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Trending Now?

• Buy vs Build
  o Businesses are: “Engineering/Technology/Innovation companies”
• Build yourself and potentially sell
• Companies are hesitant to make a LARGE upfront Capital Investment before they see proven value

Page 4: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Analytics POV Lifecycle

Discovery
• Determine Business Objective
• Determine Modeling Goal

Data Prep - Understanding
• Data Collection
• Data Exploration
• Data Quality
• Data Transformation

Model Building
• Build Models
• Model Assessment

Evaluation
• Model Performance
• Success criteria

Deployment
• Monitoring and maintenance
• Model Management

Page 5: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Why Open Source?

Page 6: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Why use Open Source?

• Reduce vendor dependency
  o Run the program for any purpose
  o Customize the program - use cutting edge analytics NOW
• Reduce cost
  o Freedom does not imply FREE
• Responsive and Competitive
  o Innovate in Real Time
  o Rebuild in-house expertise and regain control

Page 7: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Why use H2O?

• Capital Investment upfront is minimal
  o Download H2O – use it and continue to learn, once you mature we can help you
• Algorithms and Accuracy
  o Distributed implementation of cutting edge ML algorithms
• Building components that touch all facets of the Analytics POV Lifecycle
• Flexible API available in R, Python, Scala, REST/JSON
• Community driven

Page 8: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Why do customers hesitate?

• Difficult to convert all SAS software to open source and keep my sanity..
• I have been a SAS programmer for years..
• What I have is working – why change..
• I need a throat to choke if something goes wrong..
• I like long product install times..
• No one gets fired for buying SAS..

Page 9: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

What do I need to start?

• Migration Strategy
  o Analytical Tool? (R or Python..)
  o Analytics Platform? (Hadoop, S3, etc.)
• Start small and get your feet wet (with H2O)
• May need to create a hybrid environment

Page 10: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

How to get started?

• Existing Use Case
  o Review data requirements
• Get your data into H2O
  o Select existing model to migrate
• Identify algorithms – start small
• Transition should be gradual

Page 11: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Language Translation

SAS                     H2O-R
dataset                 dataframe
observation             row
variable                column
BY-Group                By function
if else                 h2o.ifelse
. (missing value)       NA (missing value)

Page 12: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

How to Import data?

Export SAS dataset to CSV file

proc export data=work.Wheaderdataset
    outfile='/folders/myfolders/wheader.csv'
    dbms=csv
    replace;
run;

Import to H2O

library(h2o)
h2o.init()
h2odf <- h2o.importFile(path = "h2o/data/iris_wheader.csv")
stopifnot(nrow(h2odf) == 150)

Page 13: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Munging – How to slice columns?

Slicing Columns in a SAS dataset

/* Slice 1 column SepalLength - keep or drop */
data iris2;
    set sashelp.iris;
    keep SepalLength;
run;

/* Slice all columns except Species – keep or drop */
data iris2;
    set sashelp.iris;
    keep PetalLength PetalWidth SepalLength SepalWidth;
run;

Slicing Columns in an H2O dataframe

# slice 1 column by name
c1_1 <- h2odf[, "sepal_len"]

# slice cols by vector of names
cols_1 <- h2odf[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]

Page 14: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Munging – How to slice rows?

Slicing Rows in a SAS dataset

/* Slicing obs 15 from a SAS dataset */
data subset1;
    set sashelp.iris (firstobs=15 obs=15);
run;

/* Slicing a range of obs from a SAS dataset */
data subset2;
    set sashelp.iris (firstobs=25 obs=49);
run;

Slicing Rows from an H2O dataframe

# slice 1 row by index
c1 <- h2odf[15,]

# slice a range of rows
c1_1 <- h2odf[25:49,]
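The row-range translation can be sanity-checked without an H2O cluster by trying the same bracket indexing on a plain R data.frame. The sketch below is only an analogue: it assumes R's built-in iris data stands in for the H2O frame, and the variable names are illustrative. In SAS, obs= names the last observation to read, so firstobs=25 obs=49 covers observations 25 through 49; R's 25:49 is likewise inclusive at both ends.

```r
# Base R analogue of the row slices above (assumption: the built-in
# iris data.frame stands in for the H2O frame, so no cluster is needed).
df <- iris

one_row   <- df[15, ]     # a single row, selected by its 1-based index
row_range <- df[25:49, ]  # rows 25 through 49, inclusive at both ends

# 25:49 spans 49 - 25 + 1 = 25 rows, the same count SAS reads
# with (firstobs=25 obs=49)
print(nrow(one_row))
print(nrow(row_range))
```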

Page 15: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Munging – How to slice rows?

Slicing Rows in a SAS dataset

/* Slicing with a value */
data subset3;
    set sashelp.iris;
    if SepalLength > 50;
run;

/* Filter out missing values from a SAS dataset */
data subset4;
    set sashelp.iris;
    if SepalLength = . then delete;
run;

Slicing Rows from an H2O dataframe

# slice with a boolean mask
mask <- h2odf[,"sepal_len"] < 4.4
cols <- h2odf[mask,]

# filter out missing values
mask <- is.na(h2odf[,"sepal_len"])
cols <- h2odf[!mask,]
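The two filters can be tried together on a local data.frame before running them against a cluster. This is a minimal sketch, assuming R's built-in iris data with two missing values injected purely for illustration; H2O frames may treat missing values in a mask differently, so read it as a base R analogue only. R's iris stores lengths in centimetres (sashelp.iris uses millimetres), which is why the threshold here is 4.4 rather than a value such as 50.

```r
# Base R analogue of the mask-based filters above (assumption: built-in
# iris data.frame; the two NA values are injected only for illustration).
df <- iris
df[c(3, 7), "Sepal.Length"] <- NA

# Drop rows with a missing Sepal.Length first. In base R, a logical
# index that contains NA returns all-NA rows, so this step comes first.
is_missing <- is.na(df[, "Sepal.Length"])
complete   <- df[!is_missing, ]

# Then keep only the rows below the threshold, using a boolean mask.
mask  <- complete[, "Sepal.Length"] < 4.4
short <- complete[mask, ]

print(nrow(complete))
print(short)
```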

Page 16: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Munging – How to replace values?

Replacing values in a SAS dataset

/* Replace a single numerical datum */
data iris;
    obsnum = 15;
    modify iris point=obsnum;
    SepalWidth = 2;
    replace;
    stop;
run;

/* Replace a whole column */
data iris;
    modify iris;
    SepalWidth = SepalWidth * 3;
    replace;
run;

Replacing values in an H2O dataframe

# replace a single numerical datum
h2odf[15,3] <- 2

# replace a whole column
h2odf[,1] <- 3*h2odf[,1]

Page 17: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Munging – How to replace values?

Replacing values in a SAS dataset

/* replacement with if */
data iris1;
    modify iris1;
    if SepalLength < 4.4 then SepalLength = 22;
    replace;
run;

/* Replace missing values with 0 */
data iris1;
    modify iris1;
    if SepalLength = . then SepalLength = 0;
    replace;
run;

Replacing values in an H2O dataframe

# replacement with ifelse
h2odf[,"sepal_len"] <- h2o.ifelse(h2odf[,"sepal_len"] < 4.4, 22, h2odf[,"sepal_len"])

# replace missing values with 0
h2odf[is.na(h2odf[,"sepal_len"]), "sepal_len"] <- 0
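The order of the two replacements matters. The sketch below applies them in the same order as the slide, using base R's ifelse on the built-in iris data.frame as a stand-in for h2o.ifelse on an H2O frame (an assumption made so it runs without a cluster; the injected NA values are illustrative). In base R the comparison on a missing value yields NA, so missing entries pass through the conditional step untouched and are set to 0 afterwards. Filling with 0 first would instead turn them into 22, because 0 < 4.4.

```r
# Base R analogue of the two replacements above (assumption: built-in
# iris data.frame; the two NA values are injected only for illustration).
df <- iris
df[c(3, 7), "Sepal.Length"] <- NA

# Step 1: conditional replacement. NA < 4.4 evaluates to NA, so
# missing entries stay missing after this step.
x <- df[, "Sepal.Length"]
df[, "Sepal.Length"] <- ifelse(x < 4.4, 22, x)

# Step 2: replace the remaining missing values with 0.
df[is.na(df[, "Sepal.Length"]), "Sepal.Length"] <- 0

print(table(df$Sepal.Length %in% c(0, 22)))
```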

Page 18: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Algorithms on H2O

Supervised Learning

Statistical Analysis
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes

Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks
• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Page 19: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Algorithms on H2O

Unsupervised Learning

Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size

Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Anomaly Detection
• Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning

Page 20: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm                   SAS                       R                   H2O

GLM                         proc glm                  glmnet              h2o.glm
                            proc reg                  lm
                            proc logistic
                            proc genmod

PCA                         proc princomp             princomp            h2o.prcomp

Factor Analysis             proc factor               factanal
                                                      factor.pa

SVD                         proc hptmine              svd                 h2o.svd
                            proc hpsvm

Clustering                  proc fastclus             kmeans              h2o.kmeans
                            proc hpclus

Random Forest               proc hpforest (EM Node)   randomForest        h2o.randomForest

Page 21: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm                   SAS                       R                   H2O

Gradient Boosting           proc arboretum            gbm                 h2o.gbm

Neural Networks             proc hpneural (EM Node)   nnet                h2o.deeplearning
                            autoneural (EM node)
                            proc neural
                            proc dmneural

Ensemble (Stacking)         proc ensemble (EM Node)                       h2o.ensemble
                                                                          h2o.metalearn
                                                                          h2o.stack (in dev)

GLRM (Cluster Analysis,     NA                        NA                  h2o.glrm
Recommendation Engines)

Kernel Density Estimation   proc kde                  density

Page 22: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm                   SAS                       R                   H2O

Variable Clustering         proc varclus              varclus

ARIMA                       proc arima                arima

Autoregressive Models       proc autoreg              ar

Correlation                 proc corr                 cor

Survival Models             proc phreg                coxph               Not currently available -- h2o.coxph

Linear Mixed Effects Models proc mixed                lme
                            glimmix

Page 23: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm                   SAS                       R                   H2O

Summary                     proc summary              summary
                            proc means                mean
                                                      median
                                                      max/min
                                                      quantile
                                                      variance

Grouping/Sort/Rank          proc sort                 aggregate
                            proc rank                 ddply
                                                      order
                                                      data.table

Exploratory Data Analysis   proc univariate           moments (package)
                            proc hpbin                hist
                                                      ecdf
                                                      qqnorm
                                                      pnorm
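The R column above names summary, aggregate and order as counterparts to proc summary/proc means and proc sort. A minimal sketch of how those three calls might be used, assuming R's built-in iris data.frame (its column names differ from sashelp.iris) and illustrative variable names:

```r
# Base R counterparts to proc means / proc sort style steps
# (assumption: built-in iris data.frame; variable names are illustrative).

# Descriptive statistics for one column, similar in spirit to proc means.
stats <- summary(iris$Sepal.Length)

# Mean per group, similar in spirit to a BY-group or CLASS statement.
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Sort rows by a column in descending order, similar to proc sort.
sorted <- iris[order(-iris$Sepal.Length), ]

print(stats)
print(group_means)
print(head(sorted, 3))
```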

Page 24: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Modeling techniques to help you analyze data (SAS, H2O and R)

Algorithm                   SAS                       R                   H2O

Plots                       gplot                     ggplot2
                            sgplot                    ggvis
                                                      rgl
                                                      htmlwidgets

Sampling                    proc surveyselect         runif               h2o.runif
                                                      sample
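A common use of the runif entry in the Sampling row is a random train/test split: draw one uniform number per row and threshold it. The sketch below shows the idea in base R on the built-in iris data.frame. The 80/20 ratio, the seed and the variable names are assumptions chosen for illustration, and a split on an H2O frame with h2o.runif would presumably follow the same pattern rather than this exact code.

```r
# Random train/test split with runif (assumptions: built-in iris
# data.frame, an 80/20 ratio, and an arbitrary seed for repeatability).
set.seed(42)

u <- runif(nrow(iris))   # one uniform draw in [0, 1) per row

train <- iris[u <  0.8, ]  # roughly 80% of the rows
test  <- iris[u >= 0.8, ]  # the remaining rows

# Every row lands in exactly one of the two sets.
print(c(train = nrow(train), test = nrow(test)))
```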

Page 25: Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford

Questions