Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Ken Sanford and Fonda Ingram
July 25, 2016
Open Source or Closed Source
Trending Now?
2015 SAS vs. R Survey Results – Burtch Works
Trending Now?
• Buy vs Build
  o Businesses are: "Engineering/Technology/Innovation companies"
• Build yourself and potentially sell
• Companies are hesitant to make a LARGE upfront Capital Investment before they see proven value
Analytics POV Lifecycle
Discovery
• Determine Business Objective
• Determine Modeling Goal
Data Prep - Understanding
• Data Collection
• Data Exploration
• Data Quality
• Data Transformation
Model Building
• Build Models
• Model Assessment
Evaluation
• Model Performance
• Success criteria
Deployment
• Monitoring and maintenance
• Model Management
Why Open Source?
Why use Open Source?
• Reduce vendor dependency
  o Run the program for any purpose
  o Customize the program - use cutting edge analytics NOW
• Reduce cost
  o Freedom does not imply FREE
• Responsive and Competitive
  o Innovate in Real Time
  o Rebuild in-house expertise and regain control
Why use H2O?
• Capital Investment upfront is minimal
  o Download H2O – use it and continue to learn, once you mature we can help you
• Algorithms and Accuracy
  o Distributed implementation of cutting edge ML algorithms
• Building components that touch all facets of the Analytics POV Lifecycle
• Flexible API available in R, Python, Scala, REST/JSON
• Community driven
Why do customers hesitate?
• Difficult to convert all SAS software to open source and keep my sanity ..
• I have been a SAS programmer for years..
• What I have is working – why change..
• I need a throat to choke if something goes wrong..
• I like long product install times..
• No one gets fired for buying SAS..
What do I need to start?
• Migration Strategy
  o Analytical Tool? (R or Python..)
  o Analytics Platform? (Hadoop, S3, etc.)
• Start small and get your feet wet (with H2O)
• May need to create a hybrid environment
How to get started?
• Existing Use Case
  o Review data requirements
• Get your data into H2O
  o Select existing model to migrate
• Identify algorithms – start small
• Transition should be gradual
Language Translation
SAS               | H2O-R
dataset           | dataframe
observation       | row
variable          | column
BY-Group          | By function
if else           | h2o.ifelse
. (missing value) | NA (missing value)
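The vocabulary above can be tried out in a few lines. This is a minimal sketch, assuming a local H2O cluster can be started and using R's built-in iris data (so the column names are Sepal.Length, Species and so on, not the names on the later slides). Treating h2o.group_by as the counterpart of SAS BY-group processing is an assumption made for illustration; the long_sepal column is hypothetical.

```r
library(h2o)
h2o.init()

# SAS "dataset" -> H2O "dataframe" (an H2OFrame), built here from R's iris data.
hf <- as.h2o(iris)

# "observation" -> row, "variable" -> column
first_obs <- hf[1, ]
sepal_len <- hf[, "Sepal.Length"]

# IF/ELSE -> h2o.ifelse(condition, value_if_true, value_if_false)
# long_sepal is a hypothetical flag column added for illustration.
hf[, "long_sepal"] <- h2o.ifelse(hf[, "Sepal.Length"] > 5.8, 1, 0)

# "." (SAS missing) -> NA; count the missing values in one column
n_missing <- sum(is.na(hf[, "Sepal.Length"]))

# BY-group processing -> grouped aggregation (assumed counterpart: h2o.group_by)
by_species <- h2o.group_by(data = hf, by = "Species",
                           nrow("Species"), mean("Sepal.Length"))
print(by_species)
```

The grouped result is itself an H2OFrame with one row per species, so it can be sliced like any other frame.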
How to Import data?
Export SAS dataset to CSV file
proc export data=work.Wheaderdataset
    outfile='/folders/myfolders/wheader.csv'
    dbms=csv
    replace;
run;
Import to H2O
# load the H2O R package and start (or connect to) a local cluster
library(h2o)
h2o.init()
# read the CSV into an H2O dataframe and check the row count
h2odf <- h2o.importFile(path = "h2o/data/iris_wheader.csv")
stopifnot(nrow(h2odf) == 150)
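For a hybrid environment, data often has to move between a local R session and the H2O cluster. The following is a minimal sketch of that round trip, assuming a local cluster and using R's built-in iris data in place of the exported SAS CSV. Renaming the columns to sepal_len and so on is only an assumption, made to match the names used on the munging slides.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data stands in for the exported SAS CSV.
iris_local <- iris
colnames(iris_local) <- c("sepal_len", "sepal_wid", "petal_len", "petal_wid", "class")

# Upload the local data.frame to the cluster as an H2OFrame.
h2odf <- as.h2o(iris_local)

# Basic checks after loading: dimensions, column names, per-column summary.
stopifnot(nrow(h2odf) == 150, ncol(h2odf) == 5)
print(colnames(h2odf))
print(h2o.describe(h2odf))

# Pull a small slice back into R as an ordinary data.frame.
local_head <- as.data.frame(h2odf[1:10, ])
```

Only small slices should be pulled back with as.data.frame, since the result has to fit in the memory of the local R session.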
Munging – How to slice columns?
Slicing Columns in a SAS dataset
/* Slice 1 column SepalLength - keep or drop */
data iris2;
    set sashelp.iris;
    keep SepalLength;
run;
/* Slice all columns except Species – keep or drop */
data iris2;
    set sashelp.iris;
    keep PetalLength PetalWidth SepalLength SepalWidth;
run;
Slicing Columns in an H2O dataframe
# slice 1 column by name
c1_1 <- h2odf[, "sepal_len"]
# slice cols by vector of names
cols_1 <- h2odf[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]
Munging – How to slice rows?
Slicing Rows in a SAS dataset
/* Slicing obs 15 from a SAS dataset */
data subset1;
    set sashelp.iris (firstobs=15 obs=15);
run;
/* Slicing a range of obs from a SAS dataset */
data subset2;
    set sashelp.iris (firstobs=25 obs=49);
run;
Slicing Rows from an H2O dataframe
# slice 1 row by index
c1 <- h2odf[15,]
# slice a range of rows
c1_1 <- h2odf[25:49,]
Munging – How to slice rows?
Slicing Rows in a SAS dataset
/* Slicing with a value */
data subset3;
    set sashelp.iris;
    if SepalLength > 50;
run;
/* Filter out missing values from a SAS dataset */
data subset3;
    set sashelp.iris;
    if SepalLength = . then delete;
run;
Slicing Rows from an H2O dataframe
# slice with a boolean mask
mask <- h2odf[,"sepal_len"] < 4.4
cols <- h2odf[mask,]
# filter out missing values
mask <- is.na(h2odf[,"sepal_len"])
cols <- h2odf[!mask,]
Munging – How to replace values?
Replacing values in a SAS dataset
/* Replace a single numerical datum */
data iris;
    obsnum = 15;
    modify iris point=obsnum;
    SepalWidth = 2;
    replace;
    stop;
run;
/* Replace a whole column */
data iris;
    modify iris;
    SepalWidth = SepalWidth * 3;
    replace;
run;
Replacing values in an H2O dataframe
# replace a single numerical datum
h2odf[15,3] <- 2
# replace a whole column
h2odf[,1] <- 3*h2odf[,1]
Munging – How to replace values?
Replacing values in a SAS dataset
/* replacement with if */
data iris1;
    modify iris1;
    if SepalLength < 4.4 then SepalLength = 22;
    replace;
run;
/* Replace missing values with 0 */
data iris1;
    modify iris1;
    if SepalLength = . then SepalLength = 0;
    replace;
run;
Replacing values in an H2O dataframe
# replacement with ifelse
h2odf[,"sepal_len"] <- h2o.ifelse(h2odf[,"sepal_len"] < 4.4, 22, h2odf[,"sepal_len"])
# replace missing values with 0
h2odf[is.na(h2odf[,"sepal_len"]), "sepal_len"] <- 0
Algorithms on H2O
Supervised Learning
Statistical Analysis
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
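One of the supervised algorithms listed above can be tried end to end on a small frame. This is a minimal sketch, assuming a local cluster and R's built-in iris data; the 80/20 split, the seed and ntrees = 50 are arbitrary illustrative choices, not recommendations from the talk.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data as a stand-in for a migrated SAS dataset.
hf <- as.h2o(iris)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
response <- "Species"

# Hold out part of the data for evaluation (split sizes are approximate).
splits <- h2o.splitFrame(hf, ratios = 0.8, seed = 42)
train <- splits[[1]]
test <- splits[[2]]

# Gradient Boosting Machine; a categorical response gives a classification model.
gbm_fit <- h2o.gbm(x = predictors, y = response,
                   training_frame = train, ntrees = 50, seed = 42)

# Evaluate on the held-out rows and score them.
perf <- h2o.performance(gbm_fit, newdata = test)
print(h2o.confusionMatrix(perf))
pred <- h2o.predict(gbm_fit, newdata = test)
```

The other supervised functions in the H2O R package, such as h2o.glm, h2o.randomForest and h2o.deeplearning, take the same x, y and training_frame arguments, so the model call can be swapped without changing the rest of the sketch.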
Algorithms on H2O
Unsupervised Learning
Clustering
• K-means: Partitions observations into k clusters/groups of the same spatial size
Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
Anomaly Detection
• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning
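Two of the unsupervised algorithms can be sketched the same way. The example assumes a local cluster and R's built-in iris data; k = 3 clusters and k = 2 principal components are arbitrary illustrative choices.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data; only the numeric columns are used.
hf <- as.h2o(iris)
predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# K-means clustering with 3 clusters on standardized columns.
km <- h2o.kmeans(training_frame = hf, x = predictors, k = 3,
                 standardize = TRUE, seed = 42)
clusters <- h2o.predict(km, hf)   # one cluster assignment per row

# PCA keeping the first 2 components of the standardized columns.
pca <- h2o.prcomp(training_frame = hf, x = predictors, k = 2,
                  transform = "STANDARDIZE")
scores <- h2o.predict(pca, hf)    # one column per principal component
```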
Modeling techniques to help you analyze data (SAS, H2O and R)
Algorithm       | SAS                                            | R                   | H2O
GLM             | proc glm, proc reg, proc logistic, proc genmod | glmnet, lm          | h2o.glm
PCA             | proc princomp                                  | princomp            | h2o.prcomp
Factor Analysis | proc factor                                    | factanal, factor.pa |
SVD             | proc hptmine, proc hpsvm                       | svd                 | h2o.svd
Clustering      | proc fastclus, proc hpclus                     | kmeans              | h2o.kmeans
Random Forest   | proc hpforest (EM Node)                        | randomForest        | h2o.randomForest
Modeling techniques to help you analyze data (SAS, H2O and R)
Algorithm                                       | SAS                                                                      | R       | H2O
Gradient Boosting                               | proc arboretum                                                           | gbm     | h2o.gbm
Neural Networks                                 | proc hpneural (EM Node), autoneural (EM node), proc neural, proc dmneural | nnet    | h2o.deeplearning
Ensemble (Stacking)                             | proc ensemble (EM Node)                                                  |         | h2o.ensemble, h2o.metalearn, h2o.stack (in dev)
GLRM (Cluster Analysis, Recommendation Engines) | NA                                                                       | NA      | h2o.glrm
Kernel Density Estimation                       | proc kde                                                                 | density |
Modeling techniques to help you analyze data (SAS, H2O and R)
Algorithm                   | SAS                      | R       | H2O
Variable Clustering         | proc varclus             | varclus |
ARIMA                       | proc arima               | arima   |
Autoregressive Models       | proc autoreg             | ar      |
Correlation                 | proc corr                | cor     |
Survival Models             | proc phreg               | coxph   | Not currently available -- h2o.coxph
Linear Mixed Effects Models | proc mixed, proc glimmix | lme     |
Modeling techniques to help you analyze data (SAS, H2O and R)
Algorithm                 | SAS                         | R                                                  | H2O
Summary                   | proc summary, proc means    | summary, mean, median, max/min, quantile, variance |
Grouping/ Sort/ Rank      | proc sort, proc rank        | aggregate, ddply, order, data.table                |
Exploratory Data Analysis | proc univariate, proc hpbin | moments (package), hist, ecdf, qqnorm, pnorm       |
Modeling techniques to help you analyze data (SAS, H2O and R)
Algorithm | SAS               | R                                | H2O
Plots     | gplot, sgplot     | ggplot2, ggvis, rgl, htmlwidgets |
Sampling  | proc surveyselect | runif, sample                    | h2o.runif
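The sampling row can be illustrated with a random train/test split, roughly what proc surveyselect is often used for. This is a minimal sketch, assuming a local cluster and R's built-in iris data; the 0.8 threshold and the seed are arbitrary.

```r
library(h2o)
h2o.init()

# Assumption: R's built-in iris data as the frame to sample from.
hf <- as.h2o(iris)

# One uniform random number in [0, 1] per row of the frame.
r <- h2o.runif(hf, seed = 1234)

# Roughly 80% of the rows go to train, the rest to test.
train <- hf[r < 0.8, ]
test <- hf[r >= 0.8, ]
```

Because each row is assigned independently, the split sizes are approximate, not exact.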
Questions