Intro to Machine Learning and H2O
Raymond Peck
Director of Product Engineering, H2O.ai
© H2O.ai, 2016
What Will We Cover?
• What is Machine Learning?
• Why H2O for Machine Learning?
• How-to
What is Machine Learning?
• "Field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959)
• "Unlike rules-based systems which require a human expert to hardcode domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone."
• learns the structure of the data
Types of Models
• predictive models
  • classifiers
    • binary
    • multinomial
  • regression (continuous values)
• clustering
• anomaly detection
Classification
• multi-class or binary classification
• ranking (e.g. Google Search results order)
• evaluate with classification error or AUC
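The two metrics named on this slide can be sketched in plain Python. This is a minimal illustration, not H2O's implementation: the function names, the 0.5 threshold, and the four sample rows are all assumptions made for the example. AUC is computed here by brute-force pairwise comparison, which is only practical for small data.

```python
def classification_error(y_true, y_pred):
    """Fraction of rows whose predicted label differs from the actual label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive row gets the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Classification error needs hard labels, so apply a threshold (0.5 assumed).
predicted = [1 if s >= 0.5 else 0 for s in scores]

error = classification_error(labels, predicted)
area = auc(labels, scores)
print(error, area)
```

Classification error depends on the chosen threshold, while AUC looks only at how the scores order the rows, which is why it also suits ranking problems.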
Regression
• predict a real-valued response (e.g. viral load, price)
• response can have Gaussian, Gamma, Poisson, etc. distributions
• evaluate with MSE or R^2
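As a small worked example of the two regression metrics, the sketch below computes MSE and R^2 from their textbook definitions. The function names and the sample values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted responses.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

error = mse(actual, predicted)
fit = r_squared(actual, predicted)
print(error, fit)
```

MSE is expressed in squared units of the response, so it is hard to compare across datasets; R^2 is unitless and equals 1.0 for a perfect fit.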
Clustering
• unsupervised learning (no training labels)
• partition the data; identify clusters or sub-populations
• evaluate with AIC, BIC or Total Sum of Squares
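One common reading of the sum-of-squares criterion is the total within-cluster sum of squares: the squared distance from each point to the centroid of its assigned cluster, summed over all points. The sketch below assumes that reading; the function names and the four sample points are made up for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def within_cluster_ss(points, assignments):
    """Sum of squared distances from each point to its cluster's centroid."""
    clusters = {}
    for point, label in zip(points, assignments):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sum((x - cx) ** 2 for x, cx in zip(p, c)) for p in members)
    return total


# Two obvious groups: one near x = 0, one near x = 10.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

good = within_cluster_ss(points, [0, 0, 1, 1])  # groups nearby points together
bad = within_cluster_ss(points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

A lower value means tighter clusters, but the value always shrinks as the number of clusters grows, which is one reason penalized criteria such as AIC and BIC are listed alongside it.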
Anomaly Detection
• unsupervised learning (no training labels)
• use an autoencoder to learn the structure of the data
• find the rows with the highest reconstruction error
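The ranking step can be sketched independently of the model. In the example below the autoencoder is replaced by a deliberately trivial stand-in that "reconstructs" every row as the column-wise mean of the data; that substitution, the function names, and the sample rows are assumptions for illustration. A trained autoencoder would be passed in as `reconstruct` instead.

```python
def reconstruction_error(row, reconstructed):
    """Mean squared difference between a row and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(row, reconstructed)) / len(row)


def rank_anomalies(rows, reconstruct):
    """Return (row_index, error) pairs, largest reconstruction error first."""
    errors = [(i, reconstruction_error(r, reconstruct(r))) for i, r in enumerate(rows)]
    return sorted(errors, key=lambda pair: pair[1], reverse=True)


# Made-up data: three similar rows and one that does not fit the pattern.
rows = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [8.0, -5.0]]

# Stand-in model (NOT an autoencoder): always returns the column means.
column_means = [sum(col) / len(rows) for col in zip(*rows)]


def reconstruct(row):
    return column_means


ranked = rank_anomalies(rows, reconstruct)
most_anomalous_index = ranked[0][0]
print(ranked)
```

Rows that the model reproduces poorly are the ones least like the structure it learned, so they sort to the top of the list.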
Some Use Cases
(flip to another presentation)
Algorithms
• GLM - Generalized Linear Models
• decision tree-based models
  • GBM - Gradient Boosting Machines
  • RF - Random Forest
• Deep Learning
• K-means
• many more
Why H2O?
• industry-leading model quality
  • world-record 99.1% accuracy on MNIST data
• world-record speed
  • 1B-row logistic regression in 5s
• horizontally scalable for big data
• accessible via multiple languages and point-and-click
• easy to take models to production
• open source and extensible
Scientific Advisory Council
Dr. Trevor Hastie
• John A. Overdeck Professor of Mathematics, Stanford University
• PhD in Statistics, Stanford University
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Co-author with John Chambers, Statistical Models in S
• Co-author, Generalized Additive Models
• 108,404 citations (via Google Scholar)
Scientific Advisory Council
Dr. Rob Tibshirani
• Professor of Statistics and Health Research and Policy, Stanford University
• PhD in Statistics, Stanford University
• COPSS Presidents' Award recipient
• Co-author, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
• Author, Regression Shrinkage and Selection via the Lasso
• Co-author, An Introduction to the Bootstrap
Scientific Advisory Council
Dr. Stephen Boyd
• Professor of Electrical Engineering and Computer Science, Stanford University
• PhD in Electrical Engineering and Computer Science, UC Berkeley
• Co-author, Convex Optimization
• Co-author, Linear Matrix Inequalities in System and Control Theory
• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Languages
• R
• Python
• Scala
• Java
• point and click (Flow)
• C# and Go via REST interface
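R
    library(h2o)

    # Assumes `data` is an H2OFrame that has already been loaded.
    # Draw one uniform random number per row and use it to split the data.
    r <- h2o.runif(data$Days, seed = 1234)
    train <- data[r < 0.6, ]
    test <- data[(r >= 0.6) & (r < 0.9), ]

    # Response column and predictor columns
    myY <- "bike_count"
    myX <- setdiff(names(train), myY)

    # Run GBM
    gbm <- h2o.gbm(x = myX, y = myY,
                   training_frame = train,
                   validation_frame = test,
                   ntrees = 500, max_depth = 6, learn_rate = 0.1)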
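Python
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    # Assumes `data` is an H2OFrame that has already been loaded.
    r = data['Days'].runif()  # Random UNIForm numbers, one per row
    train = data[r < 0.6]
    test = data[(0.6 <= r) & (r < 0.9)]

    # Predictors are all columns except the response; copy the name list first.
    bike_names_x = list(data.names)
    bike_names_x.remove("bikes")

    gbm0 = H2OGradientBoostingEstimator(ntrees=500, max_depth=6, learn_rate=0.1)
    gbm0.train(x=bike_names_x, y="bikes",
               training_frame=train, validation_frame=test)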
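The R and Python slides split the data by drawing one uniform random number per row and cutting at 0.6 and 0.9. The sketch below reproduces that logic with the standard library only, on row indices rather than an H2O frame. The function name and the row count are assumptions for illustration. On the slides the middle slice is named `test` but passed as the validation frame, and rows with r >= 0.9 are not used, presumably leaving them as a final holdout.

```python
import random


def split_rows(n_rows, seed=1234):
    """Assign each row index to train (r < 0.6), validation (0.6 <= r < 0.9),
    or holdout (r >= 0.9) using one uniform random draw per row."""
    rng = random.Random(seed)
    train, valid, holdout = [], [], []
    for i in range(n_rows):
        r = rng.random()  # uniform in [0, 1)
        if r < 0.6:
            train.append(i)
        elif r < 0.9:
            valid.append(i)
        else:
            holdout.append(i)
    return train, valid, holdout


train_idx, valid_idx, holdout_idx = split_rows(10_000)
print(len(train_idx), len(valid_idx), len(holdout_idx))
```

Because each row is assigned independently, the split sizes are only approximately 60/30/10; fixing the seed makes the assignment reproducible.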
Scala
    val (allDataFrame, w2vModel) = createH2OFrame(datafile)
    val frs = DemoUtils.splitFrame(allDataFrame,
      Array("train.hex", "valid.hex"), Array(0.8, 0.2))
    val (trainFrame, validFrame) =
      (h2oContext.asH2OFrame(frs(0)), h2oContext.asH2OFrame(frs(1)))

    val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)
Advanced Topics in Model Building
• hyperparameter search
  • Cartesian
  • random with metric-based early stopping
• ensembles
  • Super Learner
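The difference between the two search strategies can be sketched without any ML library. Cartesian search scores every combination in the grid; random search samples combinations and stops once the metric has not improved for a set number of consecutive models. Everything below is an assumption for illustration: the search space, the `toy_score` stand-in for "train a model and return its validation error", and the parameter names `max_models` and `stopping_rounds`. H2O's own stopping rule is more elaborate than this simplified version.

```python
import itertools
import random

# Hypothetical search space; the parameter names echo the GBM examples.
space = {
    "ntrees": [50, 100, 500],
    "max_depth": [3, 6, 9],
    "learn_rate": [0.01, 0.1, 0.3],
}


def toy_score(params):
    """Stand-in for training a model and returning its validation error
    (lower is better). Not a real model."""
    return (abs(params["max_depth"] - 6) / 10
            + abs(params["learn_rate"] - 0.1)
            + 50 / params["ntrees"])


def cartesian_search(space, score):
    """Score every combination in the grid and return the best one."""
    names = list(space)
    combos = (dict(zip(names, values))
              for values in itertools.product(*space.values()))
    return min(combos, key=score)


def random_search(space, score, max_models=20, stopping_rounds=5, seed=42):
    """Sample combinations at random; stop early after `stopping_rounds`
    consecutive models that fail to improve on the best score so far."""
    rng = random.Random(seed)
    best_params, best_score, stale = None, float("inf"), 0
    for _ in range(max_models):
        params = {name: rng.choice(values) for name, values in space.items()}
        s = score(params)
        if s < best_score:
            best_params, best_score, stale = params, s, 0
        else:
            stale += 1
            if stale >= stopping_rounds:
                break
    return best_params, best_score


best_grid = cartesian_search(space, toy_score)
best_random, best_random_score = random_search(space, toy_score)
print(best_grid, best_random, best_random_score)
```

Cartesian search is exhaustive, so its cost grows multiplicatively with every added parameter; random search trades the guarantee of finding the grid optimum for a bounded budget.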
Going to Production: Model Training
• automation of the training process
• Python, Scala, Java, R, C#, Go
Going to Production: predict()
• all H2O models can be output as POJOs (Plain Old Java Objects)
• POJOs stand alone: no H2O runtime is required
• can support other target languages if needed
POJO excerpt (the code is cut off on the slide):
    class Grid_GBM_arrhythmia_hex_model_safari_1465838195588_2_model_2_Tree_0_class_0 {
      static final double score0(double[] data) {
        double pred =
          (data[2 /* C4 */] <44.5f ?
            (data[261 /* C280 */] <1.5f ? -1.9671239f :
              (data[176 /* C192 */] <1.4980469f ? -3.1304572f :
                (data[251 /* C269 */] <24.328125f ? -4.4071236f : -3.780457f))) :
            (data[195 /* C212 */] <6.354785f ?
              (data[151 /* C167 */] <1.8351562f ?
                (data[223 /* C240 */] <0.8359375f ?
                  (data[225 /* C242 */] <7.9875f ?
                    (data[164 /* C180 */] <-0.034375f ?
                      (data[205 /* C222 */] <0.615f ?
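The generated POJO encodes one decision tree as nested ternary expressions: each test compares a single column against a threshold, and each leaf is that tree's contribution to the prediction. The sketch below transcribes into Python only the branch that is complete on the slide (`data[2] < 44.5`). The function name and the sample row are assumptions, and the other branch is left unimplemented because its code is truncated in the source.

```python
def score_left_subtree(data):
    """Score a row with the `data[2] < 44.5` branch of the tree shown above.

    Thresholds and leaf values are copied from the slide. The Java code uses
    float literals, so borderline values could compare slightly differently
    here, where Python uses doubles.
    """
    if not data[2] < 44.5:
        # The remainder of the tree is cut off in the source.
        raise NotImplementedError("right-hand branch is truncated on the slide")
    if data[261] < 1.5:
        return -1.9671239
    if data[176] < 1.4980469:
        return -3.1304572
    if data[251] < 24.328125:
        return -4.4071236
    return -3.780457


# Made-up input row, wide enough to cover the largest column index used (261).
row = [0.0] * 262
row[2] = 30.0    # takes the data[2] < 44.5 branch
row[261] = 2.0   # fails the data[261] < 1.5 test
row[176] = 1.0   # passes the data[176] < 1.4980469 test

leaf_value = score_left_subtree(row)
print(leaf_value)
```

Because scoring is nothing more than comparisons on an array of doubles, the same tree can be carried into other target languages with little effort.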
Where to learn more?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Intelligence H2O Booklets: https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/booklets