Intro to Machine Learning and H2O

Raymond Peck

Director of Product Engineering, H2O.ai

© H2O.ai, 2016

What Will We Cover?

• What is Machine Learning?

• Why H2O for Machine Learning?

• How-to


What is Machine Learning?

• "Field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959)

• "Unlike rules-based systems which require a human expert to hardcode domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone."

• learns the structure of the data

Types of Models

• predictive models

• classifiers

• binary

• multinomial

• regression (continuous values)

• clustering

• anomaly detection

Classification

• multi-class or binary classification

• ranking (e.g. Google Search results order)

• evaluate with Classification Error or AUC
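
Both metrics can be computed by hand on a small example. The following is a minimal sketch in plain Python, not H2O's implementation; the labels, scores, and the 0.5 threshold are made up for illustration.

```python
def classification_error(y_true, y_pred):
    """Fraction of rows whose predicted label differs from the actual label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]        # hypothetical predicted probabilities
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

err = classification_error(labels, preds)  # 2 of 6 rows are misclassified
area = auc(labels, scores)                 # 6 of 9 positive/negative pairs are ordered correctly
```

Classification error depends on the chosen threshold, while AUC depends only on how the scores rank the rows.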


Regression

• predict a real-valued response (e.g. viral load, price)

• response can have Gaussian, Gamma, Poisson, etc. distributions

• evaluate with MSE or R^2
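
As a small worked example, MSE and R^2 can be computed directly from their definitions. This is a sketch in plain Python with made-up numbers, not H2O's metric code.

```python
def mse(actual, predicted):
    """Mean squared error: average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]      # hypothetical responses
predicted = [2.5, 5.5, 7.0, 8.0]   # hypothetical model predictions

error = mse(actual, predicted)     # residual squares sum to 1.5 over 4 rows
fit = r_squared(actual, predicted) # 1 - 1.5 / 20
```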


Clustering

• unsupervised learning (no training labels)

• partition the data; identify clusters or sub-populations

• evaluate with AIC, BIC or Total Sum of Squares
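
A sum-of-squares evaluation can be sketched as follows. The example assumes the slide refers to comparing the total within-cluster sum of squares against the total sum of squares around the overall mean; the points and cluster assignments are made up.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def total_within_ss(points, assignments):
    """Sum of squares of each cluster around its own centroid, added up."""
    clusters = {}
    for p, k in zip(points, assignments):
        clusters.setdefault(k, []).append(p)
    return sum(sum_of_squares(members, centroid(members))
               for members in clusters.values())


points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]  # hypothetical data
assignments = [0, 0, 1, 1]                                  # hypothetical cluster labels

within = total_within_ss(points, assignments)     # spread left inside the clusters
total = sum_of_squares(points, centroid(points))  # spread with no clustering at all
```

A partition that explains the data well leaves `within` small relative to `total`.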


Anomaly Detection

• unsupervised learning (no training labels)

• use autoencoder to learn the structure of the data

• find rows with highest reconstruction error
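
The ranking step can be sketched without any particular model: given each row and the autoencoder's reconstruction of it, score the row by how badly it was reproduced. The reconstructions below are made-up stand-ins for real autoencoder output.

```python
def reconstruction_mse(row, reconstructed):
    """Mean squared difference between a row and its reconstruction."""
    return sum((x - r) ** 2 for x, r in zip(row, reconstructed)) / len(row)


def most_anomalous(rows, reconstructions, top_n):
    """Indices of the top_n rows with the largest reconstruction error."""
    errors = [(reconstruction_mse(x, r), i)
              for i, (x, r) in enumerate(zip(rows, reconstructions))]
    errors.sort(reverse=True)
    return [i for _, i in errors[:top_n]]


rows = [[1.0, 2.0], [1.1, 2.1], [8.0, -3.0], [0.9, 1.9]]
# Hypothetical autoencoder outputs: the model has learned that rows look like (1, 2)
reconstructions = [[1.0, 2.0], [1.0, 2.0], [1.5, 1.5], [1.0, 2.0]]

top = most_anomalous(rows, reconstructions, top_n=1)  # row 2 is reproduced worst
```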


Some Use Cases

(flip to another presentation)


Algorithms

• GLM - Generalized Linear Models

• decision tree-based models

• GBM - Gradient Boosting Machines

• RF - Random Forest

• Deep Learning

• K-means

• many more

Why H2O?

• industry-leading model quality

• World record 99.1% accuracy on MNIST data

• world record speed

• 1B row logistic regression in 5s

• horizontally scalable for big data

• accessible via multiple languages and point-and-click

• easy to take models to production

• open source and extensible


Scientific Advisory Council

Dr. Trevor Hastie

• John A. Overdeck Professor of Mathematics, Stanford University

• PhD in Statistics, Stanford University

• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining

• Co-author with John Chambers, Statistical Models in S

• Co-author, Generalized Additive Models

• 108,404 citations (via Google Scholar)

Scientific Advisory Council

Dr. Rob Tibshirani

• Professor of Statistics and Health Research and Policy, Stanford University

• PhD in Statistics, Stanford University

• COPSS Presidents’ Award recipient

• Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining

• Author, Regression Shrinkage and Selection via the Lasso

• Co-author, An Introduction to the Bootstrap


Scientific Advisory Council

Dr. Stephen Boyd

• Professor of Electrical Engineering and Computer Science, Stanford University

• PhD in Electrical Engineering and Computer Science, UC Berkeley

• Co-author, Convex Optimization

• Co-author, Linear Matrix Inequalities in System and Control Theory

• Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers


Languages

• R

• Python

• Scala

• Java

• point and click (Flow)

• C# and Go via REST interface


R

r <- h2o.runif(data$Days, seed = 1234)
train <- data[r < 0.6, ]
test <- data[(r >= 0.6) & (r < 0.9), ]

myY <- "bike_count"
myX <- setdiff(names(train), myY)

# Run GBM
gbm <- h2o.gbm(x = myX, y = myY,
               training_frame = train, validation_frame = test,
               ntrees = 500, max_depth = 6, learn_rate = 0.1)


Python

from h2o.estimators.gbm import H2OGradientBoostingEstimator

r = data['Days'].runif()   # Random UNIForm numbers, one per row
train = data[r < 0.6]
test = data[(0.6 <= r) & (r < 0.9)]
bike_names_x = data.names
bike_names_x.remove("bikes")

gbm0 = H2OGradientBoostingEstimator(ntrees=500, max_depth=6, learn_rate=0.1)

gbm0.train(x=bike_names_x, y="bikes", training_frame=train, validation_frame=test)


Scala

val (allDataFrame, w2vModel) = createH2OFrame(datafile)
val frs = DemoUtils.splitFrame(allDataFrame, Array("train.hex", "valid.hex"), Array(0.8, 0.2))
val (trainFrame, validFrame) = (h2oContext.asH2OFrame(frs(0)), h2oContext.asH2OFrame(frs(1)))

val gbmModel = GBMModel(trainFrame, validFrame, "category", modelName, ntrees = 50)


Advanced Topics in Model Building

• hyperparameter search

• Cartesian

• random with metric-based early stopping

• ensembles

• Super Learner
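
The two search strategies can be contrasted in a few lines. This is a simplified sketch in plain Python, not H2O's grid search API. The parameter grid, `toy_score`, and the stopping rule ("stop after `stopping_rounds` models with no improvement") are all assumptions made for illustration.

```python
import itertools
import random


def cartesian_search(space, score):
    """Evaluate every combination in the grid; return (best_score, best_params)."""
    names = list(space)
    best = None
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        s = score(params)
        if best is None or s < best[0]:
            best = (s, params)
    return best


def random_search(space, score, max_models, stopping_rounds, seed=0):
    """Sample combinations at random (with replacement).

    Stops at max_models, or earlier once the best score has not improved
    for stopping_rounds consecutive models.
    """
    rng = random.Random(seed)
    best = None
    since_improvement = 0
    tried = 0
    while tried < max_models and since_improvement < stopping_rounds:
        params = {n: rng.choice(v) for n, v in space.items()}
        s = score(params)
        tried += 1
        if best is None or s < best[0]:
            best = (s, params)
            since_improvement = 0
        else:
            since_improvement += 1
    return best, tried


space = {"max_depth": [2, 4, 6, 8], "learn_rate": [0.01, 0.05, 0.1, 0.2]}


def toy_score(params):
    # Hypothetical stand-in for a validation error; lowest at depth 6, rate 0.1
    return abs(params["max_depth"] - 6) + 10 * abs(params["learn_rate"] - 0.1)


grid_best = cartesian_search(space, toy_score)  # always evaluates all 16 combinations
rand_best, models_built = random_search(space, toy_score,
                                        max_models=16, stopping_rounds=5)
```

Cartesian search is exhaustive, so its cost grows with the product of the list sizes. Random search trades the guarantee of finding the grid optimum for a bounded budget.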


Going to Production: Model Training

• automation of the training process

• Python, Scala, Java, R, C#, Go


Going to Production: predict()

• all H2O models can be output as POJOs (Plain Old Java Objects)

• POJOs stand alone: no H2O runtime is required

• can support other target languages if needed


class Grid_GBM_arrhythmia_hex_model_safari_1465838195588_2_model_2_Tree_0_class_0 {
  static final double score0(double[] data) {
    double pred = (data[2 /* C4 */] <44.5f ?
        (data[261 /* C280 */] <1.5f ? -1.9671239f :
        (data[176 /* C192 */] <1.4980469f ? -3.1304572f :
        (data[251 /* C269 */] <24.328125f ? -4.4071236f : -3.780457f))) :
        (data[195 /* C212 */] <6.354785f ?
        (data[151 /* C167 */] <1.8351562f ?
        (data[223 /* C240 */] <0.8359375f ?
        (data[225 /* C242 */] <7.9875f ?
        (data[164 /* C180 */] <-0.034375f ?
        (data[205 /* C222 */] <0.615f ?
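
The generated scoring code is one nested chain of ternary expressions: each comparison picks a branch of a decision tree and each constant is a leaf value. As a reading aid, the branch that is fully visible on the slide (`data[2] < 44.5`) can be transcribed into Python as below. The function names and the sample row are hypothetical, Python doubles stand in for Java floats, and the other branch is left unimplemented because the slide truncates it.

```python
def score_left_subtree(data):
    """Mirror of the nested ternaries for the branch where data[2] < 44.5."""
    if data[261] < 1.5:          # C280
        return -1.9671239
    if data[176] < 1.4980469:    # C192
        return -3.1304572
    if data[251] < 24.328125:    # C269
        return -4.4071236
    return -3.780457


def score0(data):
    """Score one row with the visible part of the tree."""
    if data[2] < 44.5:           # C4
        return score_left_subtree(data)
    # The slide cuts off inside this branch, so it is deliberately not reconstructed.
    raise NotImplementedError("right-hand branch is truncated in the source")


row = [0.0] * 280   # hypothetical input row, wide enough for index 261
row[2] = 30.0       # takes the visible branch
row[261] = 2.0      # fails the first split
row[176] = 1.0      # passes the second split
result = score0(row)
```

Because the tree is just comparisons and constants, scoring needs no library, which is why a POJO can run without the H2O runtime.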


Where to learn more?

• H2O Online Training (free): http://learn.h2o.ai

• H2O Slidedecks: http://www.slideshare.net/0xdata

• H2O Video Presentations: https://www.youtube.com/user/0xdata

• H2O Community Events & Meetups: http://h2o.ai/events

• Machine Intelligence H2O Booklets: https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/booklets

