Sparkling Water Webinar October 29th, 2014

50
OCT 29th, 2014 WEBINAR H 2 O st. Scalable. Machine Learning r Smarter Applications “Fluids are In, Animals are Out.” Svetlana Sicular, Gartner

Transcript of Sparkling Water Webinar October 29th, 2014

Page 1: Sparkling Water Webinar October 29th, 2014

OCT 29th, 2014 WEBINAR

H2O

Fast. Scalable. Machine LearningFor Smarter Applications

“Fluids are In, Animals are Out.”

~ Svetlana Sicular, Gartner

Page 2: Sparkling Water Webinar October 29th, 2014

SpeakersJoel Horwitz

Joel is a caffeine, data, and laughter driven product strategist. He is an active community member having founded Bay Area Analytics, Tweets regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks regularly at industry events. Always eager to learn and lend a helping hand makes him an invaluable asset to 0xdata.

Michal Malohlava

Michal is a geek, developer, Java, Linux, programming languages enthusiast developing software for over 10 years. He obtained PhD from the Charles University in Prague in 2012 and post-doc at Purdue University.

H2O World Register at http://www.0xdata.com/h2o-world

Page 3: Sparkling Water Webinar October 29th, 2014

Time is the only non-renewable resource.

Speed Matters!

Page 4: Sparkling Water Webinar October 29th, 2014

Today

• Why Are We Here

• Who We Are

• How Do We Do It

• Who We Work With

• What We Believe

• Demo and Q&A

Page 5: Sparkling Water Webinar October 29th, 2014

A New Interpretation of Moore’s Law

“Like the physical universe, the digital universe is large - by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe -

the data we create and copy annually - will reach 44 zettabytes, or 44 trillion gigabytes.”

- IDC 2014

Page 6: Sparkling Water Webinar October 29th, 2014

An Evolving Market to Meet the Demand

RDBMS MPP

BusinessIntelligence

Data Science

MachineLearning

H2O

DistributedFile System

Page 7: Sparkling Water Webinar October 29th, 2014

Decreasing Cost of Data is Driving Demand

$ / GB (-50% every 18 months)

Algorithm Sophistication

1970 1980 1990 2000 2010 2020

H2O

Page 8: Sparkling Water Webinar October 29th, 2014

H2O is the First DedicatedMachine Learning Open Source PlatformH2O is for application developers and analysts who

need scalable and fast machine learning. H2O is an

open source predictive analytics platform. Unlike

traditional analytics tools, H2O provides a combination

of extraordinary math, a high performance parallel

architecture, and unrivaled ease of use.

Page 9: Sparkling Water Webinar October 29th, 2014

Who Are We?

H2O

Page 10: Sparkling Water Webinar October 29th, 2014

H2O Awards and Accolades

• Top R Project of UserR Conference 2014

• Fortune Big Data All-Stars 2014, Arno Candel

• 100+ Meetups

• 6000+ Users

Page 11: Sparkling Water Webinar October 29th, 2014

H2O is Built for Speed and Scale

• OpenSource

• REST API

• Native R Support

• NanoFastTM Scoring Engine

• Sophisticated Algorithms

Page 12: Sparkling Water Webinar October 29th, 2014

H2O Seamlessly Integrates with Your Workflow

• 20X Faster Imports and 3X Compression w/ .hex format.

• 4 Billion Row Regression in Seconds.

• Deploy in POJO or with our REST API

Page 13: Sparkling Water Webinar October 29th, 2014

Code is incomplete without Community!

Open Source Drives Innovation.

Page 14: Sparkling Water Webinar October 29th, 2014

Law of Large Numbers Triumphs!

Page 15: Sparkling Water Webinar October 29th, 2014

Every Generation Needs to Invent Its Own Math.

ML is the new SQL!

Page 16: Sparkling Water Webinar October 29th, 2014

What do our customers say about us?"The platform can generate Jar files to deploy models into production. This alone is a milestone!" - Hassan Namarvar, ShareThis

“I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.” – Nachum Shacham, Paypal

“I analyzed 1 million rows training set, fitting a logistic regression with elastic penalty, and doing a grid search on parameters with 10-fold cross validation for each parameter combination… doing this analysis was a breeze… orders of magnitude faster than R.” - Antonio Molins, Netflix

“Never have we had such a quick, simple, scalable and cost effective deployment solution for predictive modeling.” – Lou Carvalheira, Cisco

Page 17: Sparkling Water Webinar October 29th, 2014

AdvertisingBetter Conversions

Conversion ReachBrand ROI

Overall, I would say that the H2O platform is the most elegant open source in-memory ML engine.

~ Hassan Namarvar, Principal Data Scientist

Page 18: Sparkling Water Webinar October 29th, 2014

FraudBetter Detections

Purchase

I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.

~ Nachum Shacham, Principal Data Scientist

Shopping Theft Passwords

Page 19: Sparkling Water Webinar October 29th, 2014

MarketingBetter Spend

ROI

H2O has established a new equilibrium point for performance,accuracy and cost for statistics and machine learning.

~ Lou Carvalheira, Principal Data Scientist

Network Segments Measure

Page 20: Sparkling Water Webinar October 29th, 2014

Select Customers Powered by H2O

Page 21: Sparkling Water Webinar October 29th, 2014

Sparkling Water“Killer App for Spark”

@hexadata & @mmalohlavapresents

Page 22: Sparkling Water Webinar October 29th, 2014

Memory efficient

Performance of computation

Machine learning algorithms

Parser, GUI, R-interface

User-friendly API

Large and active community

Platform components - SQL

Multitenancy

Page 23: Sparkling Water Webinar October 29th, 2014

Sparkling Water

+RDD

immutableworld

DataFramemutableworld

Page 24: Sparkling Water Webinar October 29th, 2014

Sparkling Water

RDD DataFrame

Page 25: Sparkling Water Webinar October 29th, 2014

Sparkling Water

Provides

Transparent integration into Spark ecosystem

Pure H2ORDD encapsulating H2O DataFrame

Transparent use of H2O data structures and algorithms with Spark API

Excels in Spark workflows requiring advanced Machine Learning algorithms

Page 26: Sparkling Water Webinar October 29th, 2014

Sparkling Water Design

spark-submitSpark

MasterJVM

SparkWorker

JVM

SparkWorker

JVM

SparkWorker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Contains applicationand Sparkling Water

classes

Sparkling Appjar file

implements

Page 27: Sparkling Water Webinar October 29th, 2014

Data Distribution

H2O

H2O

H2O

Sparkling Water Cluster

Spark Executor JVMDataSource

(e.g. HDFS)

H2ORDD

Spark Executor JVM

Spark Executor JVM

SparkRDD

RDDs and DataFramesshare same memory

space

Page 28: Sparkling Water Webinar October 29th, 2014

Demo Time

Page 29: Sparkling Water Webinar October 29th, 2014

Flight delay prediction

“Build a model using weather and flight data to predict

delays of flights arriving to Chicago O’Hare International

Airport”

Page 30: Sparkling Water Webinar October 29th, 2014

Example Outline

Load & Parse CSV data from 2 data sources

Use Spark API to filter data, do SQL query for join

Create a regression model

Use model for delay prediction

Plot residual plot from R

Page 31: Sparkling Water Webinar October 29th, 2014

Sparkling Water Requirements

Linux or Mac OS X

Oracle Java 1.7

Page 32: Sparkling Water Webinar October 29th, 2014

Downloadhttp://0xdata.com/download/

Page 33: Sparkling Water Webinar October 29th, 2014

Install and Launch

Unpack zip file

and

Point SPARK_HOME to your Spark installation

and

Launch h2o-examples/sparkling-shell

Page 34: Sparkling Water Webinar October 29th, 2014

What is Sparkling Shell?

Standard spark-shell

With additional Sparkling Water classes

export MASTER=“local-cluster[3,2,1024]”

spark-shell \ —-jars sparkling-water.jar JAR containing

Sparkling Water

Spark Masteraddress

Page 35: Sparkling Water Webinar October 29th, 2014

Lets play with Sparkling shell…

Page 36: Sparkling Water Webinar October 29th, 2014

Create H2O Client

import org.apache.spark.h2o._import org.apache.spark.examples.h2o._

val h2oContext = new H2OContext(sc).start(3)import h2oContext._

Regular Spark contextprovided by Spark shell

Size of demandedH2O cloud

Contains implicit utility functions Demo specificclasses

Page 37: Sparkling Water Webinar October 29th, 2014

Is Spark Running?Go to http://localhost:4040

Page 38: Sparkling Water Webinar October 29th, 2014

Is H2O running?http://localhost:54321/steam/index.html

Page 39: Sparkling Water Webinar October 29th, 2014

Load Data #1Load weather data into RDD

val weatherDataFile = “examples/smalldata/weather.csv"

val wrawdata = sc.textFile(weatherDataFile,3).cache()

val weatherTable = wrawdata.map(_.split(“,")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

Regular Spark API

Ad-hoc Parser

Page 40: Sparkling Water Webinar October 29th, 2014

Weather Data

case class Weather( val Year : Option[Int], val Month : Option[Int], val Day : Option[Int], val TmaxF : Option[Int], // Max temperatur in F val TminF : Option[Int], // Min temperatur in F val TmeanF : Option[Float], // Mean temperatur in F val PrcpIn : Option[Float], // Precipitation (inches) val SnowIn : Option[Float], // Snow (inches) val CDD : Option[Float], // Cooling Degree Day val HDD : Option[Float], // Heating Degree Day val GDD : Option[Float]) // Growing Degree Day

Simple POSO to hold one row of weather data

Page 41: Sparkling Water Webinar October 29th, 2014

Load Data #2Load flights data into H2O frame

import java.io.File

val dataFile = “examples/smalldata/allyears2k_headers.csv.gz"

val airlinesData = new DataFrame(new File(dataFile))

Page 42: Sparkling Water Webinar October 29th, 2014

Where is the data?Go to http://localhost:54321/steam/index.html

Page 43: Sparkling Water Webinar October 29th, 2014

Use Spark API for Data Filtering

// Create RDD wrapper around DataFrameval airlinesTable : RDD[Airlines]

= toRDD[Airlines](airlinesData)

// And use Spark RDD API directlyval flightsToORD = airlinesTable

.filter( f => f.Dest == Some(“ORD") )

Regular SparkRDD call

Create a cheap wrapperaround H2O DataFrame

Page 44: Sparkling Water Webinar October 29th, 2014

Use Spark SQL to Data Join

import org.apache.spark.sql.SQLContext// We need to create SQL context val sqlContext = new SQLContext(sc)import sqlContext._

flightsToORD.registerTempTable("FlightsToORD")weatherTable.registerTempTable("WeatherORD")

Page 45: Sparkling Water Webinar October 29th, 2014

Join Data based on Flight Date

val bigTable = sql( """SELECT | f.Year,f.Month,f.DayofMonth, | f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime, | f.UniqueCarrier,f.FlightNum,f.TailNum, | f.Origin,f.Distance, | w.TmaxF,w.TminF,w.TmeanF, | w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD, | f.ArrDelay | FROM FlightsToORD f | JOIN WeatherORD w | ON f.Year=w.Year AND f.Month=w.Month | AND f.DayofMonth=w.Day""".stripMargin)

Page 46: Sparkling Water Webinar October 29th, 2014

Launch H2O Algorithmsimport hex.deeplearning._import hex.deeplearning.DeepLearningModel

.DeepLearningParameters

// Setup deep learning parametersval dlParams = new DeepLearningParameters()dlParams._train = bigTabledlParams._response_column = 'ArrDelaydlParams._classification = false

// Create a new model builderval dl = new DeepLearning(dlParams)

val dlModel = dl.train.get

Result of SQL query

Blocking call

Page 47: Sparkling Water Webinar October 29th, 2014

Make a prediction

// Use model to score dataval prediction = dlModel.score(result)(‘predict)

// Collect predicted values via RDD APIval predictionValues = toRDD[DoubleHolder](prediction) .collect .map ( _.result.getOrElse("NaN") )

Page 48: Sparkling Water Webinar October 29th, 2014

Generate Residuals Plot# Import H2O library and initialize H2O clientlibrary(h2o)

h = h2o.init()

# Fetch prediction and actual data, use remembered keyspred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")

# Select right columnspredDelay = pred$predictactDelay = act$ArrDelay

# Make sure that number of rows is same nrow(actDelay) == nrow(predDelay)

# Compute residuals residuals = predDelay - actDelay

# Plot residuals compare = cbind(

as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))

plot( compare[,1:2] )

Referencesof data

Page 49: Sparkling Water Webinar October 29th, 2014

More infoCheckout 0xdata Blog for Sparkling Water tutorials

http://0xdata.com/blog/

Checkout 0xdata Youtube Channel

https://www.youtube.com/user/0xdata

Checkout github

https://github.com/0xdata/sparkling-water

Page 50: Sparkling Water Webinar October 29th, 2014

Learn more about H2O at0xdata.com

or

Thank you!

Follow us at @hexadata

neo> for r in sparkling-water; do git clone “[email protected]:0xdata/$r.git”done