Sparkling Water Webinar October 29th, 2014

OCT 29th, 2014 WEBINAR

H2O

Fast. Scalable. Machine LearningFor Smarter Applications

“Fluids are In, Animals are Out.”

~ Svetlana Sicular, Gartner

SpeakersJoel Horwitz

Joel is a caffeine, data, and laughter driven product strategist. He is an active community member having founded Bay Area Analytics, Tweets regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks regularly at industry events. Always eager to learn and lend a helping hand makes him an invaluable asset to 0xdata.

Michal Malohlava

Michal is a geek, developer, Java, Linux, programming languages enthusiast developing software for over 10 years. He obtained PhD from the Charles University in Prague in 2012 and post-doc at Purdue University.

H2O World Register at http://www.0xdata.com/h2o-world

http://www.0xdata.com/h2o-world

Time is the only non-renewable resource.

Speed Matters!

Today

• Why Are We Here

• Who We Are

• How Do We Do It

• Who We Work With

• What We Believe

• Demo and Q&A

A New Interpretation of Moore’s Law

“Like the physical universe, the digital universe is large - by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe -

the data we create and copy annually - will reach 44 zettabytes, or 44 trillion gigabytes.”

- IDC 2014

An Evolving Market to Meet the Demand

RDBMS MPP

BusinessIntelligence

Data Science

MachineLearning

H2O

DistributedFile System

Decreasing Cost of Data is Driving Demand

$ / GB (-50% every 18 months)

Algorithm Sophistication

1970 1980 1990 2000 2010 2020

H2O

H2O is the First DedicatedMachine Learning Open Source PlatformH2O is for application developers and analysts who

need scalable and fast machine learning. H2O is an

open source predictive analytics platform. Unlike

traditional analytics tools, H2O provides a combination

of extraordinary math, a high performance parallel

architecture, and unrivaled ease of use.

Who Are We?

H2O

H2O Awards and Accolades

• Top R Project of UserR Conference 2014

• Fortune Big Data All-Stars 2014, Arno Candel

• 100+ Meetups

• 6000+ Users

H2O is Built for Speed and Scale

• OpenSource

• REST API

• Native R Support

• NanoFastTM Scoring Engine

• Sophisticated Algorithms

H2O Seamlessly Integrates with Your Workflow

• 20X Faster Imports and 3X Compression w/ .hex format.

• 4 Billion Row Regression in Seconds.

• Deploy in POJO or with our REST API

Code is incomplete without Community!

Open Source Drives Innovation.

Law of Large Numbers Triumphs!

Every Generation Needs to Invent Its Own Math.

ML is the new SQL!

What do our customers say about us?"The platform can generate Jar files to deploy models into production. This alone is a milestone!" - Hassan Namarvar, ShareThis

“I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.” – Nachum Shacham, Paypal

“I analyzed 1 million rows training set, fitting a logistic regression with elastic penalty, and doing a grid search on parameters with 10-fold cross validation for each parameter combination… doing this analysis was a breeze… orders of magnitude faster than R.” - Antonio Molins, Netflix

“Never have we had such a quick, simple, scalable and cost effective deployment solution for predictive modeling.” – Lou Carvalheira, Cisco

AdvertisingBetter Conversions

Conversion ReachBrand ROI

Overall, I would say that the H2O platform is the most elegant open source in-memory ML engine.

~ Hassan Namarvar, Principal Data Scientist

FraudBetter Detections

Purchase

I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.

~ Nachum Shacham, Principal Data Scientist

Shopping Theft Passwords

MarketingBetter Spend

ROI

H2O has established a new equilibrium point for performance,accuracy and cost for statistics and machine learning.

~ Lou Carvalheira, Principal Data Scientist

Network Segments Measure

Select Customers Powered by H2O

Sparkling Water“Killer App for Spark”

@hexadata & @mmalohlavapresents

Memory efficient

Performance of computation

Machine learning algorithms

Parser, GUI, R-interface

User-friendly API

Large and active community

Platform components - SQL

Multitenancy

Sparkling Water

+RDD

immutableworld

DataFramemutableworld

Sparkling Water

RDD DataFrame

Sparkling Water

Provides

Transparent integration into Spark ecosystem

Pure H2ORDD encapsulating H2O DataFrame

Transparent use of H2O data structures and algorithms with Spark API

Excels in Spark workflows requiring advanced Machine Learning algorithms

Sparkling Water Design

spark-submitSpark

MasterJVM

SparkWorker

JVM

SparkWorker

JVM

SparkWorker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Contains applicationand Sparkling Water

classes

Sparkling Appjar file

implements

Data Distribution

H2O

H2O

H2O

Sparkling Water Cluster

Spark Executor JVMDataSource

(e.g. HDFS)

H2ORDD

Spark Executor JVM

Spark Executor JVM

SparkRDD

RDDs and DataFramesshare same memory

space

Demo Time

Flight delay prediction

“Build a model using weather and flight data to predict

delays of flights arriving to Chicago O’Hare International

Airport”

Example Outline

Load & Parse CSV data from 2 data sources

Use Spark API to filter data, do SQL query for join

Create a regression model

Use model for delay prediction

Plot residual plot from R

Sparkling Water Requirements

Linux or Mac OS X

Oracle Java 1.7

Downloadhttp://0xdata.com/download/

Install and Launch

Unpack zip file

and

Point SPARK_HOME to your Spark installation

and

Launch h2o-examples/sparkling-shell

What is Sparkling Shell?

Standard spark-shell

With additional Sparkling Water classes

export MASTER=“local-cluster[3,2,1024]”

spark-shell \ —-jars sparkling-water.jar JAR containing

Sparkling Water

Spark Masteraddress

Lets play with Sparkling shell…

Create H2O Client

import org.apache.spark.h2o._import org.apache.spark.examples.h2o._

val h2oContext = new H2OContext(sc).start(3)import h2oContext._

Regular Spark contextprovided by Spark shell

Size of demandedH2O cloud

Contains implicit utility functions Demo specificclasses

Is Spark Running?Go to http://localhost:4040

Is H2O running?http://localhost:54321/steam/index.html

Load Data #1Load weather data into RDD

val weatherDataFile = “examples/smalldata/weather.csv"

val wrawdata = sc.textFile(weatherDataFile,3).cache()

val weatherTable = wrawdata.map(_.split(“,")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

Regular Spark API

Ad-hoc Parser

Weather Data

case class Weather( val Year : Option[Int], val Month : Option[Int], val Day : Option[Int], val TmaxF : Option[Int], // Max temperatur in F val TminF : Option[Int], // Min temperatur in F val TmeanF : Option[Float], // Mean temperatur in F val PrcpIn : Option[Float], // Precipitation (inches) val SnowIn : Option[Float], // Snow (inches) val CDD : Option[Float], // Cooling Degree Day val HDD : Option[Float], // Heating Degree Day val GDD : Option[Float]) // Growing Degree Day

Simple POSO to hold one row of weather data

Load Data #2Load flights data into H2O frame

import java.io.File

val dataFile = “examples/smalldata/allyears2k_headers.csv.gz"

val airlinesData = new DataFrame(new File(dataFile))

Where is the data?Go to http://localhost:54321/steam/index.html

Use Spark API for Data Filtering

// Create RDD wrapper around DataFrameval airlinesTable : RDD[Airlines]

= toRDD[Airlines](airlinesData)

// And use Spark RDD API directlyval flightsToORD = airlinesTable

.filter( f => f.Dest == Some(“ORD") )

Regular SparkRDD call

Create a cheap wrapperaround H2O DataFrame

Use Spark SQL to Data Join

import org.apache.spark.sql.SQLContext// We need to create SQL context val sqlContext = new SQLContext(sc)import sqlContext._

flightsToORD.registerTempTable("FlightsToORD")weatherTable.registerTempTable("WeatherORD")

Launch H2O Algorithmsimport hex.deeplearning._import hex.deeplearning.DeepLearningModel

.DeepLearningParameters

// Setup deep learning parametersval dlParams = new DeepLearningParameters()dlParams._train = bigTabledlParams._response_column = 'ArrDelaydlParams._classification = false

// Create a new model builderval dl = new DeepLearning(dlParams)

val dlModel = dl.train.get

Result of SQL query

Blocking call

Make a prediction

// Use model to score dataval prediction = dlModel.score(result)(‘predict)

// Collect predicted values via RDD APIval predictionValues = toRDD[DoubleHolder](prediction) .collect .map ( _.result.getOrElse("NaN") )

Generate Residuals Plot# Import H2O library and initialize H2O clientlibrary(h2o)

h = h2o.init()

# Fetch prediction and actual data, use remembered keyspred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")

# Select right columnspredDelay = pred$predictactDelay = act$ArrDelay

# Make sure that number of rows is same nrow(actDelay) == nrow(predDelay)

# Compute residuals residuals = predDelay - actDelay

# Plot residuals compare = cbind(

as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))

plot( compare[,1:2] )

Referencesof data

More infoCheckout 0xdata Blog for Sparkling Water tutorials

http://0xdata.com/blog/

Checkout 0xdata Youtube Channel

https://www.youtube.com/user/0xdata

Checkout github

https://github.com/0xdata/sparkling-water

Learn more about H2O at0xdata.com

or

Thank you!

Follow us at @hexadata

neo> for r in sparkling-water; do git clone “[email protected]:0xdata/$r.git”done

http://0xdata.com/

mailto:[email protected]

Sparkling Water Webinar October 29th, 2014

Technology

Transcript of Sparkling Water Webinar October 29th, 2014