2014 09 30_sparkling_water_hands_on

Post on 28-Nov-2014

How Sparkling Water brings fast, scalable machine learning via H2O to Apache Spark. By Michal Malohlava and H2O.ai. Our 100th Meetup at 0xdata, September 30, 2014. Open Source meets Out Door.

Sparkling Water

“Killer App for Spark”

@hexadata & @mmalohlava present

Spark and H2O

Several months ago…

Sparkling Water

Before

Tachyon based

Unnecessary data duplication



Transparent use of H2O data and algorithms with Spark API

Sparkling Water

DataFrame: mutable world

[Diagram: RDD, DataFrame]

Sparkling Water Design

[Diagram: Sparkling App (jar file), Spark Master JVM, three Spark Workers, and a Sparkling Water Cluster made up of three Spark Executor JVMs]

Data Distribution




[Diagram: Data Source (e.g. HDFS), a Sparkling Water Cluster of three Spark Executor JVMs, and a Spark RDD]

Hands-on Time


Load & parse CSV data

Use Spark API, do SQL query

Create Deep Learning model

Use model for prediction


Linux or Mac OS X

Oracle Java 1.7

Virtual image is provided for Windows users


Install and Launch

Unpack the zip file, or open the provided virtual image in VirtualBox,

and launch h2o-examples/sparkling-shell

What is Sparkling Shell?

Standard spark-shell

Launch H2O extension

export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension

JAR containing H2O code

Name of H2O extension provided by JAR

Spark Master address

…more on launching…

‣ By default single JVM, multi-threaded (export MASTER=local[*]) or

‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster or

‣ Launch standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
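In Spark's local-cluster master URL, the three numbers stand for the number of workers, the cores per worker, and the memory per worker in MB. The sketch below only decodes such a string to make that concrete; the LocalCluster case class and the parseLocalCluster helper are hypothetical names for illustration and are not part of Spark or Sparkling Water.

```scala
// Hypothetical helper for illustration: decodes a Spark local-cluster master URL.
final case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

// Matches strings such as local-cluster[3,2,1024], allowing spaces around the numbers.
val LocalClusterPattern = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

def parseLocalCluster(master: String): Option[LocalCluster] = master match {
  case LocalClusterPattern(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
  case _                            => None // any other master URL, e.g. local[*]
}

val cluster = parseLocalCluster("local-cluster[3,2,1024]")
println(cluster)
```

Read this way, local-cluster[3,2,1024] asks for 3 workers with 2 cores and 1024 MB each.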

Let's play with Sparkling Shell…

Create H2O Client

import water.{H2O,H2OClientApp}
H2OClientApp.start()
H2O.waitForCloudSize(3, 10000)

Is Spark running? http://localhost:4040

Is H2O running? http://localhost:54321/steam/index.html

Data

Load some data and parse it

import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves parse of data
val airlinesData = new DataFrame(new File(dataFile))

Where is the data? Go to http://localhost:54321/steam/index.html

Use Spark API

// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f =>
  f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") )
flightsOnlyToSF.count
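The filter compares Dest against Some("SFO") rather than "SFO", which suggests that the fields of the Airlines class are Options, so a missing value shows up as None. Below is a minimal sketch of the same predicate on a plain Scala collection, without Spark or H2O. The Flight case class is a hypothetical stand-in for Airlines, and the rows are made up for illustration.

```scala
// Hypothetical stand-in for the Airlines row class; only the Dest field is modelled.
final case class Flight(Dest: Option[String])

// Made-up sample rows, including one with a missing destination.
val flights = Seq(Flight(Some("SFO")), Flight(Some("JFK")), Flight(None), Flight(Some("OAK")))

// Same predicate as in the RDD filter, written against Some(...) values.
val viaEquality = flights.filter { f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK")
}

// Equivalent form: test the wrapped value against a set of airport codes.
val bayAreaCodes = Set("SFO", "SJC", "OAK")
val viaExists = flights.filter(f => f.Dest.exists(bayAreaCodes.contains))

println(viaExists.size)
```

Both forms skip rows whose destination is None, so no null check is needed.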

Use Spark SQL

import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke query
val result = sql(query) // Using a registered context and tables
result.count

assert(result.count == flightsOnlyToSF.count)

Launch H2O Algorithms

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
  'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
  'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder
val dl = new DeepLearning(dlParams)

val dlModel = dl.train.get

Make a prediction

// Use model to score data
val prediction = dlModel.score(result)('predict)

// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))
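Since getOrElse is called on _.result, that field appears to be an Option, and falling back to the string "NaN" leaves the collected values as a mix of numbers and strings. The sketch below shows one way to keep them numeric. Holder is a hypothetical stand-in for DoubleHolder, under the assumption that its result field is an Option[Double]; the values are made up.

```scala
// Hypothetical stand-in for DoubleHolder, assuming result is an Option[Double].
final case class Holder(result: Option[Double])

// Made-up predictions, one of them missing.
val predictions = Seq(Holder(Some(0.8)), Holder(None), Holder(Some(0.1)))

// Falling back to Double.NaN keeps the element type Double instead of widening to Any.
val numeric: Seq[Double] = predictions.map(_.result.getOrElse(Double.NaN))

// Alternatively, drop the missing predictions altogether.
val present: Seq[Double] = predictions.flatMap(_.result)

println(numeric.count(_.isNaN))
```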

What is under the hood?

Spark App Extension

/** Notion of Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start extension */
  def start(conf: SparkConf): Unit
  /** Method to stop extension */
  def stop(conf: SparkConf): Unit
  /* Point in Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /* User-friendly description of extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only Executor life cycle is supported.
  */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
}
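To make the extension contract concrete, here is a minimal sketch of an extension that only records its lifecycle calls. It repeats a simplified copy of the trait so that it runs without Spark. FakeConf is a hypothetical stand-in for SparkConf, and RecordingExtension is an illustrative name, not something shipped with Sparkling Water.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for org.apache.spark.SparkConf so the sketch runs without Spark.
final case class FakeConf(settings: Map[String, String])

object InterceptionPoints extends Enumeration {
  val EXECUTOR_LC = Value // inject into executor lifecycle
}

// Simplified copy of the trait from the slide, using FakeConf instead of SparkConf.
trait PlatformExtension extends Serializable {
  def start(conf: FakeConf): Unit
  def stop(conf: FakeConf): Unit
  def intercept: InterceptionPoints.Value = InterceptionPoints.EXECUTOR_LC
  def desc: String
}

// Illustrative extension that just records when it is started and stopped.
class RecordingExtension extends PlatformExtension {
  val events = ArrayBuffer.empty[String]

  def start(conf: FakeConf): Unit = {
    val size = conf.settings.getOrElse("spark.h2o.cluster.size", "?")
    events += s"start(size=$size)"
  }

  def stop(conf: FakeConf): Unit = {
    events += "stop"
  }

  def desc: String = "recording extension"
}

val ext = new RecordingExtension
val fakeConf = FakeConf(Map("spark.h2o.cluster.size" -> "3"))
ext.start(fakeConf)
ext.stop(fakeConf)
println(ext.events.mkString(", "))
```

A real extension would presumably start and stop the H2O node inside the executor JVM in these two methods, with the interception point deciding where in Spark the calls are made.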

Using App Extensions

val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Setup expected size of H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add H2O extension
conf.addExtension[H2OPlatformExtension]

// Create Spark Context
val sc = new SparkContext(conf)

Spark Changes

We keep them small (~30 lines of code)

JIRA SPARK-3270 - Platform App Extensions


You can participate!

Epic PUBDEV-21, aka Sparkling Water

PUBDEV-23 Test HDFS reader

PUBDEV-26 Implement toSchemaRDD

PUBDEV-27 Boolean transfers

PUBDEV-31 Support toRDD[ X <: Numeric]

PUBDEV-32/33 Mesos/YARN support

