2014 09 30_sparkling_water_hands_on

Sparkling Water “Killer App for Spark” @hexadata & @mmalohlava presents

description

How Sparkling Water brings fast, scalable machine learning via H2O to Apache Spark. By Michal Malohlava and H2O.ai. Our 100th Meetup at 0xdata, September 30, 2014. Open Source meets Out Door.

Transcript of 2014 09 30_sparkling_water_hands_on

Page 1: 2014 09 30_sparkling_water_hands_on

Sparkling Water “Killer App for Spark”

@hexadata & @mmalohlava presents

Page 2: 2014 09 30_sparkling_water_hands_on

Spark and H2O

Several months ago…

Page 3: 2014 09 30_sparkling_water_hands_on

Sparkling Water

Before

Tachyon based

Unnecessary data duplication

Now

Pure H2ORDD

Transparent use of H2O data and algorithms with Spark API

Page 4: 2014 09 30_sparkling_water_hands_on

Sparkling Water

[Diagram; recoverable labels only]

RDD: immutable world

DataFrame: mutable world

Page 5: 2014 09 30_sparkling_water_hands_on

Sparkling Water

[Diagram; recoverable labels only]

RDD

DataFrame

Page 6: 2014 09 30_sparkling_water_hands_on

Sparkling Water Design

[Diagram; labels only]

Sparkling App (jar file)

spark-submit

Spark Master JVM

Spark Worker JVM (x3)

Sparkling Water Cluster: Spark Executor JVM with H2O (x3)

Page 7: 2014 09 30_sparkling_water_hands_on

Data Distribution

[Diagram; labels only]

Data Source (e.g. HDFS)

Sparkling Water Cluster: Spark Executor JVM with H2O (x3)

H2O RDD

Spark RDD

Page 8: 2014 09 30_sparkling_water_hands_on

Hands-on Time

Page 9: 2014 09 30_sparkling_water_hands_on

Example

Load & parse CSV data

Use Spark API, do SQL query

Create Deep Learning model

Use model for prediction

Page 10: 2014 09 30_sparkling_water_hands_on

Requirements

Linux or Mac OS X

Oracle Java 1.7

Virtual image is provided for Windows users

Page 11: 2014 09 30_sparkling_water_hands_on

Download

http://0xdata.com/download/

Page 12: 2014 09 30_sparkling_water_hands_on

Install and Launch

Unpack the zip file or open the provided virtual image in VirtualBox

and launch h2o-examples/sparkling-shell

Page 13: 2014 09 30_sparkling_water_hands_on

What is Sparkling Shell?

Standard spark-shell

Launch H2O extension

export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension

MASTER: Spark Master address

--jars: JAR containing H2O code

--conf spark.extensions: name of H2O extension provided by JAR

Page 14: 2014 09 30_sparkling_water_hands_on

…more on launching…

‣ By default single JVM, multi-threaded (export MASTER=local[*]) or

‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster or

‣ Launch standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
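
The three MASTER forms listed above differ only in what the string encodes. In `local-cluster[3,2,1024]` the numbers are the worker count, the cores per worker, and the memory per worker in MB. The sketch below is illustrative only and is not Spark's own parsing code; the names `Master`, `Local`, `LocalCluster`, `Standalone` and `MasterUrl` are made up for this example.

```scala
// Illustrative classifier for the MASTER strings shown on this slide.
// All type and object names here are hypothetical, not part of Spark.
sealed trait Master
case class Local(threads: String) extends Master // "*" or a thread count
case class LocalCluster(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int) extends Master
case class Standalone(host: String, port: Int) extends Master

object MasterUrl {
  private val LocalRe        = """local\[(\*|\d+)\]""".r
  private val LocalClusterRe = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r
  private val StandaloneRe   = """spark://([^:]+):(\d+)""".r

  // Regex extractors match the whole string, so the patterns cannot overlap.
  def parse(master: String): Option[Master] = master match {
    case LocalRe(t)              => Some(Local(t))
    case LocalClusterRe(w, c, m) => Some(LocalCluster(w.toInt, c.toInt, m.toInt))
    case StandaloneRe(h, p)      => Some(Standalone(h, p.toInt))
    case _                       => None
  }
}
```

For instance, `MasterUrl.parse("local-cluster[3,2,1024]")` yields a `LocalCluster` with 3 workers, 2 cores each and 1024 MB each, which is the embedded cluster used in the hands-on part.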

Page 15: 2014 09 30_sparkling_water_hands_on

Let's play with Sparkling shell…

Page 16: 2014 09 30_sparkling_water_hands_on

Create H2O Client

import water.{H2O,H2OClientApp}
// Start the H2O client, then wait (up to 10000 ms) for a cloud of 3 nodes
H2OClientApp.start()
H2O.waitForCloudSize(3, 10000)

Page 17: 2014 09 30_sparkling_water_hands_on

Is Spark running?

http://localhost:4040

Page 18: 2014 09 30_sparkling_water_hands_on

Is H2O running?

http://localhost:54321/steam/index.html

Page 19: 2014 09 30_sparkling_water_hands_on

Data

Load some data and parse it

import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"
// Create DataFrame - involves parse of data
val airlinesData = new DataFrame(new File(dataFile))

Page 20: 2014 09 30_sparkling_water_hands_on

Where is the data?

Go to http://localhost:54321/steam/index.html

Page 21: 2014 09 30_sparkling_water_hands_on

Use Spark API

import org.apache.spark.rdd.RDD
// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f =>
  f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") )
flightsOnlyToSF.count
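
The filter above compares `Dest` against `Some("SFO")`, which implies the field is an `Option[String]`, so rows with a missing destination never match. A minimal sketch of the same predicate on plain Scala collections, without Spark or H2O: `Flight` is a hypothetical stand-in that models only the `Dest` field, and `BayAreaFilter` is a made-up name.

```scala
// Hypothetical stand-in for the Airlines row type; only Dest is modelled.
case class Flight(Dest: Option[String])

object BayAreaFilter {
  private val bayArea = Set("SFO", "SJC", "OAK")

  // Equivalent to the three == Some(...) comparisons on the slide:
  // Option.exists is false for None and tests set membership otherwise.
  def toBayArea(f: Flight): Boolean = f.Dest.exists(bayArea)

  def main(args: Array[String]): Unit = {
    val flights = Seq(Flight(Some("SFO")), Flight(Some("JFK")), Flight(None), Flight(Some("OAK")))
    println(flights.count(toBayArea))
  }
}
```

The same function could be passed to an RDD's `filter`, since it is an ordinary predicate over the row type.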

Page 22: 2014 09 30_sparkling_water_hands_on

Use Spark SQL

import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke query
val result = sql(query) // Using a registered context and tables
result.count

assert(result.count == flightsOnlyToSF.count)

Page 23: 2014 09 30_sparkling_water_hands_on

Launch H2O Algorithms

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result(
  'Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
  'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
  'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder
val dl = new DeepLearning(dlParams)

val dlModel = dl.train.get

Page 24: 2014 09 30_sparkling_water_hands_on

Make a prediction

// Use model to score data
val prediction = dlModel.score(result)('predict)
// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse("NaN") )
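
One detail of the snippet above: falling back to the string "NaN" on an `Option[Double]` widens the collected array to `Array[Any]`. If numeric values are wanted downstream, a `Double.NaN` fallback keeps the element type. A minimal sketch on plain Scala arrays, where `PredictionRow` is a hypothetical stand-in for the `DoubleHolder` row type and `Predictions` is a made-up name:

```scala
// Hypothetical stand-in for DoubleHolder: one optional numeric prediction.
case class PredictionRow(result: Option[Double])

object Predictions {
  // Double.NaN as the fallback keeps the result an Array[Double];
  // a String fallback such as "NaN" would produce Array[Any] instead.
  def values(rows: Array[PredictionRow]): Array[Double] =
    rows.map(_.result.getOrElse(Double.NaN))
}
```

Missing predictions can then be detected with `isNaN` instead of a string comparison.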

Page 25: 2014 09 30_sparkling_water_hands_on

What is under the hood?

Page 26: 2014 09 30_sparkling_water_hands_on

Spark App Extension

/** Notion of Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start extension */
  def start(conf: SparkConf): Unit
  /** Method to stop extension */
  def stop(conf: SparkConf): Unit
  /* Point in Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /* User-friendly description of extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only Executor life cycle is supported.
  */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
}
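
To make the start/stop contract concrete, here is a rough sketch of what an implementation could look like. It is self-contained and does not depend on the patched Spark: `FakeConf` stands in for `SparkConf`, `Extension` is a simplified copy of the trait above without the interception point, and `RecordingExtension` is a hypothetical extension that only records lifecycle events. None of these names come from Spark or H2O.

```scala
// Stand-in for org.apache.spark.SparkConf so the sketch compiles on its own.
class FakeConf(settings: Map[String, String]) {
  def get(key: String, default: String): String = settings.getOrElse(key, default)
}

// Simplified copy of the slide's trait, using the stand-in conf type.
trait Extension extends Serializable {
  def start(conf: FakeConf): Unit
  def stop(conf: FakeConf): Unit
  def desc: String
  override def toString = desc
}

// Hypothetical extension: records when it is started and stopped.
class RecordingExtension extends Extension {
  val events = scala.collection.mutable.ArrayBuffer.empty[String]

  def start(conf: FakeConf): Unit = {
    // Reads the cluster-size key used elsewhere in this deck; "1" is an assumed default.
    events += s"start size=${conf.get("spark.h2o.cluster.size", "1")}"
  }

  def stop(conf: FakeConf): Unit = {
    events += "stop"
  }

  def desc: String = "recording"
}
```

A real extension would presumably boot the embedded service in `start` and shut it down in `stop`; the recording version only shows the order of calls.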

Page 27: 2014 09 30_sparkling_water_hands_on

Using App Extensions

val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Setup expected size of H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)
// Add H2O extension
conf.addExtension[H2OPlatformExtension]
// Create Spark Context from the configuration
val sc = new SparkContext(conf)

Page 28: 2014 09 30_sparkling_water_hands_on

Spark Changes

We keep them small (~30 lines of code)

JIRA SPARK-3270 - Platform App Extensions

https://issues.apache.org/jira/browse/SPARK-3270

Page 29: 2014 09 30_sparkling_water_hands_on

You can participate!

Epic PUBDEV-21 aka Sparkling Water

PUBDEV-23 Test HDFS reader

PUBDEV-26 Implement toSchemaRDD

PUBDEV-27 Boolean transfers

PUBDEV-31 Support toRDD[ X <: Numeric]

PUBDEV-32/33 Mesos/YARN support

Page 30: 2014 09 30_sparkling_water_hands_on

More info

Check out the 0xdata blog for tutorials

http://0xdata.com/blog/

Check out the 0xdata YouTube channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/0xdata/h2o-dev

https://github.com/0xdata/perrier

Page 31: 2014 09 30_sparkling_water_hands_on

Learn more about H2O at 0xdata.com

or

neo> for r in h2o-dev perrier; do
  git clone "[email protected]:0xdata/$r.git"
done

Thank you!

Follow us at @hexadata