2014-09-30: Sparkling Water Hands-On


How Sparkling Water brings fast, scalable machine learning via H2O to Apache Spark. By Michal Malohlava and H2O.ai. Our 100th meetup at 0xdata, September 30, 2014: Open Source meets Outdoors.

Transcript

Sparkling Water: “Killer App for Spark”

Presented by @hexadata & @mmalohlava

Spark and H2O: Several months ago…

Sparkling Water

Before:
‣ Tachyon-based
‣ Unnecessary data duplication

Now:
‣ Pure H2O RDD
‣ Transparent use of H2O data and algorithms with the Spark API

Sparkling Water

[Diagram: Spark RDD (immutable world) + H2O DataFrame (mutable world)]

Sparkling Water

[Diagram: conversions between Spark RDD and H2O DataFrame]
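To make the two-worlds picture concrete, here is a minimal sketch of crossing between them; it previews the hands-on code later in this deck and assumes the sparkling-shell environment (sc, the H2OContext implicits, and the Airlines type from the examples package), with a hypothetical CSV path:

import java.io.File
import org.apache.spark.rdd.RDD
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._

// Sketch only: bridge the mutable H2O world and the immutable Spark world
val h2oContext = new H2OContext(sc)
import h2oContext._

val df = new DataFrame(new File("some.csv"))   // hypothetical path; H2O side, mutable
val rdd: RDD[Airlines] = toRDD[Airlines](df)   // Spark side, immutable view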

Sparkling Water Design

[Diagram: a Sparkling App (jar file) is submitted via spark-submit to the Spark Master JVM, which launches work on the Spark Worker JVMs]

Sparkling Water Cluster

[Diagram: each Spark Executor JVM in the cluster runs an embedded H2O node]

Data Distribution

[Diagram: within the Sparkling Water cluster, data from a source (e.g. HDFS) is spread across the H2O nodes inside the Spark Executor JVMs and is visible both as an H2O RDD and as a Spark RDD]

Hands-on Time

Example:
‣ Load & parse CSV data
‣ Use the Spark API, run a SQL query
‣ Create a Deep Learning model
‣ Use the model for prediction

Requirements

‣ Linux or Mac OS X
‣ Oracle Java 1.7
‣ A virtual image is provided for Windows users

Download: http://0xdata.com/download/

Install and Launch

Unpack the zip file, or open the provided virtual image in VirtualBox, and launch h2o-examples/sparkling-shell

What is Sparkling Shell?

A standard spark-shell that launches the H2O extension:

export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension

‣ shaded.jar - the JAR containing the H2O code
‣ spark.extensions - the name of the H2O extension provided by the JAR
‣ MASTER - the Spark master address

…more on launching…

‣ By default, a single multi-threaded JVM (export MASTER=local[*]), or

‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster, or

‣ Launch a standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"

Let's play with the Sparkling shell…

Create H2O Client

import water.{H2O, H2OClientApp}

// Start an H2O client inside the Spark driver
H2OClientApp.start()
// Wait until the H2O cloud reaches 3 nodes (10 s timeout)
H2O.waitForCloudSize(3, 10000)

Is Spark running? http://localhost:4040

Is H2O running? http://localhost:54321/steam/index.html

Data: Load some data and parse it

import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._

val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create a DataFrame - this involves parsing the data
val airlinesData = new DataFrame(new File(dataFile))

Where is the data? Go to http://localhost:54321/steam/index.html

Use the Spark API

// H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter(
  f => f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
flightsOnlyToSF.count
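As a small extra sketch (not from the slides), the wrapped RDD supports the full Spark API; for example, a hypothetical count of flights per destination, assuming Dest is an Option[String] as in the filter above:

// Illustrative only: flights per destination via map + reduceByKey
val flightsPerDest = airlinesTable
  .map(f => (f.Dest.getOrElse("UNKNOWN"), 1))
  .reduceByKey(_ + _)
flightsPerDest.take(5).foreach(println)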

Use Spark SQL

import org.apache.spark.sql.SQLContext

// We need to create a SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._

airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"

// Invoke the query using the registered context and table
val result = sql(query)
result.count

// The SQL result should match the RDD filter above
assert(result.count == flightsOnlyToSF.count)
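Any query the Spark SQL dialect of the time supports can be issued the same way; a hypothetical aggregate over the registered table, for instance:

// Illustrative only: flight counts per carrier
val perCarrier = sql("SELECT UniqueCarrier, COUNT(*) FROM airlinesTable GROUP BY UniqueCarrier")
perCarrier.collect.foreach(println)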

Launch H2O Algorithms

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Set up deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result(
  'Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
  'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
  'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder
val dl = new DeepLearning(dlParams)

// Launch the computation and wait for the resulting model
val dlModel = dl.train.get

Make a prediction

// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))
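A quick way to eyeball the scored values (illustrative, not from the slides):

// predictionValues is a plain Scala array after collect, so standard
// collection operations apply
predictionValues.take(10).foreach(println)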

What is under the hood?

Spark App Extension

/** Notion of a Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start the extension */
  def start(conf: SparkConf): Unit
  /** Method to stop the extension */
  def stop(conf: SparkConf): Unit
  /** Point in the Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /** User-friendly description of the extension */
  def desc: String

  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only the executor life cycle is supported.
  */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC = Value // inject into the executor lifecycle
}
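To show how the trait is meant to be used, here is a minimal hypothetical implementation; the LoggingExtension name and its messages are illustrative, not part of the Sparkling Water code:

import org.apache.spark.SparkConf

// Hypothetical example: a no-op extension that just logs its lifecycle;
// intercept keeps its default value, InterceptionPoints.EXECUTOR_LC
class LoggingExtension extends PlatformExtension {
  def start(conf: SparkConf): Unit = println(s"[$desc] starting")
  def stop(conf: SparkConf): Unit  = println(s"[$desc] stopping")
  def desc: String = "Logging extension"
}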

Using App Extensions

val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")

// Set up the expected size of the H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add the H2O extension
conf.addExtension[H2OPlatformExtension]

// Create the Spark context
val sc = new SparkContext(conf)

Spark Changes

We keep them small (~30 lines of code)

JIRA SPARK-3270 - Platform App Extensions

https://issues.apache.org/jira/browse/SPARK-3270

You can participate! Epic PUBDEV-21, aka Sparkling Water:

PUBDEV-23 Test HDFS reader

PUBDEV-26 Implement toSchemaRDD

PUBDEV-27 Boolean transfers

PUBDEV-31 Support toRDD[ X <: Numeric]

PUBDEV-32/33 Mesos/YARN support

More info: Check out the 0xdata blog for tutorials

http://0xdata.com/blog/

Check out the 0xdata YouTube channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/0xdata/h2o-dev

https://github.com/0xdata/perrier

Learn more about H2O at 0xdata.com, or clone the repositories:

neo> for r in h2o-dev perrier; do
       git clone "git@github.com:0xdata/$r.git"
     done

Thank you! Follow us at @hexadata