2014 09 30_sparkling_water_hands_on
Uploaded by 0xdata
Sparkling Water: "Killer App for Spark"
@hexadata & @mmalohlava present
Spark and H2O
Several months ago…
Sparkling Water
Before
Tachyon based
Unnecessary data duplication
Now
Pure H2ORDD
Transparent use of H2O data and algorithms with Spark API
Sparkling Water
[Diagram: RDD (immutable world) + DataFrame (mutable world)]
Sparkling Water
[Diagram: RDD and DataFrame]
Sparkling Water Design
[Diagram: the Sparkling App jar file is passed via spark-submit to the Spark Master JVM, which is shown with three Spark Worker JVMs. The Sparkling Water Cluster consists of three Spark Executor JVMs, each with H2O inside.]
Data Distribution
[Diagram: a Sparkling Water Cluster of three Spark Executor JVMs, each with H2O inside. Labels: Data Source (e.g. HDFS), H2O RDD, Spark RDD.]
Hands-on Time
Example
Load and parse CSV data
Use Spark API, do SQL query
Create Deep Learning model
Use model for prediction
Requirements
Linux or Mac OS X
Oracle Java 1.7
Virtual image is provided for Windows users
Download
http://0xdata.com/download/
Install and Launch
Unpack the zip file
or
Open the provided virtual image in VirtualBox
and launch h2o-examples/sparkling-shell
What is Sparkling Shell?
Standard spark-shell
Launch H2O extension
    export MASTER="local-cluster[3,2,1024]"
    spark-shell \
      --jars shaded.jar \
      --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
MASTER: Spark Master address
--jars: JAR containing H2O code
--conf spark.extensions: Name of H2O extension provided by JAR
…more on launching…
‣ By default single JVM, multi-threaded (export MASTER="local[*]") or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster or
‣ Launch standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
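The three numbers in local-cluster[3,2,1024] are read by Spark as number of workers, cores per worker, and memory per worker in MB. A minimal Scala sketch that decodes such a master string makes this concrete; LocalClusterSpec and parse are hypothetical helper names used only for illustration and are not part of Spark or Sparkling Water.

```scala
// Hypothetical helper, not part of Spark: decodes a local-cluster master string.
final case class LocalClusterSpec(workers: Int, coresPerWorker: Int, memoryPerWorkerMb: Int)

object LocalClusterSpec {
  // Matches e.g. "local-cluster[3,2,1024]", allowing spaces around the numbers.
  private val Pattern = """local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]""".r

  // Returns None for any other master form, such as "local[*]" or "spark://host:7077".
  def parse(master: String): Option[LocalClusterSpec] = master match {
    case Pattern(w, c, m) => Some(LocalClusterSpec(w.toInt, c.toInt, m.toInt))
    case _                => None
  }
}
```

With the value used in this deck, parse yields 3 workers with 2 cores and 1024 MB each.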
Let's play with Sparkling Shell…
Create H2O Client
    import water.{H2O, H2OClientApp}
    // Start an H2O client in the shell JVM
    H2OClientApp.start()
    // Wait for a cloud of 3 H2O nodes (timeout 10000 ms)
    H2O.waitForCloudSize(3, 10000)
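The waitForCloudSize(3, 10000) call presumably blocks until three H2O nodes have joined or 10000 ms have passed. The sketch below shows that wait-with-timeout pattern in plain Scala; CloudWait, waitForSize, and the currentSize callback are hypothetical names, and this is not H2O's implementation.

```scala
// Hypothetical sketch of a wait-until-size-or-timeout loop; not H2O code.
object CloudWait {
  // Polls `currentSize` until it reaches `expected` or `timeoutMs` elapses.
  // Returns true if the expected size was reached.
  def waitForSize(expected: Int, timeoutMs: Long, pollMs: Long = 10)(currentSize: () => Int): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (currentSize() < expected && System.currentTimeMillis() < deadline) {
      Thread.sleep(pollMs)
    }
    currentSize() >= expected
  }
}
```

A caller can treat a false result as a failed cluster start instead of continuing with fewer nodes than expected.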
Is Spark running?
http://localhost:4040
Is H2O running?
http://localhost:54321/steam/index.html
Data
Load some data and parse it
    import java.io.File
    import org.apache.spark.examples.h2o._
    import org.apache.spark.h2o._
    val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"
    // Create DataFrame - involves parse of data
    val airlinesData = new DataFrame(new File(dataFile))
Where is the data?
Go to http://localhost:54321/steam/index.html
Use Spark API
    import org.apache.spark.rdd.RDD
    // H2O Context provides useful implicits for conversions
    val h2oContext = new H2OContext(sc)
    import h2oContext._
    // Create RDD wrapper around DataFrame
    val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
    airlinesTable.count
    // And use Spark RDD API directly
    val flightsOnlyToSF = airlinesTable.filter(f =>
      f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
    flightsOnlyToSF.count
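Because Dest is compared against Some(...), the column is an Option and rows with a missing destination never match. The same predicate can be tried on an ordinary Scala collection without a cluster. In this sketch Flight is a hypothetical stand-in for the Airlines row type that models only Dest, and BayAreaFilter is an illustrative name.

```scala
// Hypothetical stand-in for the Airlines row type; only the Dest column is modelled.
final case class Flight(Dest: Option[String])

object BayAreaFilter {
  private val bayArea = Set("SFO", "SJC", "OAK")

  // Same predicate as the RDD filter above; a missing Dest (None) never matches.
  def toBayArea(f: Flight): Boolean = f.Dest.exists(bayArea.contains)

  def main(args: Array[String]): Unit = {
    val flights = Seq(Flight(Some("SFO")), Flight(Some("JFK")), Flight(None), Flight(Some("OAK")))
    println(flights.count(toBayArea)) // prints 2
  }
}
```

The same function could be passed to an RDD's filter, since it only depends on the row value.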
Use Spark SQL
    import org.apache.spark.sql.SQLContext
    // We need to create SQL context
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    airlinesTable.registerTempTable("airlinesTable")

    val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
    // Invoke query
    val result = sql(query) // Using a registered context and tables
    result.count

    assert(result.count == flightsOnlyToSF.count)
Launch H2O Algorithms
    import hex.deeplearning._
    import hex.deeplearning.DeepLearningModel.DeepLearningParameters
    // Setup deep learning parameters
    val dlParams = new DeepLearningParameters()
    dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
      'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
      'Distance, 'IsDepDelayed)
    dlParams.response_column = 'IsDepDelayed.name
    // Create a new model builder
    val dl = new DeepLearning(dlParams)
    val dlModel = dl.train.get
Make a prediction
    // Use model to score data
    val prediction = dlModel.score(result)('predict)
    // Collect predicted values via RDD API
    val predictionValues = toRDD[DoubleHolder](prediction)
      .collect
      .map(_.result.getOrElse("NaN"))
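One detail of the last step: getOrElse("NaN") mixes Double and String, so the collected values are typed as Any. If numeric results are wanted, Double.NaN keeps the element type as Double. The sketch below shows this on a plain collection; Prediction is a hypothetical stand-in for DoubleHolder, assumed here to wrap an Option[Double] as the getOrElse call suggests.

```scala
// Hypothetical stand-in for DoubleHolder: one optional prediction per row.
final case class Prediction(result: Option[Double])

object PredictionValues {
  // Missing predictions become Double.NaN, so the element type stays Double.
  def values(rows: Seq[Prediction]): Seq[Double] =
    rows.map(_.result.getOrElse(Double.NaN))
}
```

Missing values can then be detected with isNaN and the rest used directly in arithmetic.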
What is under the hood?
Spark App Extension
    /** Notion of Spark application platform extension. */
    trait PlatformExtension extends Serializable {
      /** Method to start extension */
      def start(conf: SparkConf): Unit
      /** Method to stop extension */
      def stop(conf: SparkConf): Unit
      /* Point in Spark infrastructure which will be intercepted by this extension. */
      def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
      /* User-friendly description of extension */
      def desc: String
      override def toString = s"$desc@$intercept"
    }

    /** Supported interception points.
      *
      * Currently only Executor life cycle is supported.
      */
    object InterceptionPoints extends Enumeration {
      type InterceptionPoints = Value
      val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
    }
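To see what implementing such an extension involves, here is a self-contained sketch that runs without Spark. It is built on assumptions: Conf is a plain Map standing in for SparkConf, the trait is a simplified copy of the one on the slide, and RecordingExtension is a hypothetical extension that only records its lifecycle calls. It is not the H2OPlatformExtension.

```scala
object ExtensionSketch {
  // Stand-in for SparkConf so the sketch runs without Spark on the classpath (assumption).
  type Conf = Map[String, String]

  object InterceptionPoints extends Enumeration {
    val EXECUTOR_LC = Value // inject into executor lifecycle
  }

  // Simplified copy of the trait from the slide, with Conf replacing SparkConf.
  trait PlatformExtension extends Serializable {
    def start(conf: Conf): Unit
    def stop(conf: Conf): Unit
    def intercept: InterceptionPoints.Value = InterceptionPoints.EXECUTOR_LC
    def desc: String
    override def toString = s"$desc@$intercept"
  }

  // Hypothetical extension that only records the lifecycle calls it receives.
  class RecordingExtension extends PlatformExtension {
    val events = scala.collection.mutable.ArrayBuffer.empty[String]

    def start(conf: Conf): Unit = {
      val size = conf.getOrElse("spark.h2o.cluster.size", "?")
      events += s"start(size=$size)"
    }

    def stop(conf: Conf): Unit = events += "stop"

    def desc: String = "recording"
  }
}
```

An implementation supplies start, stop, and desc; intercept defaults to the executor life cycle, which the slide names as the only supported interception point.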
Using App Extensions
    val conf = new SparkConf()
      .setAppName("Sparkling H2O Example")
    // Setup expected size of H2O cloud
    conf.set("spark.h2o.cluster.size", h2oWorkers)
    // Add H2O extension
    conf.addExtension[H2OPlatformExtension]
    // Create Spark Context from the configuration
    val sc = new SparkContext(conf)
Spark Changes
We keep them small (~30 lines of code)
JIRA SPARK-3270 - Platform App Extensions
https://issues.apache.org/jira/browse/SPARK-3270
You can participate!
Epic PUBDEV-21 aka Sparkling Water
PUBDEV-23 Test HDFS reader
PUBDEV-26 Implement toSchemaRDD
PUBDEV-27 Boolean transfers
PUBDEV-31 Support toRDD[ X <: Numeric]
PUBDEV-32/33 Mesos/YARN support
More info
Check out the 0xdata blog for tutorials
http://0xdata.com/blog/
Check out the 0xdata YouTube channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
Thank you!
Learn more about H2O at 0xdata.com
or
    neo> for r in h2o-dev perrier; do
      git clone "[email protected]:0xdata/$r.git"
    done
Follow us at @hexadata