Spark + H2O = Machine Learning at scale

Mateusz Dymczyk, Software Engineer

Machine Learning with Spark Tokyo, 30.06.2016

Agenda

• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos

Spark

What is Spark?

• Fast and general engine for large-scale data processing
• APIs in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures (see the example below)

*http://spark.apache.org/
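A minimal sketch of the Scala API, runnable in spark-shell (where sc is provided); the input path is illustrative. Each transformation returns a new, immutable RDD:

val lines = sc.textFile("input.txt")
val counts = lines
  .flatMap(_.split("\\s+"))   // split each line into words
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // sum the counts per word
counts.take(10).foreach(println)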

Architecture

*http://spark.apache.org/docs/latest/cluster-overview.html

Why Spark?

• In-memory computation (fast)
• Ability to cache (intermediate) results in memory (or on disk); see the sketch below
• Easy API
• Plenty of out-of-the-box libraries

*http://spark.apache.org/docs/latest/mllib-guide.html
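A small sketch of the caching point (file name and column layout are illustrative): caching the parsed RDD lets two downstream actions reuse it instead of re-reading and re-parsing the file.

// Assuming an existing SparkContext `sc`, e.g. in spark-shell
val parsed = sc.textFile("events.csv")
  .map(_.split(","))
  .cache()                            // keep the parsed rows in memory after first use

val total  = parsed.count()           // first action computes and caches
val errors = parsed.filter(_(1) == "error").count()  // second action reuses the cache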

MLlib

• Spark’s machine learning library
• Supports:
  • basic statistics
  • classification and regression
  • clustering
  • dimensionality reduction
  • evaluations
  • …

*http://spark.apache.org/docs/latest/mllib-guide.html

Linear regression demo

// imports

// Input CSV format:
// V1,V2,V3,R
// 1,1,1,0.1
// 1,0,1,0.5

val sc: SparkContext = initContext()
val data = sc.textFile(...)
val parsedData: RDD[LabeledPoint] = data.map { line =>
  // parsing
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

*http://spark.apache.org/docs/latest/mllib-linear-methods.html


But…

• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning!)?
• What about visualisations?

*http://spark.apache.org/docs/latest/mllib-guide.html

H2O

What is H2O?

Math platform • API • Big data focused

• Open source
• Set of math and predictive algorithms
  • GLM, Random Forest, GBM, Deep Learning etc.
• Written in high-performance Java - native Java API
• Drivers for R, Python, Excel, Tableau
• REST API
• Highly parallel and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Based on mutable data structures

FlowUI

• Notebook-style open source interface for H2O
• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document

Why H2O?

• Speed and accuracy

• Algorithms/functionality not present in MLlib

• Access to FlowUI

• Possibility to generate dependency-free (plain Java POJO) models

• Option to checkpoint models (though not all) and continue learning in the future

Sparkling Water

What is Sparkling Water?

• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with the Spark API and vice versa (sketched below)
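A minimal sketch of that transparency, assuming a running SparkContext sc (e.g. in the Sparkling Shell); method names follow the Sparkling Water 1.x API used later in this talk, and the case class and values are illustrative:

import org.apache.spark.h2o._

case class Flight(distance: Double, delayed: Double)

val h2oContext = new H2OContext(sc).start()
import h2oContext._

// Spark -> H2O: publish an RDD as an H2OFrame
val rdd = sc.parallelize(Seq(Flight(100, 0), Flight(2000, 1)))
val hf: H2OFrame = h2oContext.asH2OFrame(rdd)

// H2O -> Spark: read the frame back as an RDD of the same case class
val back = h2oContext.asRDD[Flight](hf)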

Common use-cases

Modeling

• Pipeline: Data Source → ETL → Modelling → Predictions
• H2O drives the modelling step: Deep Learning, GBM, DRF, GLM, PCA, Ensembles etc.

ETL

• Pipeline: Data Source → ETL → Modelling → Predictions
• Spark drives the ETL step

Stream Processing

• Pipeline: Data Stream → ETL → Modelling → Predictions
• The stream is fed by Spark Streaming/Storm/Flink etc.

Demo #1: Sparkling Shell

REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set

INSTALLATION
1. http://www.h2o.ai/download
2. set MASTER env
3. unzip
4. run bin/sparkling-shell

DEV FLOW
1. create a script file containing application code
2. run with bin/sparkling-shell -i script_name.script.scala

OR

1. run bin/sparkling-shell and simply use the REPL

import org.apache.spark.h2o._

// sc - SparkContext already provided by the shell
val h2oContext = new H2OContext(sc).start()
import h2oContext._

// Application logic
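As an illustrative continuation in the shell (not shown in the talk): publish an RDD as an H2OFrame and inspect it in the Flow UI, which ties this demo to Demo #2. openFlow is taken from the Sparkling Water documentation of this era; treat it as an assumption.

// Illustrative: build a trivial H2OFrame from an RDD of doubles
val raw = sc.parallelize(1 to 1000).map(_.toDouble)
val hf: H2OFrame = h2oContext.asH2OFrame(raw)

// Assumption: opens the Flow UI (Demo #2) in a browser
h2oContext.openFlow()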

Airline delay classification

Model predicting flight delays

Pipeline: ETL → Modelling → Predictions

• load data from CSVs
• use Spark APIs to filter and join data
• model using H2O’s GBM (sketch below)

*https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
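A hedged sketch of that flow, loosely following the airlines script linked above; the file name, response column and parameters are illustrative rather than exact. Runs in the Sparkling Shell after the H2OContext setup from Demo #1:

import java.io.File
import water.fvec.H2OFrame
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// ETL: parse the CSV straight into an H2OFrame
val airlines = new H2OFrame(new File("allyears2k_headers.csv"))

// Modelling: configure and train a GBM
val gbmParams = new GBMParameters()
gbmParams._train = airlines._key
gbmParams._response_column = "IsDepDelayed"
gbmParams._ntrees = 100
val model = new GBM(gbmParams).trainModel.get

// Predictions: score a frame (here, the training frame, for illustration)
val predictions = model.score(airlines)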

Gradient Boosting Machines

• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits each new model to the residuals of the current ensemble (gradient boosting)
• Stochastic (trees can be fit on random subsamples of the data)
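For reference, the standard formulation behind "fits each new model to the residuals" (a textbook sketch of gradient boosting, not H2O-specific): at stage m a weak learner h_m is fit to the negative gradient of the loss L, then added with a shrinkage (learning rate) factor \nu:

F_0(x) = \arg\min_{\gamma} \sum_i L(y_i, \gamma)

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}

F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad h_m \text{ fit to } \{r_{im}\}

For squared-error loss the negative gradient is exactly the ordinary residual y_i - F_{m-1}(x_i).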

Demo #2: FlowUI

Demo #3: Standalone app

REQUIREMENTS

• git
• editor of choice (IntelliJ/Eclipse supported)

BOOTSTRAP

1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
   – ./gradlew idea
   – ./gradlew eclipse
   – import the project in the IDE
4. develop your app

DEPLOYMENT

1. build: ./gradlew build shadowJar
2. submit with:

$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar
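The class name water.droplets.SWTokyoDemo comes from the spark-submit command above; its body is not shown in the talk, so the following is an illustrative skeleton only:

package water.droplets

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o._

// Illustrative skeleton: a standalone Sparkling Water app entry point
object SWTokyoDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SWTokyoDemo"))

    // Start H2O on top of the Spark cluster
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._

    // ... ETL, modelling and predictions go here ...

    sc.stop()
  }
}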

Open Source

• GitHub:

https://github.com/h2oai/sparkling-water

• JIRA:

http://jira.h2o.ai

• Google groups:

https://groups.google.com/forum/?hl=en#!forum/h2ostream

More Info

• Documentation and booklets:

http://www.h2o.ai/docs/

• H2O.ai blog:

http://h2o.ai/blog

• H2O.ai YouTube channel:

https://www.youtube.com/user/0xdata

@h2oai

http://www.h2o.ai

Thank you!

@mdymczyk

Mateusz Dymczyk

mateusz@h2o.ai

Q&A