Productionizing H2O Models using Sparkling Water by Jakub Hava
Bio
• Sr. Software Engineer at H2O.ai, Sparkling Water Core Team
• Finished Master's at Charles University in Prague
• Implemented a high-performance cluster monitoring tool for JVM-based languages (JNI, JVMTI, instrumentation)
• Enjoys climbing & tea
Distributed Sparkling Team
• Michal - Mountain View, CA
• Kuba - Prague, CZ
• Mateusz - Tokyo, JP
H2O + Spark = Sparkling Water
Sparkling Water
• Transparent integration of H2O with the Spark ecosystem - MLlib and H2O side-by-side
• Transparent use of H2O data structures and algorithms with the Spark API
• Platform for building Smarter Applications
• Excels in existing Spark workflows requiring advanced Machine Learning algorithms
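The "transparent integration" above boils down to one entry point and two conversion calls. A minimal Scala sketch, assuming a running `SparkSession` named `spark` and the Sparkling Water assembly on the classpath; the CSV file name is a placeholder:

```scala
import org.apache.spark.h2o.H2OContext

// Start (or attach to) the H2O cluster alongside Spark
val hc = H2OContext.getOrCreate(spark)

// Move data between the two worlds
val df = spark.read.option("header", "true").csv("iris.csv") // example file
val h2oFrame = hc.asH2OFrame(df)         // Spark DataFrame -> H2OFrame
val backAgain = hc.asDataFrame(h2oFrame) // H2OFrame -> Spark DataFrame
```

Once both representations are available, MLlib and H2O algorithms can be mixed freely in one workflow.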
Benefits
• Additional algorithms - NLP
• Powerful data munging
• ML Pipelines
• Advanced algorithms
  • speed & accuracy
  • advanced parameters
• Fully distributed and parallelized
• Graphical environment
• R/Python interface
How to Use Sparkling Water
Start Spark with Sparkling Water
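A common way to do this is the `sparkling-shell` script shipped with the Sparkling Water distribution, or a plain `spark-submit` with the assembly jar; the memory setting, jar names, and app name below are illustrative:

```shell
# Option 1: the shell script bundled with the Sparkling Water distribution
bin/sparkling-shell --conf "spark.executor.memory=2g"

# Option 2: plain spark-submit with the Sparkling Water assembly jar
spark-submit --jars sparkling-water-assembly.jar my-sparkling-app.jar
```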
Model Building
• Data Source → Data munging → Modeling → Prediction processing
• Algorithms: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Data Munging
• Data Source → Data load/munging/exploration → Modeling
Stream Processing
• Offline model training: Data Source → Data munging → Modeling → Export model in a binary format or as code
• Deploy the model
• Stream processing: Data Stream → Model prediction (Spark Streaming/Storm/Flink)
Modeling & Scoring
• POJO (Plain Old Java Object)
• MOJO (Model Object, Optimized)
• No runtime dependency on the H2O framework
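Scoring an exported MOJO needs only the lightweight h2o-genmodel library, not a running H2O cluster. A minimal Scala sketch, assuming h2o-genmodel is on the classpath; the model file name and column names are placeholders:

```scala
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Load the exported MOJO zip and wrap it for easy row-by-row scoring
val model = new EasyPredictModelWrapper(MojoModel.load("gbm_model.zip"))

// Fill one input row (column names must match the training frame)
val row = new RowData()
row.put("sepal_len", "5.1")
row.put("sepal_wid", "3.5")

// For a binomial model; other model types have analogous predict* methods
val prediction = model.predictBinomial(row)
println(prediction.label)
```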
Features Overview
Scala Code in H2O Flow
• New type of cell
• Access Spark from the Flow UI
• Experimenting made easy
H2O Frame as Spark’s Datasource
• Use the native Spark API to load and save data
• Spark can optimize queries when loading data from an H2O Frame
• Uses the Spark query optimizer
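The data source integration looks like any other Spark format. A sketch, assuming Sparkling Water has registered the `h2o` format and its `key` option; the frame keys are examples:

```scala
// Read an existing H2OFrame (identified by its key) as a Spark DataFrame
val df = spark.read.format("h2o").option("key", "iris.hex").load()

// Save a DataFrame back as an H2OFrame
df.write.format("h2o").option("key", "iris_copy.hex").save()
```

Because this goes through the standard data source API, Spark's query optimizer can push work down when reading.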
Machine Learning Pipelines
• Wrap our algorithms as Transformers and Estimators
• Support for embedding them into Spark ML Pipelines
• Can serialize fitted/unfitted pipelines
• Unified API => arguments are set in the same way for Spark and H2O models
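Embedding an H2O estimator in a Spark ML pipeline can be sketched as follows; note that the package of `H2OGBM` has moved between Sparkling Water releases, so the import is version-dependent, and `trainingDf` is a placeholder DataFrame:

```scala
import org.apache.spark.ml.Pipeline
// H2OGBM is Sparkling Water's GBM estimator (package varies by release:
// org.apache.spark.ml.h2o.algos in older versions, ai.h2o.sparkling.ml.algos later)

val gbm = new H2OGBM().setLabelCol("label") // same setter style as Spark estimators
val pipeline = new Pipeline().setStages(Array(gbm))

val model = pipeline.fit(trainingDf)
// Fitted pipelines serialize like any Spark ML model
model.write.overwrite().save("/tmp/fitted-pipeline")
```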
MLlib Algorithms in Flow UI
• Can examine them in H2O Flow
• Can generate a POJO/MOJO out of them
Pure MLlib Algo from Flow
[Diagram: the H2O Flow web UI issues an MLlib SVM algorithm API call over the REST API (TCP port 54321); a REST API facade and adapter implementation inside Sparkling Water forward the call to the MLlib implementation in Spark.]
PySparkling Made Easy
• PySparkling is on PyPI
• Contains all H2O and Sparkling Water dependencies, no need to worry about them
• Just add it to your Python path and that's it
• Python-independent PySparkling distribution - implemented by [SW-341]
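Installation is a single pip command; note that the PyPI package name encodes the Spark major.minor version it was built against, so the version suffix below is an example:

```shell
# PySparkling package for Spark 2.1 (pick the suffix matching your Spark)
pip install h2o_pysparkling_2.1
```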
RSparkling
• Sparkling Water for R
• Based on the sparklyr package
• Independent of the Spark and Sparkling Water versions
And others!
• Support for Datasets
• Zeppelin notebook support
• XGBoost support (local mode so far)
• Support for high-cardinality fast joins
• Secure communication - SSL
• Support for sparse data conversions
• Bug fixes
Upcoming Features
• Support for more MLlib algorithms in Flow
• Python cell in H2O Flow
• Integration with Steam
• More advanced pipelines with MOJOs
What’s inside?
Two Backends
• Sparkling Water has two backends
  • External
  • Internal
• The backend determines where the H2O cluster is located
• Each backend is good for different use cases
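The backend can be selected programmatically on the `H2OConf` before creating the context. A sketch, assuming a running `SparkSession` named `spark`; `setExternalClusterMode`/`setInternalClusterMode` are the Sparkling Water configuration setters:

```scala
import org.apache.spark.h2o.{H2OConf, H2OContext}

// Choose the external backend (internal is the default)
val conf = new H2OConf(spark).setExternalClusterMode()
val hc = H2OContext.getOrCreate(spark, conf)
```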
Internal Backend
[Diagram: spark-submit sends the Sparkling App jar to the Spark Master JVM, which distributes work to the Spark Worker JVMs; the Sparkling Water cluster consists of an H2O instance running inside each Spark Executor JVM.]
Internal Backend
[Diagram: the driver node runs SparkContext and H2OContext; each worker node in the cluster runs a Spark executor.]
Data Transfers
[Diagram: inside the Spark cluster, each Spark Executor hosts H2O services; data loaded from a DataSource can be shared between a Spark DataFrame and an H2OFrame without leaving the cluster.]
Pros & Cons
• Advantages
  • Easy to configure
  • Faster (no need to send data to a different cluster)
• Disadvantages
  • When Spark kills an executor or adds a new one, H2O goes down
  • No way to discover all executors
External Backend
Overview
• Sparkling Water uses an external H2O cluster instead of starting H2O in each executor
• Spark executors can come and go and H2O won't be affected
• Can start the H2O cluster on YARN automatically
Separation Approach
• Separating Spark and H2O, but preserving the same API:
val h2oContext = H2OContext.getOrCreate("http://h2o:54321")
• Spark and H2O can be submitted as YARN jobs and controlled separately
• But H2O still needs a non-elastic environment (H2O itself does not implement HA yet)
Sparkling Water External Backend
[Diagram: spark-submit sends the Sparkling App jar to the Spark Master JVM and the Spark Worker JVMs as before, but the H2O nodes now run in a separate H2O cluster outside the Spark Executor JVMs.]
External Backend
[Diagram: the driver node runs SparkContext and H2OContext; Spark executors run on the worker nodes, while the H2O cluster runs separately.]
Pros & Cons
• Advantages
  • H2O does not crash when a Spark executor goes down
  • Better resource management, since resources can be planned per tool
• Disadvantages
  • Transfer overhead between the Spark and H2O processes
    • Under measurement in cooperation with a customer
Modes
• Auto Start Mode
  • Starts the H2O cluster automatically on YARN
• Manual Start Mode
  • User is responsible for starting the cluster manually
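In manual start mode, Sparkling Water is pointed at the already-running H2O cloud through Spark configuration properties. A sketch; the property names follow the Sparkling Water documentation, and the cloud name, address, and jar name are example values:

```shell
# Manual start mode: attach to an H2O cloud that was started separately
spark-submit \
  --conf spark.ext.h2o.backend.cluster.mode=external \
  --conf spark.ext.h2o.cloud.name=my-h2o-cloud \
  --conf spark.ext.h2o.cloud.representative=10.0.0.5:54321 \
  my_app.jar
```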
Demo
More Info
• Documentation: http://docs.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slides: https://github.com/h2oai/h2o-meetups
• Videos: https://www.youtube.com/user/0xdata
• Events & Meetups: http://h2o.ai/events
• Stack Overflow: https://stackoverflow.com/tags/sparkling-water
• Google Group: https://tinyurl.com/h2ostream
• Gitter: http://gitter.im/h2oai/sparkling-water