Productionizing H2O Models using Sparkling Water by Jakub Hava

39
Productionizing H2O Models using Sparkling Water Jakub Háva [email protected] Webinar February 6, 2018

Transcript of Productionizing H2O Models using Sparkling Water by Jakub Hava

Page 1: Productionizing H2O Models using Sparkling Water by Jakub Hava

Productionizing H2O Models using Sparkling

Water

Jakub Há[email protected]

WebinarFebruary 6, 2018

Page 2: Productionizing H2O Models using Sparkling Water by Jakub Hava

Bio

• Sr. Software Engineer at H2O.ai, Sparkling Water Core Team • Finished Master’s at Charles University in Prague • Implemented high-performance cluster monitoring tool for

JVM based languages (JNI, JVMTI, instrumentation) • Enjoy climbing & tea

Page 3: Productionizing H2O Models using Sparkling Water by Jakub Hava

Distributed Sparkling Team

• Michal - Mountain View, CA • Kuba - Prague, CZ • Mateusz – Tokyo, JP

Page 4: Productionizing H2O Models using Sparkling Water by Jakub Hava

H2O+Spark = Sparkling

Water

Page 5: Productionizing H2O Models using Sparkling Water by Jakub Hava

Sparkling Water

• Transparent integration of H2O with Spark ecosystem - MLlib and H2O side-by-side

• Transparent use of H2O data structures and algorithms with Spark API

• Platform for building Smarter Applications • Excels in existing Spark workflows requiring advanced

Machine Learning algorithms

Page 6: Productionizing H2O Models using Sparkling Water by Jakub Hava

Benefits

• Additional algorithms – NLP

• Powerful data munging • ML Pipelines

• Advanced algorithms

• speed & accuracy

• advanced parameters

• Fully distributed and parallelized

• Graphical environment

• R/Python interface

Page 7: Productionizing H2O Models using Sparkling Water by Jakub Hava

How to use Sparkling

Water

Page 8: Productionizing H2O Models using Sparkling Water by Jakub Hava

Start Spark with Sparkling Water

Page 9: Productionizing H2O Models using Sparkling Water by Jakub Hava

Model Building

Data Source

Data munging Modeling

Deep Learning, GBMDRF, GLM, GLRM

K-Means, PCACoxPH, Ensembles

Prediction processing

Page 10: Productionizing H2O Models using Sparkling Water by Jakub Hava

Data Munging

Data Source

Data load/munging/ exploration Modeling

Page 11: Productionizing H2O Models using Sparkling Water by Jakub Hava

Stream Processing

DataSourceO

fflin

e m

odel

trai

ning

Data munging

Model prediction

Deploy the model

Stre

ampr

oces

sing

Data Stream

Spark Streaming/Storm/Flink

Export modelin a binary format

or as code

Modeling

Page 12: Productionizing H2O Models using Sparkling Water by Jakub Hava

Scoring

• POJO o Plain Old Java Object

• MOJO o Model Object Optimized

• No runtime dependency on H2O framework

Page 13: Productionizing H2O Models using Sparkling Water by Jakub Hava

Features Overview

Page 14: Productionizing H2O Models using Sparkling Water by Jakub Hava

Scala Code in H2O Flow

• New type of cell • Access Spark from Flow UI • Experimenting made easy

Page 15: Productionizing H2O Models using Sparkling Water by Jakub Hava

H2O Frame as Spark’s Datasource

• Use native Spark API to load and save data • Spark can optimize the queries when loading data

from H2O Frame • Use of Spark query optimizer

Page 16: Productionizing H2O Models using Sparkling Water by Jakub Hava

Machine Learning Pipelines

• Wrap our algorithms as Transformers and Estimators • Support for embedding them into Spark ML

Pipelines • Can serialize fitted/unfitted pipelines • Unified API => Arguments are set in the same way

for Spark and H2O Models

Page 17: Productionizing H2O Models using Sparkling Water by Jakub Hava

MLlib Algorithms in Flow UI

• Can examine them in H2O Flow • Can generate POJO/MOJO out of them

Page 18: Productionizing H2O Models using Sparkling Water by Jakub Hava

Pure MLlib Algo from Flow

REST API FacadeREST API Adapter Impl

REST API TCP Port 54321

H2O Flow

Web UI

Mllib Impl

Spark

MLlib SVM Algorithm API Call

Sparkling Water

Page 19: Productionizing H2O Models using Sparkling Water by Jakub Hava

PySparkling Made Easy

• PySparkling is on PyPi • Contains all H2O and Sparkling Water

dependencies, no need to worry about them • Just add in on your Python path and that’s it • Python independent PySparkling distribution -

implemented by [SW-341]

Page 20: Productionizing H2O Models using Sparkling Water by Jakub Hava

RSparkling

• Sparkling Water for R • Based on Sparklyr package • Independent on Spark and Sparkling Water version

Page 21: Productionizing H2O Models using Sparkling Water by Jakub Hava

And others!

• Support for Datasets • Zeppelin notebook support • XGBoost Support (local mode so far) • Support for high-cardinality fast joins • Secure Communication - SSL • Support for Sparse Data conversions • Bug fixes

Page 22: Productionizing H2O Models using Sparkling Water by Jakub Hava

Upcoming Features

• Support for more MLlib algorithms in Flow • Python cell in the H2O Flow • Integration with Steam • More advanced pipelines with MOJOs

Page 23: Productionizing H2O Models using Sparkling Water by Jakub Hava

What’s inside?

Page 24: Productionizing H2O Models using Sparkling Water by Jakub Hava

Two Backends

• Sparkling Water has two backends o External o Internal

• The backend determines where the H2O cluster is located

• Each backend is good for different use cases

Page 25: Productionizing H2O Models using Sparkling Water by Jakub Hava

Internal Backend

Page 26: Productionizing H2O Models using Sparkling Water by Jakub Hava

Sparkling Water Internal Backend

Sparkling App

jar file

Spark Master

JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Page 27: Productionizing H2O Models using Sparkling Water by Jakub Hava

Internal Backend

Cluster

Worker node

Spark executor Spark executorSpark executor

Driver node

SparkContext

Worker nodeWorker node

H2OContext

Page 28: Productionizing H2O Models using Sparkling Water by Jakub Hava

Data Transfers

H2O

Ser

vice

sH

2O S

ervi

ces

DataSource

Spar

k Ex

ecut

orSp

ark

Exec

utor

Spar

k Ex

ecut

or

Spark Cluster

DataFrame

H2O

Ser

vice

s

H2OFrame

DataSource

Page 29: Productionizing H2O Models using Sparkling Water by Jakub Hava

Pros & Cons

• Advantages • Easy to configure • Faster (no need to send data to different cluster)

• Disadvantages • Spark kills or joins new executor => H2O goes

down • No way how to discover all executors

Page 30: Productionizing H2O Models using Sparkling Water by Jakub Hava

External Backend

Page 31: Productionizing H2O Models using Sparkling Water by Jakub Hava

Overview

• Sparkling Water is using external H2O cluster instead of starting H2O in each executor

• Spark executors can come and go and H2O won’t be affected

• Start H2O cluster on YARN automatically

Page 32: Productionizing H2O Models using Sparkling Water by Jakub Hava

Separation Approach

• Separating Spark and H2O• But preserving same API:

val h2oContext = H2OContext.getOrCreate(http://h2o:54321/)

• Spark and H2O can be submitted as a YARN job and controlled in separation• But H2O still needs non-elastic environment (H2O itself

does not implement HA yet)

Page 33: Productionizing H2O Models using Sparkling Water by Jakub Hava

Sparkling Water External Backend

Sparkling App

jar file

Spark Master JVM

spark-submit

Spark Worker

JVM

Spark Worker

JVM

Spark Worker

JVM

Sparkling Water Cluster

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Spark Executor JVM

H2O

H2O Cluster

H2O

Page 34: Productionizing H2O Models using Sparkling Water by Jakub Hava

External Backend

Cluster

Worker node

Spark executor Spark executorSpark executor

Driver node

SparkContext

Worker nodeWorker node

H2OContext

Page 35: Productionizing H2O Models using Sparkling Water by Jakub Hava

Pros & Cons

• Advantages• H2O does not crash when Spark executor goes down• Better resource management since resources can be

planned per tool

• Disadvantages• Transfer overhead between Spark and H2O processes

• Under measurement with cooperation of a customer

Page 36: Productionizing H2O Models using Sparkling Water by Jakub Hava

Modes

• Auto Start Mode • Start h2o cluster automatically on YARN

• Manual Start Mode • User is responsible for starting the cluster

manually

Page 37: Productionizing H2O Models using Sparkling Water by Jakub Hava

Demo

Page 38: Productionizing H2O Models using Sparkling Water by Jakub Hava

More Info

• Documentation: http://docs.h2o.ai

• Tutorials: https://github.com/h2oai/h2o-tutorials

• Slides: https://github.com/h2oai/h2o-meetups

• Videos: https://www.youtube.com/user/0xdata

• Events & Meetups: http://h2o.ai/events

• Stack Overflow: https://stackoverflow.com/tags/sparkling-water

• Google Group: https://tinyurl.com/h2ostream

• Gitter: http://gitter.im/h2oai/sparkling-water

Page 39: Productionizing H2O Models using Sparkling Water by Jakub Hava

Learn more at h2o.ai Follow us at @h2oai

Thank you!