Productionizing H2O Models using Sparkling Water by Jakub Hava
Bio
• Sr. Software Engineer at H2O.ai, Sparkling Water Core Team
• Finished Master's at Charles University in Prague
• Implemented a high-performance cluster monitoring tool for JVM-based languages (JNI, JVMTI, instrumentation)
• Enjoys climbing & tea
Distributed Sparkling Team
• Michal - Mountain View, CA
• Kuba - Prague, CZ
• Mateusz - Tokyo, JP
H2O + Spark = Sparkling Water
Sparkling Water
• Transparent integration of H2O with the Spark ecosystem - MLlib and H2O side-by-side
• Transparent use of H2O data structures and algorithms with the Spark API
• Platform for building Smarter Applications
• Excels in existing Spark workflows requiring advanced Machine Learning algorithms
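The "transparent integration" above boils down to one entry point and two conversion calls. A minimal Scala sketch, assuming a running `SparkSession` named `spark` and the Sparkling Water assembly on the classpath; the CSV file name is a placeholder:

```scala
import org.apache.spark.h2o.H2OContext

// Start (or attach to) the H2O cluster alongside Spark
val hc = H2OContext.getOrCreate(spark)

// Move data between the two worlds
val df = spark.read.option("header", "true").csv("iris.csv") // example file
val h2oFrame = hc.asH2OFrame(df)         // Spark DataFrame -> H2OFrame
val backAgain = hc.asDataFrame(h2oFrame) // H2OFrame -> Spark DataFrame
```

Once both representations are available, MLlib and H2O algorithms can be mixed freely in one workflow.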
Benefits
• Additional algorithms - NLP
• Powerful data munging
• ML Pipelines
• Advanced algorithms
  • speed & accuracy
  • advanced parameters
• Fully distributed and parallelized
• Graphical environment
• R/Python interface
How to Use Sparkling Water
Start Spark with Sparkling Water
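A common way to do this is the `sparkling-shell` script shipped with the Sparkling Water distribution, or a plain `spark-submit` with the assembly jar; the memory setting, jar names, and app name below are illustrative:

```shell
# Option 1: the shell script bundled with the Sparkling Water distribution
bin/sparkling-shell --conf "spark.executor.memory=2g"

# Option 2: plain spark-submit with the Sparkling Water assembly jar
spark-submit --jars sparkling-water-assembly.jar my-sparkling-app.jar
```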
Model Building
• Data Source → Data munging → Modeling → Prediction processing
• Algorithms: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Data Munging
• Data Source → Data load/munging/exploration → Modeling
Stream Processing
• Offline model training: Data Source → Data munging → Modeling → Export model in a binary format or as code
• Deploy the model
• Stream processing: Data Stream → Model prediction (Spark Streaming/Storm/Flink)
Modeling & Scoring
• POJO (Plain Old Java Object)
• MOJO (Model Object, Optimized)
• No runtime dependency on the H2O framework
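Scoring an exported MOJO needs only the lightweight h2o-genmodel library, not a running H2O cluster. A minimal Scala sketch, assuming h2o-genmodel is on the classpath; the model file name and column names are placeholders:

```scala
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Load the exported MOJO zip and wrap it for easy row-by-row scoring
val model = new EasyPredictModelWrapper(MojoModel.load("gbm_model.zip"))

// Fill one input row (column names must match the training frame)
val row = new RowData()
row.put("sepal_len", "5.1")
row.put("sepal_wid", "3.5")

// For a binomial model; other model types have analogous predict* methods
val prediction = model.predictBinomial(row)
println(prediction.label)
```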
Features Overview
Scala Code in H2O Flow
• New type of cell
• Access Spark from the Flow UI
• Experimenting made easy
H2O Frame as Spark’s Datasource
• Use the native Spark API to load and save data
• Spark can optimize queries when loading data from an H2O Frame
• Uses the Spark query optimizer
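The data source integration looks like any other Spark format. A sketch, assuming Sparkling Water has registered the `h2o` format and its `key` option; the frame keys are examples:

```scala
// Read an existing H2OFrame (identified by its key) as a Spark DataFrame
val df = spark.read.format("h2o").option("key", "iris.hex").load()

// Save a DataFrame back as an H2OFrame
df.write.format("h2o").option("key", "iris_copy.hex").save()
```

Because this goes through the standard data source API, Spark's query optimizer can push work down when reading.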
Machine Learning Pipelines
• Wrap our algorithms as Transformers and Estimators
• Support for embedding them into Spark ML Pipelines
• Can serialize fitted/unfitted pipelines
• Unified API => arguments are set in the same way for Spark and H2O models
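Embedding an H2O estimator in a Spark ML pipeline can be sketched as follows; note that the package of `H2OGBM` has moved between Sparkling Water releases, so the import is version-dependent, and `trainingDf` is a placeholder DataFrame:

```scala
import org.apache.spark.ml.Pipeline
// H2OGBM is Sparkling Water's GBM estimator (package varies by release:
// org.apache.spark.ml.h2o.algos in older versions, ai.h2o.sparkling.ml.algos later)

val gbm = new H2OGBM().setLabelCol("label") // same setter style as Spark estimators
val pipeline = new Pipeline().setStages(Array(gbm))

val model = pipeline.fit(trainingDf)
// Fitted pipelines serialize like any Spark ML model
model.write.overwrite().save("/tmp/fitted-pipeline")
```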
MLlib Algorithms in Flow UI
• Can examine them in H2O Flow
• Can generate a POJO/MOJO out of them
Pure MLlib Algo from Flow
[Diagram: the H2O Flow web UI issues an MLlib SVM algorithm API call over the REST API (TCP port 54321); a REST API facade and adapter implementation inside Sparkling Water forward the call to the MLlib implementation in Spark.]
PySparkling Made Easy
• PySparkling is on PyPI
• Contains all H2O and Sparkling Water dependencies, no need to worry about them
• Just add it to your Python path and that's it
• Python-independent PySparkling distribution - implemented by [SW-341]
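Installation is a single pip command; note that the PyPI package name encodes the Spark major.minor version it was built against, so the version suffix below is an example:

```shell
# PySparkling package for Spark 2.1 (pick the suffix matching your Spark)
pip install h2o_pysparkling_2.1
```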
RSparkling
• Sparkling Water for R
• Based on the sparklyr package
• Independent of the Spark and Sparkling Water versions
And others!
• Support for Datasets
• Zeppelin notebook support
• XGBoost support (local mode so far)
• Support for high-cardinality fast joins
• Secure communication - SSL
• Support for sparse data conversions
• Bug fixes
Upcoming Features
• Support for more MLlib algorithms in Flow
• Python cell in H2O Flow
• Integration with Steam
• More advanced pipelines with MOJOs
What’s inside?
Two Backends
• Sparkling Water has two backends
  • External
  • Internal
• The backend determines where the H2O cluster is located
• Each backend is good for different use cases
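The backend can be selected programmatically on the `H2OConf` before creating the context. A sketch, assuming a running `SparkSession` named `spark`; `setExternalClusterMode`/`setInternalClusterMode` are the Sparkling Water configuration setters:

```scala
import org.apache.spark.h2o.{H2OConf, H2OContext}

// Choose the external backend (internal is the default)
val conf = new H2OConf(spark).setExternalClusterMode()
val hc = H2OContext.getOrCreate(spark, conf)
```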
Internal Backend
[Diagram: spark-submit sends the Sparkling App jar to the Spark Master JVM, which distributes work to the Spark Worker JVMs; the Sparkling Water cluster consists of an H2O instance running inside each Spark Executor JVM.]
Internal Backend
[Diagram: the driver node runs SparkContext and H2OContext; each worker node in the cluster runs a Spark executor.]
Data Transfers
[Diagram: inside the Spark cluster, each Spark Executor hosts H2O services; data loaded from a DataSource can be shared between a Spark DataFrame and an H2OFrame without leaving the cluster.]
Pros & Cons
• Advantages
  • Easy to configure
  • Faster (no need to send data to a different cluster)
• Disadvantages
  • When Spark kills an executor or adds a new one, H2O goes down
  • No way to discover all executors
External Backend
Overview
• Sparkling Water uses an external H2O cluster instead of starting H2O in each executor
• Spark executors can come and go and H2O won't be affected
• Can start the H2O cluster on YARN automatically
Separation Approach
• Separating Spark and H2O, but preserving the same API:
val h2oContext = H2OContext.getOrCreate("http://h2o:54321")
• Spark and H2O can be submitted as YARN jobs and controlled separately
• But H2O still needs a non-elastic environment (H2O itself does not implement HA yet)
Sparkling Water External Backend
[Diagram: spark-submit sends the Sparkling App jar to the Spark Master JVM and the Spark Worker JVMs as before, but the H2O nodes now run in a separate H2O cluster outside the Spark Executor JVMs.]
External Backend
[Diagram: the driver node runs SparkContext and H2OContext; Spark executors run on the worker nodes, while the H2O cluster runs separately.]
Pros & Cons
• Advantages
  • H2O does not crash when a Spark executor goes down
  • Better resource management, since resources can be planned per tool
• Disadvantages
  • Transfer overhead between the Spark and H2O processes
    • Under measurement in cooperation with a customer
Modes
• Auto Start Mode
  • Starts the H2O cluster automatically on YARN
• Manual Start Mode
  • User is responsible for starting the cluster manually
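In manual start mode, Sparkling Water is pointed at the already-running H2O cloud through Spark configuration properties. A sketch; the property names follow the Sparkling Water documentation, and the cloud name, address, and jar name are example values:

```shell
# Manual start mode: attach to an H2O cloud that was started separately
spark-submit \
  --conf spark.ext.h2o.backend.cluster.mode=external \
  --conf spark.ext.h2o.cloud.name=my-h2o-cloud \
  --conf spark.ext.h2o.cloud.representative=10.0.0.5:54321 \
  my_app.jar
```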
Demo
More Info
• Documentation: http://docs.h2o.ai
• Tutorials: https://github.com/h2oai/h2o-tutorials
• Slides: https://github.com/h2oai/h2o-meetups
• Videos: https://www.youtube.com/user/0xdata
• Events & Meetups: http://h2o.ai/events
• Stack Overflow: https://stackoverflow.com/tags/sparkling-water
• Google Group: https://tinyurl.com/h2ostream
• Gitter: http://gitter.im/h2oai/sparkling-water