H2O Rains with Databricks Cloud - NY 02.16.16
![Page 1: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/1.jpg)
H2O Rains with Databricks Cloud
Michal Malohlava @mmalohlava
NYC Meetup 2016/02/16, SF
![Page 2: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/2.jpg)
Who Am I?
Background
• PhD in CS from Charles University in Prague, Czech Republic
• Postdoc at Purdue University experimenting with algos for large-scale computation
• Now SW engineer at H2O.ai
Experience with domain-specific languages, distributed systems, software engineering, and big data.
![Page 3: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/3.jpg)
H2O.ai
H2O team
Co-Founders: Sri Ambati, Cliff Click
Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie
![Page 4: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/4.jpg)
H2O
Open-Source In-Memory Data Science Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API with clients: interactive UI, R, Python
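The map/reduce bullet above can be illustrated with a minimal, single-process sketch. This is not H2O's Java implementation: the helper names (`chunk`, `map_chunk`, `reduce_pair`, `distributed_mean`) are hypothetical, and the chunks are plain Python lists standing in for pieces of a column held on different nodes.

```python
from functools import reduce


def chunk(values, chunk_size):
    """Split a column into fixed-size chunks (stand-ins for per-node data)."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(part):
    """Map step: summarize one chunk locally as (sum, count)."""
    return (sum(part), len(part))


def reduce_pair(a, b):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(values, chunk_size=3):
    """Mean of a non-empty column computed from per-chunk partial results."""
    partials = [map_chunk(part) for part in chunk(values, chunk_size)]
    total, count = reduce(reduce_pair, partials)
    return total / count


column = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
print(distributed_mean(column))
```

Only the small `(sum, count)` pairs travel between chunks, which is the point of the pattern: the raw data stays where it is stored.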
![Page 5: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/5.jpg)
H2O + Spark = Sparkling Water
![Page 6: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/6.jpg)
Open-source distributed execution platform
User-friendly API for data transformation based on RDDs, DataFrames (from 1.4), and Datasets (from 1.6)
Platform components - SQL, MLlib, text mining, Avro, Redshift, Kinesis.
Easily extendable by 3rd party packages
Interactive shell
Current release 1.6
Supported releases 1.3, 1.4, 1.5
![Page 7: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/7.jpg)
Databricks
• founded by the creators of Apache Spark
• still contribute 75% of the code to the Spark project
• cloud platform for running Spark in your AWS account
Databricks Platform
• integrated collaborative data science workspace
• notebook interface inspired by IPython and Zeppelin but purpose-built for Spark
• self-service cluster manager and job scheduler for production Spark workloads
![Page 8: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/8.jpg)
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa
![Page 9: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/9.jpg)
How to use Sparkling Water?
![Page 10: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/10.jpg)
Model Building
Data Source
Data munging
Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Prediction processing
![Page 11: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/11.jpg)
Data Munging
Data Source
Data load/munging/exploration
Modelling
![Page 12: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/12.jpg)
Stream processing
Off-line model training: Data Source, Data munging, Modelling
Deploy the model: Export model in a binary format or as code
Stream processing: Data Stream, Spark Streaming/Storm, Model prediction
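The flow on this slide (train off-line, export the model, score a live stream) can be sketched in plain Python. Everything here is an illustrative assumption: the keyword "model" is a toy stand-in for an H2O model, `pickle` stands in for the binary export format, and a Python iterable stands in for Spark Streaming or Storm. The function names are hypothetical.

```python
import pickle


def train(labeled_messages):
    """Off-line training: keep words seen only in spam messages (toy model)."""
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_messages:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return {"keywords": sorted(spam_words - ham_words)}


def export_model(model):
    """Deployment step: serialize the trained model to bytes."""
    return pickle.dumps(model)


def score_stream(model_bytes, stream):
    """Stream step: load the exported model once, then score each message."""
    keywords = set(pickle.loads(model_bytes)["keywords"])
    for message in stream:
        yield message, any(word in keywords for word in message.lower().split())


training = [("win a free prize now", True), ("are we meeting now", False)]
blob = export_model(train(training))

for message, is_spam in score_stream(blob, ["free prize inside", "meeting at noon"]):
    print(message, is_spam)
```

The training code and the scoring code share only the exported bytes, so the scoring side can run in a separate process or cluster from the one that built the model.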
![Page 13: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/13.jpg)
What is inside?
![Page 14: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/14.jpg)
Databricks
Driver node: Scala/Py main program, H2OContext, SparkContext
Worker node: Spark executor
Worker node: Spark executor
Worker node: Spark executor
![Page 15: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/15.jpg)
Spark Cluster: Spark Executors, each with H2O Services
Data Source
DataFrame
H2OFrame
h2oContext.asDataFrame
h2oContext.asH2OFrame
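One way to picture the two conversions named on this slide is as a change of layout: Spark DataFrames are collections of rows, while H2O frames store data column by column. The sketch below only illustrates that idea with plain Python structures. It does not call Sparkling Water, and `rows_to_columns` and `columns_to_rows` are hypothetical helpers, not part of the `h2oContext` API.

```python
def rows_to_columns(rows, names):
    """Row layout -> column layout (the direction asH2OFrame is pictured as)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def columns_to_rows(columns):
    """Column layout -> row layout (the direction asDataFrame is pictured as)."""
    return list(zip(*columns.values()))


rows = [("ham", 12), ("spam", 31)]
columns = rows_to_columns(rows, ["label", "length"])
print(columns)
print(columns_to_rows(columns))
```

The round trip returns the original rows, which mirrors the slide's message that data can move between the two representations in either direction.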
![Page 16: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/16.jpg)
DEMO Time!
![Page 17: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/17.jpg)
What do we need?
Databricks account (14-day free trial at www.databricks.com)
AWS account
Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10
And some cool machine learning idea!
![Page 18: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/18.jpg)
OR
Detect spam text messages
![Page 19: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/19.jpg)
Data sample
![Page 20: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/20.jpg)
Goal
For a given text message
identify if it is spam or not
![Page 21: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/21.jpg)
Machine Learning Workflow
1. Extract data
2. Transform, tokenize messages
3. Build TF-IDF model
4. Create and evaluate Deep Learning model
5. Use the model to detect spam
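Steps 2 and 3 of the workflow above can be sketched with the Python standard library alone. This is a rough illustration, not the demo's Spark code: the tokenizer regex, the function names, and the sample messages are made up for the example, and the smoothed formula idf = log((N + 1) / (df + 1)) is an assumed choice. Steps 4 and 5 (training and applying the Deep Learning model) are not covered here.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Step 2: lowercase the message and split it into word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def inverse_document_frequencies(tokenized_docs):
    """Step 3a: smoothed IDF per term, log((N + 1) / (df + 1))."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Step 3b: weight each term count by its IDF (unseen terms get 0.0)."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(tokens).items()}


messages = [
    "Free prize! Claim your free prize now",
    "See you at lunch",
    "Lunch is free today",
]
docs = [tokenize(m) for m in messages]
idf = inverse_document_frequencies(docs)
vector = tf_idf(docs[0], idf)
print(sorted(vector.items(), key=lambda kv: -kv[1]))
```

Terms that occur in many messages, such as "free" here, receive a lower weight than terms concentrated in a single message, such as "prize". The resulting vectors are the kind of numeric features a classifier would be trained on.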
![Page 22: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/22.jpg)
More info
Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
![Page 23: H2O Rains with Databricks Cloud - NY 02.16.16](https://reader031.fdocuments.us/reader031/viewer/2022030306/586f78ac1a28ab10258b6ccb/html5/thumbnails/23.jpg)
Thank you!
Sparkling Water is an open-source ML application platform combining the power of Spark and H2O.
Learn more at h2o.ai
Follow us at @h2oai