Building Machine Learning Applications with Sparkling Water

Building Machine Learning Applications with Sparkling Water NYC Big Data Science Meetup Michal Malohlava and Alex Tellez and H2O.ai


Page 1: Building Machine Learning Applications with Sparkling Water

Building Machine Learning Applications with Sparkling Water

NYC Big Data Science Meetup

Michal Malohlava and Alex Tellez and H2O.ai

Page 2: Building Machine Learning Applications with Sparkling Water

Who am I?

Background

PhD in CS from Charles University in Prague, 2012

1 year PostDoc at Purdue University experimenting with algos for large-scale computation

2 years at H2O.ai helping to develop H2O engine for big data computation and analysis

Experience with domain-specific languages, distributed systems, software engineering, and big data.

Page 3: Building Machine Learning Applications with Sparkling Water

TBD Head of Sales

Distributed Systems Engineers Making ML Scale!

[email protected]

Page 4: Building Machine Learning Applications with Sparkling Water

Scalable Machine Learning

For Smarter Applications

Page 5: Building Machine Learning Applications with Sparkling Water

Smarter Applications

Page 6: Building Machine Learning Applications with Sparkling Water

Scalable Applications

Distributed

Easy to experiment

Able to process huge data from different sources

Powerful machine learning engine inside

Page 7: Building Machine Learning Applications with Sparkling Water

BUT how to build them?

Page 8: Building Machine Learning Applications with Sparkling Water

Build an application with …

?

Page 9: Building Machine Learning Applications with Sparkling Water

…with Spark and H2O

Page 10: Building Machine Learning Applications with Sparkling Water

Open-source distributed execution platform

User-friendly API for data transformation based on RDD

Platform components - SQL, MLlib, text mining

Multitenancy

Large and active community

Page 11: Building Machine Learning Applications with Sparkling Water

Open-source scalable machine learning platform

Tuned for efficient computation and memory use

Mature machine learning algorithms

R, Python, Java, Scala APIs

Interactive UI

Page 12: Building Machine Learning Applications with Sparkling Water

Supervised Learning

Statistical Analysis

• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Cox Proportional Hazards Models

• Naïve Bayes

Ensembles

• Distributed Random Forest: Classification or regression models

• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations

Deep Neural Networks

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning

Clustering

• K-means: Partitions observations into k clusters/groups of the same spatial size

Dimensionality Reduction

• Principal Component Analysis: Linearly transforms correlated variables to independent components

Anomaly Detection

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

Page 13: Building Machine Learning Applications with Sparkling Water
Page 14: Building Machine Learning Applications with Sparkling Water

Sparkling Water

Provides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform to build Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Page 15: Building Machine Learning Applications with Sparkling Water

Sparkling Water Design

[Diagram: spark-submit passes the Sparkling App, which contains application and Sparkling Water classes, to the Spark Master JVM. The diagram shows three Spark Worker JVMs and three Spark Executor JVMs, each with H2O inside, forming the Sparkling Water Cluster.]

Page 16: Building Machine Learning Applications with Sparkling Water

Data Distribution

[Diagram: a Data Source (e.g. HDFS), a Spark RDD and an H2O RDD across three Spark Executor JVMs, each running H2O, inside the Sparkling Water Cluster.]

RDDs and DataFrames share the same memory space

Page 17: Building Machine Learning Applications with Sparkling Water

Development Internals

Sparkling Water Assembly

H2O Core

H2O Algos

H2O Scala API

H2O Flow

Sparkling Water Core

Spark Platform

Spark Core

Spark SQL

+ Application Code

Assembly is deployed to the Spark cluster as a regular Spark application

Page 18: Building Machine Learning Applications with Sparkling Water

Let's build an application!

Page 19: Building Machine Learning Applications with Sparkling Water

OR

Detect spam text messages

Page 20: Building Machine Learning Applications with Sparkling Water

Data example

case class SMS(target: String, fv: Vector)

Page 21: Building Machine Learning Applications with Sparkling Water

ML Workflow

1. Extract data

2. Transform, tokenize messages

3. Build Tf-IDF

4. Create and evaluate Deep Learning model

5. Use the model

Goal: For a given text message, identify whether it is spam or not

Page 22: Building Machine Learning Applications with Sparkling Water

Application environment

Page 23: Building Machine Learning Applications with Sparkling Water

Lego #1: Data load

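import org.apache.spark.rdd.RDD

// Data load (sc is the SparkContext provided by the Spark shell)
def load(dataFile: String): RDD[Array[String]] = {
  // Split each tab-separated line and drop rows with an empty first column
  sc.textFile(dataFile).map(l => l.split("\t"))
    .filter(r => !r(0).isEmpty)
}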

Page 24: Building Machine Learning Applications with Sparkling Water

Lego #2: Ad-hoc Tokenization

def tokenize(data: RDD[String]): RDD[Seq[String]] = {
  val ignoredWords = Seq("the", "a", …)
  val ignoredChars = Seq(',', ':', …)
  val texts = data.map( r => {
    var smsText = r.toLowerCase
    // Replace every ignored character with a space
    for( c <- ignoredChars) {
      smsText = smsText.replace(c, ' ')
    }
    // Drop ignored and very short words, keep each remaining word once
    val words = smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct
    words.toSeq
  })
  texts
}
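The tokenization step can be tried without a Spark cluster. The following is a minimal sketch on a plain String rather than an RDD; the object name TokenizeSketch and the concrete stop-word and character lists are illustrative assumptions, since the slide elides the real lists.

```scala
object TokenizeSketch {
  // Illustrative values only: the slide elides the real lists.
  val ignoredWords: Set[String] = Set("the", "a", "and")
  val ignoredChars: Set[Char] = Set(',', ':', '?', '!', '.')

  def tokenize(text: String): Seq[String] = {
    // Lower-case the message and blank out every ignored character
    val cleaned = text.toLowerCase.map(c => if (ignoredChars.contains(c)) ' ' else c)
    // Split on spaces, drop ignored and short words, keep each word once
    cleaned.split(" ").toSeq
      .filter(w => !ignoredWords.contains(w) && w.length > 2)
      .distinct
  }
}
```

For example, TokenizeSketch.tokenize("Thank you for the order, Michal!") keeps "thank", "you", "for", "order" and "michal", in that order.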

Page 25: Building Machine Learning Applications with Sparkling Water

Lego #3: Tf-IDF

import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector

def buildIDFModel(tokens: RDD[Seq[String]],
                  minDocFreq:Int = 4,
                  hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
  // Hash strings into the given space
  val hashingTF = new HashingTF(hashSpaceSize)
  val tf = hashingTF.transform(tokens)
  // Build term frequency-inverse document frequency
  val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf)
  val expandedText = idfModel.transform(tf)
  (hashingTF, idfModel, expandedText)
}

Hash words into large space

Term freq scale

“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]

Thank Order
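To make the two annotations concrete, here is a small Spark-free sketch of the hashing trick and the IDF scaling. It is not the MLlib implementation: the object name TfIdfSketch and its helpers are hypothetical, the hash function is plain String.hashCode, and the formula log((m + 1) / (df + 1)) with a minDocFreq cut-off follows what the MLlib documentation describes, to the best of my understanding.

```scala
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures) by hashing
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector: bucket index -> count within one document
  def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
    doc.groupBy(bucket(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }

  // IDF per bucket: log((m + 1) / (df + 1)), where m is the number of documents
  // and df the number of documents using the bucket; rare buckets get 0
  def inverseDocFrequencies(tfs: Seq[Map[Int, Double]], minDocFreq: Int): Map[Int, Double] = {
    val m = tfs.size
    val df = tfs.flatMap(_.keys).groupBy(identity).map { case (i, occ) => i -> occ.size }
    df.map { case (i, d) =>
      i -> (if (d >= minDocFreq) math.log((m + 1.0) / (d + 1.0)) else 0.0)
    }
  }

  // Scale each term frequency by the IDF weight of its bucket
  def tfIdf(tf: Map[Int, Double], idf: Map[Int, Double]): Map[Int, Double] =
    tf.map { case (i, f) => i -> f * idf.getOrElse(i, 0.0) }
}
```

A bucket used by every document receives weight log(1) = 0, which is why very common words contribute little to the final vector.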

Page 26: Building Machine Learning Applications with Sparkling Water

Lego #4: Build a model

def buildDLModel(train: Frame, valid: Frame,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  // Build a model
  val dlParams = new DeepLearningParameters()
  dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]]
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = 'target
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2 // the slide accepts l2 but never passes it on
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get
  // Compute metrics on both datasets
  dlModel.score(train).delete()
  dlModel.score(valid).delete()
  dlModel
}

Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Page 27: Building Machine Learning Applications with Sparkling Water

Assembly application

// Data load
val data = load(DATAFILE)
// Extract response spam or ham
val hamSpam = data.map( r => r(0))
val message = data.map( r => r(1))
// Tokenize message content
val tokens = tokenize(message)
// Build IDF model
var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)
// Merge response with extracted vectors
val resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))
val table:DataFrame = resultRDD
// Split table
val keys = Array[String]("train.hex", "valid.hex")
val ratios = Array[Double](0.8)
val frs = split(table, keys, ratios)
val (train, valid) = (frs(0), frs(1))
table.delete()
// Build a model
val dlModel = buildDLModel(train, valid)

Split dataset

Build model

Data munging
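The split helper used in the assembly is not shown on the slides. As a rough stand-in, assuming it behaves like a ratio-based random split where a single ratio of 0.8 yields an 80% part and a 20% remainder, the idea can be sketched on a local Seq. The object name SplitSketch, the seed parameter and the rounding rule are assumptions made for this illustration only.

```scala
import scala.util.Random

object SplitSketch {
  // ratios lists every part except the last one, which receives the remainder
  def split[A](rows: Seq[A], ratios: Seq[Double], seed: Long = 42L): Seq[Seq[A]] = {
    require(ratios.forall(_ > 0) && ratios.sum < 1.0, "ratios must be positive and sum to < 1")
    // Shuffle deterministically so the split is reproducible for a given seed
    val shuffled = new Random(seed).shuffle(rows)
    val n = rows.size
    // Cumulative cut points, e.g. Seq(0.8) on 10 rows gives 0, 8, 10
    val cuts = ratios.scanLeft(0.0)(_ + _).map(r => math.round(r * n).toInt) :+ n
    cuts.sliding(2).map { case Seq(from, to) => shuffled.slice(from, to) }.toSeq
  }
}
```

With ten rows and ratios Seq(0.8), the result has two parts of sizes 8 and 2, matching the two keys ("train.hex", "valid.hex") paired with one ratio in the assembly code.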

Page 28: Building Machine Learning Applications with Sparkling Water

Data exploration

Page 29: Building Machine Learning Applications with Sparkling Water

Model evaluation

val trainMetrics = binomialMM(dlModel, train)
val validMetrics = binomialMM(dlModel, valid)

Collect model metrics
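The slides do not show what binomialMM returns. As an illustration of the kind of figures a binomial (two-class) evaluation typically reports, the sketch below computes a confusion matrix and a few derived metrics by hand on local label sequences. The names BinomialMetricsSketch and Confusion are hypothetical and are not part of H2O.

```scala
object BinomialMetricsSketch {
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  }

  // actual and predicted hold labels such as "spam" / "ham";
  // positive names the class of interest
  def confusion(actual: Seq[String], predicted: Seq[String],
                positive: String = "spam"): Confusion = {
    require(actual.size == predicted.size, "label sequences must have equal length")
    val pairs = actual.zip(predicted)
    Confusion(
      tp = pairs.count { case (a, p) => a == positive && p == positive },
      fp = pairs.count { case (a, p) => a != positive && p == positive },
      tn = pairs.count { case (a, p) => a != positive && p != positive },
      fn = pairs.count { case (a, p) => a == positive && p != positive }
    )
  }
}
```

Comparing these numbers on the training and validation sets is one simple way to spot overfitting: a large gap suggests the model has memorized the training messages.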

Page 30: Building Machine Learning Applications with Sparkling Water

Spam predictor

def isSpam(msg: String,
           dlModel: DeepLearningModel,
           hashingTF: HashingTF,
           idfModel: IDFModel,
           hamThreshold: Double = 0.5):Boolean = {
  val msgRdd = sc.parallelize(Seq(msg))
  val msgVector: SchemaRDD = idfModel.transform(
                               hashingTF.transform (
                                 tokenize (msgRdd))).map(v => SMS("?", v))
  val msgTable: DataFrame = msgVector
  msgTable.remove(0) // remove first column
  val prediction = dlModel.score(msgTable)
  prediction.vecs()(1).at(0) < hamThreshold
}

Prepared models

Default decision threshold

Scoring
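The final line of the predictor compares one prediction value against hamThreshold. Assuming that value is the predicted probability of "ham" (an interpretation of the parameter name, not something the slide states), the decision rule and the effect of moving the threshold can be sketched as follows. ThresholdSketch and flagged are hypothetical names.

```scala
object ThresholdSketch {
  // pHam: assumed predicted probability that the message is ham
  def isSpam(pHam: Double, hamThreshold: Double = 0.5): Boolean =
    pHam < hamThreshold

  // Number of messages flagged as spam at a given threshold
  def flagged(pHams: Seq[Double], hamThreshold: Double): Int =
    pHams.count(isSpam(_, hamThreshold))
}
```

Raising the threshold flags more messages as spam, at the cost of more legitimate messages being caught; lowering it does the opposite.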

Page 31: Building Machine Learning Applications with Sparkling Water

Predict spam

isSpam("Michal, beer tonight in MV?", dlModel, hashingTF, idfModel)

isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?", dlModel, hashingTF, idfModel)

Page 32: Building Machine Learning Applications with Sparkling Water

Interactions with application from R

Page 33: Building Machine Learning Applications with Sparkling Water

Where is the code?

https://github.com/h2oai/sparkling-water/blob/master/examples/scripts/

Page 34: Building Machine Learning Applications with Sparkling Water

Sparkling Water Download

http://h2o.ai/download/

http://h2o-release.s3.amazonaws.com/sparkling-water/master/91/index.html

Page 35: Building Machine Learning Applications with Sparkling Water

More info

Check out H2O.ai Training Books

http://learn.h2o.ai/

Check out H2O.ai Blog

http://h2o.ai/blog/

Check out H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Check out GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

Page 36: Building Machine Learning Applications with Sparkling Water

Learn more at h2o.ai Follow us at @h2oai

Thank you!

Sparkling Water is an open-source ML application platform combining the power of Spark and H2O