Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

44
Building Machine Learning Applications with Sparkling Water MLConf 2015 NYC Michal Malohlava and Alex Tellez and H2O.ai

Transcript of Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Page 1: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Building Machine Learning Applications with Sparkling Water

MLConf 2015 NYC

Michal Malohlava and Alex Tellez and H2O.ai

Page 2: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

TBD Head of Sales

Distributed Systems Engineers MakingML Scale!

[email protected]

Page 3: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Scalable Machine Learning

For Smarter Applications

Page 4: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Smarter Applications

Page 5: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Scalable Applications

Distributed

Able to process huge amount of data from different sources

Easy to develop and experiment

Powerful machine learning engine inside

Page 6: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

BUT how to build

them?

Page 7: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Build an application with …

Page 8: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Build an application with …

?

Page 9: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

…with Spark and H2O

Page 10: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Open-source distributed execution platform

User-friendly API for data transformation based on RDD

Platform components - SQL, MLLib, text mining

Multitenancy

Large and active community

Page 11: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Open-source scalable machine learning platform

Tuned for efficient computation and memory use

Production ready machine learning algorithms

R, Python, Java, Scala APIs

Interactive UI, robust data parser

Page 12: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC
Page 13: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Sparkling WaterProvides

Transparent integration of H2O with Spark ecosystem

Transparent use of H2O data structures and algorithms with Spark API

Platform for building Smarter Applications

Excels in existing Spark workflows requiring advanced Machine Learning algorithms

Page 14: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lets build an application !

Page 15: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC
Page 16: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

OR

Page 17: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

OR

Page 18: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

OR

Detect spam text messages

Page 19: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Data example

Page 20: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

ML Workflow

1. Extract data

2. Transform, tokenize messages

3. Build Tf-IDF

4. Create and evaluate Deep Learning model

5. Use the model

Goal: For a given text message identify if it is spam or not

Page 21: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #1: Data load

// Data loaddef load(dataFile: String): RDD[Array[String]] = { sc.textFile(dataFile).map(l => l.split(“\t")) .filter(r => !r(0).isEmpty)}

Page 22: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #2: Ad-hoc Tokenization

def tokenize(data: RDD[String]): RDD[Seq[String]] = { val ignoredWords = Seq("the", “a", …) val ignoredChars = Seq(',', ‘:’, …) val texts = data.map( r => { var smsText = r.toLowerCase for( c <- ignoredChars) { smsText = smsText.replace(c, ' ') } val words =smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length>2).distinct words.toSeq }) texts}

Page 23: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}

“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]

Thank Order

Page 24: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}

Hash words into large

space

“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]

Thank Order

Page 25: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #3: Tf-IDFdef buildIDFModel(tokens: RDD[Seq[String]], minDocFreq:Int = 4, hashSpaceSize:Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = { // Hash strings into the given space val hashingTF = new HashingTF(hashSpaceSize) val tf = hashingTF.transform(tokens) // Build term frequency-inverse document frequency val idfModel = new IDF(minDocFreq=minDocFreq).fit(tf) val expandedText = idfModel.transform(tf) (hashingTF, idfModel, expandedText)}

Hash words into large

space

Term freq scale

“Thank for the order…” […,0,3.5,0,1,0,0.3,0,1.3,0,0,…]

Thank Order

Page 26: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Lego #4: Build a modeldef buildDLModel(train: Frame, valid: Frame, epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0, hidden: Array[Int] = Array[Int](200, 200)) (implicit h2oContext: H2OContext): DeepLearningModel = { import h2oContext._ // Build a model val dlParams = new DeepLearningParameters() dlParams._destination_key = Key.make("dlModel.hex").asInstanceOf[Key[Frame]] dlParams._train = train dlParams._valid = valid dlParams._response_column = 'target dlParams._epochs = epochs dlParams._l1 = l1 dlParams._hidden = hidden // Create a job val dl = new DeepLearning(dlParams) val dlModel = dl.trainModel.get // Compute metrics on both datasets dlModel.score(train).delete() dlModel.score(valid).delete() dlModel}

Deep Learning: Create multi-layer feed forward neural networks starting w i t h an i npu t l a ye r fo l lowed by mul t ip le l a y e r s o f n o n l i n e a r transformations

Page 27: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)

Page 28: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)

Data munging

Page 29: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)

Split dataset

Data munging

Page 30: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Assembly application// Data loadval data = load(DATAFILE)// Extract response spam or hamval hamSpam = data.map( r => r(0))val message = data.map( r => r(1))// Tokenize message contentval tokens = tokenize(message)// Build IDF modelvar (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)// Merge response with extracted vectorsval resultRDD: SchemaRDD = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2))val table:DataFrame = resultRDD// Split tableval keys = Array[String]("train.hex", "valid.hex") val ratios = Array[Double](0.8) val frs = split(table, keys, ratios)val (train, valid) = (frs(0), frs(1))table.delete()// Build a modelval dlModel = buildDLModel(train, valid)

Split dataset

Build model

Data munging

Page 31: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Data exploration

Page 32: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Model evaluationval trainMetrics = binomialMM(dlModel, train)val validMetrics = binomialMM(dlModel, valid)

Page 33: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Model evaluationval trainMetrics = binomialMM(dlModel, train)val validMetrics = binomialMM(dlModel, valid)

Collect model metrics

Page 34: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}

Page 35: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}

Prepared models

Page 36: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}

Prepared models

Default decision threshold

Page 37: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Spam predictordef isSpam(msg: String, dlModel: DeepLearningModel, hashingTF: HashingTF, idfModel: IDFModel, hamThreshold: Double = 0.5):Boolean = { val msgRdd = sc.parallelize(Seq(msg)) val msgVector: SchemaRDD = idfModel.transform( hashingTF.transform ( tokenize (msgRdd))) .map(v => SMS("?", v)) val msgTable: DataFrame = msgVector msgTable.remove(0) // remove first column val prediction = dlModel.score(msgTable) prediction.vecs()(1).at(0) < hamThreshold}

Prepared models

Default decision threshold

Scoring

Page 38: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Predict spam

Page 39: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Predict spamisSpam("Michal, beer tonight in MV?")

Page 40: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Predict spamisSpam("Michal, beer tonight in MV?")

Page 41: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Predict spamisSpam("Michal, beer tonight in MV?")

isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")

Page 42: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Predict spamisSpam("Michal, beer tonight in MV?")

isSpam("We tried to contact you re your reply to our offer of a Video Handset? 750 anytime any networks mins? UNLIMITED TEXT?")

Page 43: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Checkout H2O.ai Training Books

http://learn.h2o.ai/

Checkout H2O.ai Blog

http://h2o.ai/blog/

Checkout H2O.ai Youtube Channel

https://www.youtube.com/user/0xdata

Checkout GitHub

https://github.com/h2oai/sparkling-water

Meetups

https://meetup.com/

More info

Page 44: Michal Malohlava, Software Engineer, H2O.ai at MLconf NYC

Learn more at h2o.ai Follow us at @h2oai

Thank you!Sparkling Water is

open-source ML application platform

combining power of Spark and H2O