Data By the Bay 2016 - May 17, 2016
![Page 1: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/1.jpg)
Using for NLP
Michelle Casbon
Text By the Bay
May 17, 2016
San Francisco
![Page 2: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/2.jpg)
The construction of predictive models, trained on features
extracted from raw text
![Page 3: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/3.jpg)
Turn text into numbers, do some math, and turn
it back into text.
![Page 4: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/4.jpg)
NLP in the wild
• Data ingestion
• Interactive Voice Response
• SMS prioritization
• Multilingual news
• Release feedback
• Intent to purchase
![Page 5: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/5.jpg)
Prediction
![Page 6: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/6.jpg)
Math to the rescue
$\ln\left[\frac{p}{1-p}\right] = a + BX + \varepsilon$
$\frac{p}{1-p} = e^{a + BX + \varepsilon}$
$p = \frac{1}{1 + e^{-a - BX}}$ (error term $\varepsilon$ dropped)
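The relationship between the log-odds and the probability can be checked numerically. The following is a minimal sketch in plain Python, not MLlib code; the intercept `a`, weights `B`, and feature vector `X` are hypothetical values chosen only for illustration, and the error term is omitted as in the last equation.

```python
import math

def logit(p):
    """Log-odds of a probability p: ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def logistic(a, B, X):
    """Inverse of the logit: p = 1 / (1 + e^(-a - BX)), error term omitted."""
    z = a + sum(b * x for b, x in zip(B, X))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, weights, and feature vector (illustration only).
a = -1.0
B = [0.5, 0.25, 0.1]
X = [1.0, 0.0, 3.0]

p = logistic(a, B, X)

# Applying the logit to p recovers the linear predictor a + BX.
linear_predictor = a + sum(b * x for b, x in zip(B, X))
assert abs(logit(p) - linear_predictor) < 1e-9
```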
![Page 7: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/7.jpg)
MLlib to the rescue
![Page 8: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/8.jpg)
Training Data
pipeline.fit(training)
![Page 9: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/9.jpg)
[1.0, 3.0, 7.0, …]
![Page 10: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/10.jpg)
IdiML to the rescue
https://github.com/g-c-k/idiml
![Page 11: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/11.jpg)
IdiML
• Feature extraction
• Model training
• Prediction
[Diagram: raw text ("Lorem ipsum dolor sit amet, consectetur adipiscing elit") → Feature Extraction → vector [1.0, 0.0, 3.0] → Training (labeled point [1.0, [1.0, 0.0, 3.0]]) → Prediction → PROFIT]
![Page 12: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/12.jpg)
Featurization
[Diagram: "Lorem ipsum dolor sit amet, consectetur adipiscing elit" → Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]]
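The featurization steps on this slide can be sketched as follows. This is a plain-Python illustration under stated assumptions, not IdiML's actual Scala implementation: the function names, the regex tokenizer, the whitespace-only content extraction, and the count-based vector are all hypothetical choices.

```python
import re

def extract_content(raw):
    """Extract Content: here just whitespace normalization (an assumption)."""
    return " ".join(raw.split())

def tokenize(text):
    """Tokenize: lowercase alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Bigrams (n=2) or trigrams (n=3) as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(raw):
    tokens = tokenize(extract_content(raw))
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3)

def build_lookup(documents):
    """Feature Lookup: assign each distinct feature a vector index."""
    lookup = {}
    for doc in documents:
        for feature in extract_features(doc):
            lookup.setdefault(feature, len(lookup))
    return lookup

def vectorize(raw, lookup):
    """Vector: count occurrences of each known feature; unknown ones are skipped."""
    vector = [0.0] * len(lookup)
    for feature in extract_features(raw):
        index = lookup.get(feature)
        if index is not None:
            vector[index] += 1.0
    return vector

lookup = build_lookup(["lorem ipsum dolor", "dolor sit amet"])
vector = vectorize("Lorem ipsum lorem", lookup)
```

A dense list is used here for readability; with a realistic vocabulary a sparse representation would be the more practical choice.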
![Page 13: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/13.jpg)
Model Training
[Diagram: Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage]
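The training flow on this slide pairs each vector with its classification and fits a logistic regression model. The sketch below illustrates that idea in plain Python; it deliberately swaps MLlib's L-BFGS optimizer for simple batch gradient descent, and the sample labeled points, iteration count, and step size are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled_points, iterations=500, step=0.5):
    """Fit intercept and weights by batch gradient descent (not L-BFGS)."""
    n = len(labeled_points[0][1])
    m = len(labeled_points)
    intercept, weights = 0.0, [0.0] * n
    for _ in range(iterations):
        grad_b, grad_w = 0.0, [0.0] * n
        for label, features in labeled_points:
            z = intercept + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        intercept -= step * grad_b / m
        weights = [w - step * g / m for w, g in zip(weights, grad_w)]
    return intercept, weights

# Each labeled point is (classification, vector), mirroring [1.0, [1.0, 0.0, 3.0]].
# The data below is made up for illustration.
labeled_points = [
    (1.0, [1.0, 0.0, 3.0]),
    (1.0, [2.0, 0.0, 2.0]),
    (0.0, [0.0, 1.0, 0.0]),
    (0.0, [0.0, 2.0, 1.0]),
]

intercept, weights = train(labeled_points)

def predict_probability(features):
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)
```

The returned `(intercept, weights)` pair plays the role of the stored model in this sketch; how IdiML actually persists models is not shown on the slide.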
![Page 14: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/14.jpg)
Prediction
[Diagram: New document ("Lorem ipsum dolor sit amet, consectetur adipiscing elit") → Extract Content → Tokenize → Bigrams → Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup → PROFIT]
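The last three steps of the prediction flow (model lookup, predict, classification lookup) can be sketched as below, starting from an already featurized vector. This is an illustrative plain-Python sketch: the model store, the model name, the weights, and the class names are all hypothetical and do not come from IdiML.

```python
import math

# Hypothetical stored models keyed by name: (intercept, weights).
MODEL_STORE = {"purchase-intent": (-1.0, [0.8, -1.2, 0.6])}

# Hypothetical mapping from numeric class back to a readable classification.
CLASSIFICATIONS = {0.0: "no intent", 1.0: "intent to purchase"}

def predict(model_name, vector, threshold=0.5):
    # Model Lookup: fetch the trained parameters.
    intercept, weights = MODEL_STORE[model_name]
    # Predict: logistic regression probability for this vector.
    z = intercept + sum(w * x for w, x in zip(weights, vector))
    probability = 1.0 / (1.0 + math.exp(-z))
    label = 1.0 if probability >= threshold else 0.0
    # Classification Lookup: turn the numeric class back into text.
    return CLASSIFICATIONS[label], probability

classification, probability = predict("purchase-intent", [0.0, 1.0, 4.0])
```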
![Page 15: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/15.jpg)
What makes it so great?
![Page 16: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/16.jpg)
Single object
![Page 17: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/17.jpg)
Flexibility
• Deployment environment
• Device
• Logging framework
![Page 18: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/18.jpg)
Standardization for developers
[Diagram: Core functionality, Custom ML, …, REST API, IdiML persistence layer]
![Page 19: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/19.jpg)
Version Control
![Page 20: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/20.jpg)
Hyperparameter Tuning
![Page 21: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/21.jpg)
Performance… if you have small data
| Task | Time (µs) |
| --- | --- |
| Vector prediction | 300 |
| DataFrame prediction | 7800 |

DataFrames are slow ...
![Page 22: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/22.jpg)
Performance
![Page 23: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/23.jpg)
Computing power to process the entire Twitter feed in real-time
from this: to this:
![Page 24: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/24.jpg)
What’s next for IdiML?
• Support more statistical models
• Expand automated hyperparameter tuning across the full training pipeline
• Support more options for featurization
• Generic external touchpoints
![Page 25: Data By the Bay 2016 - May 17, 2016](https://reader031.fdocuments.us/reader031/viewer/2022030316/587a69d31a28ab8a2a8b627d/html5/thumbnails/25.jpg)
Summary
• Flexibility, speed, woot!
• Continuous stream processing, woot!
• Multi-language support, woot!
• Scala & MLlib, woot!