Combining Machine Learning frameworks with Apache Spark
Tim Hunter, Hadoop Summit, June 2016
About me
• Apache Spark contributor (since Spark 0.6)
• Software Engineer @ Databricks
• Ph.D. in Machine Learning @ UC Berkeley
About Databricks
• Founded by the team who created Apache Spark
• Offers a hosted service:
– Apache Spark in the cloud
– Notebooks
– Cluster management
– Production environment
Apache Spark
• The most active open-source project in big data

Spark MLlib
• Large-scale machine learning on Apache Spark
MLlib's Mission
MLlib's mission is to make practical machine learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows
Algorithm Coverage

Classification
• Logistic regression
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron

Regression
• Ordinary least squares
• Ridge regression
• Lasso
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
• Generalized linear models

Frequent itemsets
• FP-growth
• PrefixSpan

Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power iteration clustering
• Bisecting K-Means

Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
• Kolmogorov–Smirnov test
• Online hypothesis testing
• Survival analysis

Linear algebra
• Local dense & sparse vectors & matrices
• Normal equation for least squares
• Distributed matrices: block-partitioned matrix, row matrix, indexed row matrix, coordinate matrix
• Matrix decompositions

Recommendation
• Alternating Least Squares

Feature extraction & selection
• Word2Vec
• Chi-squared selection
• Hashing term frequency
• Inverse document frequency
• Normalizer
• Standard scaler
• Tokenizer
• One-hot encoder
• StringIndexer
• VectorIndexer
• VectorAssembler
• Binarizer
• Bucketizer
• ElementwiseProduct
• PolynomialExpansion
• Quantile discretizer
• SQL transformer

Plus model import/export and Pipelines. (List based on Spark 2.0.)
Outline
• ML workflows are complex
• Spark as a scheduler
• Integration with single-machine frameworks
• Unified cross-language ML pipelines with MLlib
ML workflows are complex
• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters
• Usually, each step of a pipeline is easiest with one particular framework
ML Workflows are Complex

[Diagram: three data sources feed feature extraction; the extracted features pass through feature transforms 1–3 and are used to train models 1 and 2, which are combined in an ensemble and evaluated.]
Existing tools
• Scikit-learn
– Excellent documentation
– Standard for Python
• R
– Lots of packages available
• Pandas
– Very easy to use
• A lot of investment in tooling and education
– How to integrate big data with these tools?
Common misconceptions
• Spark is for big data only
• Spark can only work with dedicated, distributed libraries
Spark as a scheduler
• A lot of tasks in ML are "embarrassingly parallel"
• Use Spark for data management and for scheduling, as in the sketch below
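A minimal sketch of the idea, assuming PySpark; run_experiment is a hypothetical stand-in for any single-machine computation:

    from pyspark import SparkContext

    sc = SparkContext(appName="spark-as-a-scheduler")

    def run_experiment(seed):
        # Hypothetical stand-in: any independent single-machine task.
        import random
        random.seed(seed)
        return sum(random.random() for _ in range(1000))

    # Spark only schedules the independent tasks and collects the results.
    results = sc.parallelize(range(8)).map(run_experiment).collect()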
One example: learning digits
• Learning task: given a set of images, recognize the digits
• Standard benchmark dataset in computer vision built by NIST (MNIST)
Training Deep Learning algorithms
• Training a neural network is hard:
– It is a sequential procedure (present one image after the other to learn from)
– It can be sensitive to noise and to the order of images: robustness analysis is critical
– Tuning the training parameters (descent rate, batch sizes, etc.) is very important; otherwise learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.
TensorFlow as a training library
• Many algorithms have been proposed for this task; we will use TensorFlow, from Google:
– Popular choice for neural network training and deep learning
– Competitive performance
– Easy to experiment with
– Python interface makes it easy to integrate with Spark
Distributing TensorFlow computations
• Even though TensorFlow is used as a single-machine library, we get speedups from Spark
Distributed Cross Validation

[Diagram, shown over two slides: Spark schedules independent training runs (Model #1 through Model #6) across the cluster and keeps the best model.]
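A hedged sketch of this pattern (not the talk's actual code), reusing sc from the earlier sketch; train_and_evaluate is a hypothetical stand-in for a real single-machine TensorFlow run:

    import itertools

    learning_rates = [0.001, 0.01, 0.1]
    layer_sizes = [64, 128, 256]
    grid = list(itertools.product(learning_rates, layer_sizes))

    def train_and_evaluate(params):
        lr, size = params
        # In the real setting this would import tensorflow, build the
        # two-layer network, train on the digit images, and return the
        # validation accuracy. A dummy score keeps the sketch runnable.
        return (params, 1.0 - abs(lr - 0.01) - abs(size - 128) / 1000.0)

    # One partition per combination, so all runs proceed in parallel.
    scores = sc.parallelize(grid, len(grid)).map(train_and_evaluate).collect()
    best_params, _ = max(scores, key=lambda kv: kv[1])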
Results
• Running a 2-layer neural network, testing different update rates and different layer sizes

[Bar chart comparing runs on 1 node, 2 nodes, and 13 nodes; y-axis scale 0 to 12000.]
Embedding deep learning in Spark
• The best-known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark can help iterate fast and find a good set of parameters
A data scientist's wish list:
• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
– Only distribute as needed
– Easily switch between local & distributed settings
Example: sentiment analysis

Given a review (text), predict the user's rating.
Data from https://snap.stanford.edu/data/web-Amazon.html
ML Workflow

[Diagram: Load data → Extract features → Train model → Evaluate.]

Example record: Review: "This product doesn't seem to be made to last…" Rating: 2
After feature extraction: feature_vector: [0.1 -1.3 0.23 … -0.74], rating: 2.0
The trained model is a regression function: (review: String) => Double
Load Data

Data sources for DataFrames:
• Built-in: JSON, JDBC, LIBSVM, and more
• External data source packages
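A minimal sketch of the loading step, assuming the Spark 2.0 SparkSession API; the file paths are hypothetical placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sentiment-analysis").getOrCreate()

    # Built-in JSON source: one review record per line (hypothetical path).
    reviews = spark.read.json("/data/amazon-reviews.json")

    # Other built-in sources use the same reader interface, e.g. LIBSVM:
    # features = spark.read.format("libsvm").load("/data/features.libsvm")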
Extract Features

[Diagram, refined over three slides: within the workflow, the Extract-features stage is expanded into a Tokenizer followed by hashed term frequency, and the Train-model stage becomes linear regression. Example flow: Review: "This product doesn't seem to be made to last…" → words: [this, product, doesn't, seem, to, …] → feature_vector: [0.1 -1.3 0.23 … -0.74] → Prediction: 3.0]
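A hedged sketch of this workflow with the pyspark.ml Pipelines API; the column names ("review", "rating") and the DataFrames training_df and test_df are assumptions for illustration:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.regression import LinearRegression

    tokenizer = Tokenizer(inputCol="review", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="rating")

    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

    model = pipeline.fit(training_df)        # fits all stages at once
    predictions = model.transform(test_df)   # applies the fitted pipeline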
Our ML workflow

[Diagram: Feature Extraction feeds Model Training, wrapped in Cross Validation over the regularization parameter: {0.0, 0.1, ...}]
Cross validation

[Diagram: after a single feature-extraction pass, cross-validation trains Model #1, Model #2, Model #3, … in parallel and keeps the best model.]
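A sketch of the cross-validation step, reusing pipeline and lr from above; the grid extends the slide's {0.0, 0.1, ...}, and numFolds=3 is an assumption:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    param_grid = (ParamGridBuilder()
                  .addGrid(lr.regParam, [0.0, 0.1, 1.0])
                  .build())

    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=param_grid,
                        evaluator=RegressionEvaluator(labelCol="rating"),
                        numFolds=3)

    cv_model = cv.fit(training_df)
    best_model = cv_model.bestModel   # pipeline fitted with the best regParam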
DataFrame-based API for MLlib
a.k.a. the "Pipelines" API, with utilities for constructing ML pipelines

In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted on by the community
• org.apache.spark.ml, pyspark.ml

The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
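The two namespaces side by side, as a minimal illustration (both classes exist in Spark 2.0):

    # DataFrame-based "Pipelines" API -- primary from 2.0 on:
    from pyspark.ml.classification import LogisticRegression

    # RDD-based API -- entering maintenance mode:
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS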
Why ML persistence?

[Diagram: without persistence, work is split across teams. Data science: prototype in Python/R and create the model. Software engineering: re-implement the model for production (Java) and deploy it.]
Why ML persistence?

[Diagram, continued: data science creates a whole Pipeline (extract raw features, transform features, select key features, fit multiple models, combine results to make a prediction); software engineering must re-implement the Pipeline for production (Java) and deploy it. The costs: extra implementation work, different code paths, synchronization overhead.]
With ML persistence...

[Diagram: data science prototypes in Python/R and creates the Pipeline; the model or Pipeline is persisted with model.save("s3n://..."), then loaded in Scala/Java with Model.load("s3n://…") and deployed in production.]
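A minimal sketch of the round trip from Python, assuming the cross-validated model from earlier; the local path is a hypothetical stand-in for the slide's S3 location:

    from pyspark.ml import PipelineModel

    # Persist the fitted pipeline (metadata as JSON, model data as Parquet).
    cv_model.bestModel.save("/tmp/sentiment-model")

    # Reload it later -- from Python here, or from Scala/Java in production.
    reloaded = PipelineModel.load("/tmp/sentiment-model")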
ML persistence status
• Near-complete coverage in all Spark language APIs
– Scala & Java: complete (29 feature transformers, 21 models)
– Python: complete except for 2 algorithms
– R: complete for existing APIs
• Single underlying implementation of models
• Exchangeable data format
– JSON for metadata
– Parquet for model data (coefficients, etc.)
A data scientist's wish list:
• Run original code on a production environment
– Directly apply learned pipelines
– Use MLlib as the export format
• Use distributed data sources
– Built-in Spark conversions
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
– Easy to distribute the most common ML tasks
What's next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles

GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)
Get started
• Get involved: roadmap JIRA (SPARK-15581) + mailing lists
• ML persistence blog post: http://databricks.com/blog/2016/05/31
• Try out the Apache Spark 2.0 preview release: http://databricks.com/try
Thank you!
spark.apache.org
spark-packages.org
databricks.com