Machine Learning in Spark - IBM Research, 2015-10-07
© 2014 IBM Corporation
Machine Learning in Spark
Shelly Garion
IBM Research -- Haifa
© 2015 IBM Corporation
Spark MLlib
Large Scale Machine Learning on Apache Spark
Meng et al., “MLlib: Machine Learning in Apache Spark”, arXiv:1505.06807, 2015
Why MLlib?
GraphLab?
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Machine Learning Algorithms
Classification:
– Logistic regression
– Linear support vector machine (SVM)
– Naïve Bayes
– Decision trees and forests
Regression:
– Generalized linear regression (GLM)
Recommendation:
– Alternating least squares (ALS)
Clustering:
– K-means and Streaming K-means
– Gaussian mixture
– Power iteration clustering (PIC)
– Latent Dirichlet allocation (LDA)
Dimensionality reduction:
– Singular value decomposition (SVD)
– Principal component analysis (PCA)
Feature extraction & selection
…
See: https://spark.apache.org/docs/latest/mllib-guide.html
Performance of MLlib
MLlib is built on Apache Spark, a fast and general engine for large-scale data processing.
According to the Spark project page, programs run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Reza Zadeh, CME 323: Distributed Algorithms and Optimization, Stanford University, http://stanford.edu/~rezab/dao/
https://spark.apache.org/
Logistic Regression
Performance of MLlib
Speed-up between MLlib versions
Meng et al., “MLlib: Machine Learning in Apache Spark”, arXiv:1505.06807, 2015
Example: K-Means Clustering
Goal:
Segment tweets into clusters by geolocation using Spark MLlib K-means clustering
https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/
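As a rough illustration of the clustering goal above, here is a minimal plain-Python sketch of the K-means (Lloyd) iteration on (latitude, longitude) pairs. It is not the MLlib API and not the code from the cited post: the sample coordinates are made up, the farthest-first seeding is an assumption chosen to keep the run deterministic, and plain Euclidean distance on raw coordinates is a simplification.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (latitude, longitude) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def closest_center(point, centers):
    """Index of the center nearest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def initial_centers(points, k):
    """Farthest-first seeding. This is an assumption of the sketch, picked
    because it is deterministic; it is not claimed to be MLlib's seeding."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans(points, k, max_iterations=20):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = initial_centers(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members)))
            else:
                # Keep the old center if no point was assigned to it.
                new_centers.append(centers[i])
        if new_centers == centers:
            break  # assignments are stable, so the centers stopped moving
        centers = new_centers
    return centers


# Hypothetical tweet locations around three cities (illustrative values only).
tweets = [
    (40.71, -74.00), (40.73, -73.99), (40.69, -74.02),   # around New York
    (51.51, -0.13), (51.50, -0.10), (51.52, -0.12),      # around London
    (35.68, 139.69), (35.70, 139.70), (35.66, 139.71),   # around Tokyo
]
centers = kmeans(tweets, k=3)
labels = [closest_center(p, centers) for p in tweets]
```

In a Spark job the same two steps (assign, then recompute means) would run over a distributed collection of points; this sketch only shows the logic on a small in-memory list.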
Spark Ecosystem: Spark SQL & MLlib
// SVM using Stochastic Gradient Descent
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Spark Ecosystem: Spark Streaming & MLlib
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Spark Ecosystem: GraphX & MLlib
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Machine Learning Pipeline with Spark
Data pre-processing
Feature extraction
Model fitting
Model training
Validation
Model prediction
Machine Learning Pipeline with Spark
Patrick Wendell, Matei Zaharia, “Spark community update”, https://spark-summit.org/2015/events/keynote-1/
Machine Learning Pipeline with Spark
ML Dataset:
– DataFrame from Spark SQL, which can have different columns storing text, feature vectors, true labels, and predictions
Transformer:
– Feature transformers (e.g., OneHotEncoder)
– Trained ML models (e.g., LogisticRegressionModel)
Estimator:
– ML algorithms for training models (e.g., LogisticRegression)
Evaluator:
– Evaluates predictions and computes metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator)
Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow
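The roles listed above can be mimicked in a few lines of plain Python. This is only a sketch of the contracts (a Transformer maps a dataset to a dataset, an Estimator is fit to produce a Transformer, a Pipeline chains both), not the Spark ML API. Rows are plain dicts standing in for DataFrame rows, and `WordCount`, `ThresholdClassifier`, `ThresholdModel` and `accuracy` are hypothetical names invented for the example.

```python
class Transformer:
    """Maps a dataset (list of row dicts) to a new dataset with extra columns."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer (a model)."""
    def fit(self, rows):
        raise NotImplementedError


class WordCount(Transformer):
    """Hypothetical feature transformer: adds 'features' = word count of 'text'."""
    def transform(self, rows):
        return [dict(r, features=len(r["text"].split())) for r in rows]


class ThresholdModel(Transformer):
    """Hypothetical trained model: predicts 1.0 when the feature exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, prediction=1.0 if r["features"] > self.threshold else 0.0)
                for r in rows]


class ThresholdClassifier(Estimator):
    """Hypothetical toy algorithm: threshold halfway between the class means.
    Assumes the positive class has the larger feature values."""
    def fit(self, rows):
        pos = [r["features"] for r in rows if r["label"] == 1.0]
        neg = [r["features"] for r in rows if r["label"] == 0.0]
        return ThresholdModel((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0)


class PipelineModel(Transformer):
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains Transformers and Estimators; fitting yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # an Estimator becomes its fitted model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


def accuracy(rows):
    """Hypothetical evaluator: fraction of rows whose prediction equals the label."""
    return sum(1 for r in rows if r["prediction"] == r["label"]) / len(rows)


training = [
    {"text": "spark", "label": 0.0},
    {"text": "hello world", "label": 0.0},
    {"text": "machine learning pipelines in spark are composable", "label": 1.0},
    {"text": "a long sentence with quite a few words", "label": 1.0},
]
held_out = [
    {"text": "short text", "label": 0.0},
    {"text": "this one has many more words in it", "label": 1.0},
]

model = Pipeline([WordCount(), ThresholdClassifier()]).fit(training)
predictions = model.transform(held_out)
score = accuracy(predictions)
```

The point of the design is that the fitted pipeline is itself a Transformer, so the same feature steps used during training are replayed automatically at prediction time.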
Machine Learning Pipeline with Spark
Learning: (diagram not preserved)
Model: (diagram not preserved)
https://spark.apache.org/docs/latest/ml-guide.html
Example: Alternating Least Squares (ALS)
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Reza Zadeh, CME 323: Distributed Algorithms and Optimization, Stanford University, http://stanford.edu/~rezab/dao/
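Since the ALS slides survive here only as titles, the following is a minimal plain-Python sketch of the alternating idea, under strong simplifying assumptions: a single latent factor per user and per item (rank 1), a small made-up ratings set, and a plain L2 penalty `reg`. It is not MLlib's implementation, which handles higher ranks and distributes the work across partitions. With the item factors fixed, each user factor has a closed-form least-squares solution, and vice versa, so the two updates are alternated.

```python
def als_rank1(ratings, num_users, num_items, reg=0.001, iterations=500):
    """Rank-1 alternating least squares on observed (user, item, rating) triples.

    Minimizes the squared error over observed ratings plus
    reg * (sum of squared user factors + sum of squared item factors).
    """
    user = [1.0] * num_users
    item = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor in closed form.
        for u in range(num_users):
            rated = [(i, r) for (uu, i, r) in ratings if uu == u]
            numerator = sum(r * item[i] for i, r in rated)
            denominator = reg + sum(item[i] ** 2 for i, r in rated)
            user[u] = numerator / denominator
        # Fix the user factors and solve for each item factor in closed form.
        for i in range(num_items):
            rated = [(u, r) for (u, ii, r) in ratings if ii == i]
            numerator = sum(r * user[u] for u, r in rated)
            denominator = reg + sum(user[u] ** 2 for u, r in rated)
            item[i] = numerator / denominator
    return user, item


# Hypothetical ratings generated from user factors (1, 2, 3) and
# item factors (2, 1, 0.5); entries (0, 2) and (2, 0) are held out.
observed = [
    (0, 0, 2.0), (0, 1, 1.0),
    (1, 0, 4.0), (1, 1, 2.0), (1, 2, 1.0),
    (2, 1, 3.0), (2, 2, 1.5),
]
user_factors, item_factors = als_rank1(observed, num_users=3, num_items=3)

# Predicted ratings for the two held-out cells (true values 0.5 and 6.0).
predicted_0_2 = user_factors[0] * item_factors[2]
predicted_2_0 = user_factors[2] * item_factors[0]
```

For a rank above 1 each update becomes a small regularized linear system per user or item instead of a scalar division; the alternation itself stays the same.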
Implementation of ALS in Spark MLlib
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/
Reza Zadeh, CME 323: Distributed Algorithms and Optimization, Stanford University, http://stanford.edu/~rezab/dao/
Implementation of ALS in Spark MLlib
Xiangrui Meng, “A more scalable way of making recommendations with MLlib”, Spark Summit 2015,
http://www.slideshare.net/SparkSummit/a-more-scaleable-way-of-making-recommendations-with-mllib-xiangrui-meng
References
Meng et al., “MLlib: Machine Learning in Apache Spark”, arXiv:1505.06807, 2015
https://spark.apache.org/docs/latest/mllib-guide.html
Xiangrui Meng, Joseph Bradley, Evan Sparks and Shivaram Venkataraman, “ML Pipelines: A New High-Level API for MLlib”, Databricks blog,
https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
Joseph Bradley, Xiangrui Meng and Burak Yavuz, “New Features in Machine Learning Pipelines in Spark 1.4”, Databricks blog,
https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html
Joseph Bradley, “Building, Debugging, and Tuning Spark Machine Learning Pipelines”, Spark Summit 2015,
https://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2/
Xiangrui Meng, “A more scalable way of making recommendations with MLlib”, Spark Summit 2015,
https://spark-summit.org/2015/events/a-more-scalable-way-of-making-recommendations-with-mllib/
Joseph Bradley, “Practical Machine Learning Pipelines with MLlib”, Spark Summit East 2015,
https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-22-Joseph-Bradley.pdf
Xiangrui Meng, “Sparse data support in MLlib”, Spark Summit 2014,
https://spark-summit.org/2014/wp-content/uploads/2014/07/sparse_data_support_in_mllib1.pdf
Xiangrui Meng, “MLlib: scalable machine learning on Spark”, Spark Workshop, April 2014,
http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
Ameet Talwalkar et al., BerkeleyX: CS190.1x Scalable Machine Learning.
Reza Zadeh, CME 323: Distributed Algorithms and Optimization, Stanford University, http://stanford.edu/~rezab/dao/