Large-Scale Machine Learning with Apache Spark

Spark is a new cluster computing engine that is rapidly gaining popularity: with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with machine learning among the earliest focus areas. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark’s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We’ll also cover upcoming development work including new built-in algorithms and R bindings.

Bio: Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford is on randomized algorithms for large-scale linear regression.

Transcript of Large-Scale Machine Learning with Apache Spark

  • MLlib: Scalable Machine Learning on Spark Xiangrui Meng
  • What is MLlib?
  • What is MLlib? MLlib is a Spark subproject providing machine learning primitives: initial contribution from AMPLab, UC Berkeley; shipped with Spark since version 0.8; 35 contributors.
  • What is MLlib? Algorithms: classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree; regression: generalized linear models (GLMs), regression tree; collaborative filtering: alternating least squares (ALS); clustering: k-means; decomposition: singular value decomposition (SVD), principal component analysis (PCA).
  • Why MLlib?
  • scikit-learn? Algorithms: classification: SVM, nearest neighbors, random forest, ...; regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, ...; clustering: k-means, spectral clustering, ...; decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), ...
  • Mahout? Algorithms: classification: logistic regression, naive Bayes, random forest, ...; collaborative filtering: ALS, ...; clustering: k-means, fuzzy k-means, ...; decomposition: SVD, randomized SVD, ...
  • Vowpal Wabbit? H2O? R? MATLAB? Mahout? Weka? scikit-learn? LIBLINEAR? GraphLab?
  • Why MLlib?
  • Why MLlib? It is built on Apache Spark, a fast and general engine for large-scale data processing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Write applications quickly in Java, Scala, or Python.
  • Spark philosophy: Make life easy and productive for data scientists: well documented, expressive APIs; powerful domain-specific libraries; easy integration with storage systems and caching to avoid data movement.
  • Word count (Scala)
    val counts = sc.textFile("hdfs://...")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
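As a rough single-machine analogy of the Scala word count above, the following plain-Python sketch mimics the three stages (`flatMap`, `map`, `reduceByKey`) without using Spark at all. The function name `word_count` and the sample lines are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Single-machine analogue of flatMap -> map -> reduceByKey."""
    # flatMap: split every line into words and flatten the result
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): sum the counts that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


sample = ["spark is fast", "spark is general"]  # made-up input
counts = word_count(sample)
print(counts)
```

In Spark the same three stages run in parallel over partitions of the input, with the per-key sums combined across machines; the sketch only shows the data flow.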
  • Word count (SparkR)
    lines ...
    val features = Vectors.dense(row(1), row(2), row(3))
    LabeledPoint(row(0), features)
    }
    val model = SVMWithSGD.train(training)
  • Streaming + MLlib
    // collect tweets using streaming
    // train a k-means model
    val model: KMeansModel = ...
    // apply model to filter tweets
    val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
    val statuses = ...
    val filteredTweets =
      statuses.filter(t => model.predict(featurize(t)) == clusterNumber)
    // print tweets within this particular cluster
    filteredTweets.print()
  • GraphX + MLlib
    // assemble link graph
    val graph = Graph(pages, links)
    val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices
    // load page labels (spam or not) and content features
    val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
    val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map {
      case (id, ((label, features), pageRank)) =>
        // append PageRank as feature 1000; the vector size 1001 is an assumption
        LabeledPoint(label, Vectors.sparse(1001, features :+ (1000 -> pageRank)))
    }
    // train a spam detector using logistic regression
    val model = LogisticRegressionWithSGD.train(training)
  • Why MLlib? Built on Spark's lightweight yet powerful APIs. Spark's open source community. Seamless integration with Spark's other components. Comparable to or even better than other libraries specialized in large-scale machine learning.
  • Why MLlib? Scalability; performance; user-friendly APIs; integration with Spark and its other components.
  • Logistic regression
  • Logistic regression - weak scaling. Full dataset: 200K images, 160K dense features. Similar weak scaling. MLlib within a factor of 2 of VW's wall-clock time. [Figures: wall time (s) for MLbase/MLlib, VW, and Matlab at n = 6K, 12.5K, 25K, 50K, 100K, 200K with d = 160K; "Fig. 6: Weak scaling for logistic regression", relative wall time vs. number of machines for MLbase/MLlib, VW, and Ideal.]
  • Logistic regression - strong scaling. Fixed dataset: 50K images, 160K dense features. MLlib exhibits better scaling properties. MLlib is faster than VW with 16 and 32 machines. [Figures: "Fig. 7: Walltime for strong scaling for logistic regression", wall time (s) on 1, 2, 4, 8, 16, and 32 machines for MLbase/MLlib, VW, and Matlab; "Fig. 8: Strong scaling for logistic regression", speedup vs. number of machines for MLbase/MLlib, VW, and Ideal.]
  • Collaborative filtering
  • Collaborative filtering: Recover a rating matrix from a subset of its entries.
  • Alternating least squares (ALS)
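To make the idea of alternating least squares concrete, here is a minimal single-machine sketch for the rank-1 case: fix the item factors and solve a one-dimensional ridge regression per user, then swap roles and repeat. This is not MLlib's implementation; the function name `als_rank1`, the regularization value, and the tiny rating set are assumptions chosen for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=1e-6, iters=100):
    """Rank-1 ALS on observed entries given as (user, item, value) triples."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors v; each user factor has a closed-form ridge solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for i, j, r in ratings:
            num[i] += r * v[j]
            den[i] += v[j] * v[j]
        u = [n / d for n, d in zip(num, den)]
        # Fix user factors u; solve for each item factor the same way.
        num = [0.0] * n_items
        den = [reg] * n_items
        for i, j, r in ratings:
            num[j] += r * u[i]
            den[j] += u[i] * u[i]
        v = [n / d for n, d in zip(num, den)]
    return u, v


# Made-up example: the full matrix is [[1, 3], [2, 6]]; entry (1, 1) is hidden.
observed = [(0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.0)]
u, v = als_rank1(observed, n_users=2, n_items=2)
predicted = u[1] * v[1]  # estimate of the hidden rating
print(predicted)
```

With higher rank, each scalar division becomes a small linear solve per user or item, which is the part a distributed implementation has to parallelize.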
  • ALS - wall-clock time. Dataset: scaled version of Netflix data (9X in size). Cluster: 9 machines. MLlib is an order of magnitude faster than Mahout. MLlib is within a factor of 2 of GraphLab. Wall-clock time (seconds) by system: Matlab 15443; Mahout 4206; GraphLab 291; MLlib 481.
  • Implementation
  • Implementation of k-means. Initialization: random, k-means++, k-means||.
  • Implementation of k-means. Iterations: for each point, find its closest center, $l_i = \arg\min_j \|x_i - c_j\|_2^2$; then update the cluster centers, $c_j = \frac{\sum_{i:\, l_i = j} x_i}{\sum_{i:\, l_i = j} 1}$.
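The two iteration steps above (assign each point to its closest center, then recompute each center as the mean of its points) can be sketched in plain Python as follows. This is a single-machine illustration, not MLlib's code; the helper names, the handling of empty clusters, and the sample points are assumptions.

```python
def assign(points, centers):
    """l_i = argmin_j ||x_i - c_j||^2 for every point."""
    def sqdist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return [min(range(len(centers)), key=lambda j: sqdist(x, centers[j]))
            for x in points]


def update(points, labels, centers):
    """c_j = mean of the points currently assigned to cluster j."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for x, label in zip(points, labels):
        counts[label] += 1
        for t in range(dim):
            sums[label][t] += x[t]
    # An empty cluster keeps its old center (a choice made for this sketch).
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
            for j in range(len(centers))]


def kmeans(points, centers, iters=10):
    """Run a fixed number of assignment/update rounds."""
    for _ in range(iters):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    return centers, assign(points, centers)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]  # made-up data
centers, labels = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(centers, labels)
```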
  • Implementation of k-means. The points are usually sparse, but the centers are most likely to be dense. Computing the distance takes O(d) time, so the time complexity is O(n d k) per iteration, and sparsity brings no benefit to the running time. However, we have $\|x - c\|_2^2 = \|x\|_2^2 + \|c\|_2^2 - 2\langle x, c\rangle$. Computing the inner product only needs the non-zero elements. So we can cache the norms of the points and of the centers, and then only need the inner products to obtain the distances. This reduces the running time to O(nnz k + d k) per iteration. However, is it accurate?
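The norm-caching trick can be sketched as below, assuming a sparse point stored as a dict from index to value and a dense center stored as a list. The function names and the sample vectors are hypothetical; the direct O(d) computation is included only as a reference to compare against.

```python
def sparse_dot(x_sparse, c_dense):
    """Inner product that touches only the non-zeros of x."""
    return sum(value * c_dense[i] for i, value in x_sparse.items())


def fast_sqdist(x_sparse, x_norm2, c_dense, c_norm2):
    """||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>, using cached squared norms."""
    d = x_norm2 + c_norm2 - 2.0 * sparse_dot(x_sparse, c_dense)
    # Rounding can push the result slightly below zero when x is close to c.
    return max(d, 0.0)


def direct_sqdist(x_sparse, c_dense):
    """Reference computation that visits every coordinate, O(d)."""
    return sum((x_sparse.get(i, 0.0) - c) ** 2 for i, c in enumerate(c_dense))


x = {1: 3.0, 4: 4.0}                 # sparse point in 6 dimensions (made up)
c = [1.0, 0.0, 2.0, 0.0, 1.0, 0.0]   # dense center (made up)
x_norm2 = sum(v * v for v in x.values())  # cached once per point
c_norm2 = sum(v * v for v in c)           # cached once per center per iteration

fast = fast_sqdist(x, x_norm2, c, c_norm2)
direct = direct_sqdist(x, c)
print(fast, direct)
```

On the accuracy question: the expanded form subtracts quantities of similar magnitude, so it can lose precision through floating-point cancellation when a point lies very close to a center. One possible safeguard, stated here as an assumption rather than as what MLlib does, is to fall back to the direct computation when the fast result is small relative to the norms.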
  • Implementation of ALS: broadcast everything; data parallel; fully parallel.
  • Broadcast everything: Master loads (small) data file and initializes models. Master broadcasts data and initial models. At each iteration, updated models are broadcast again. Works OK for small data. Lots of communication overhead, so it doesn't scale well. Ships with Spark Examples. [Diagram: master and workers with ratings, movie factors, and user factors.]
  • Data parallel: Workers load data. Master broadcasts initial models. At each iteration, updated models are broadcast again. Much better scaling. Works on large datasets. Works well for smaller models (low K). [Diagram: master and workers with ratings, movie factors, and user factors.]
  • Fully parallel: Workers load data. Models are instantiated at workers. At each iteration, models are shared via join between workers. Much better scalability. Works on large datasets. [Diagram: master and workers with ratings, movie factors, and user factors.]
  • Implementation of ALS: broadcast everything; data parallel; fully parallel; block-wise parallel. Users/products are partitioned into blocks and join is based on blocks instead of individual user/product.
  • New features for v1.0: sparse data support; classification and regression tree (CART); tall-and-skinny SVD and PCA; L-BFGS; model evaluation.
  • MLlib v1.1? Model selection: training multiple models in parallel; separating problem/algorithm/parameters/model. Learning algorithms: Latent Dirichlet allocation