Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk by Yanbo Liang


Scaling Apache Spark MLlib to billions of parameters

Yanbo Liang (Apache Spark committer @ Hortonworks)

Joint work with Weichen Xu

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work


Big data + big models = better accuracy

Spark: a unified platform

“Hidden Technical Debt in Machine Learning Systems”, Google


Optimization

• L-BFGS
• SGD
• WLS

MLlib logistic regression

[Diagram: the training loop between the Driver/Controller and the Executors/Workers]

1. Driver: initialize coefficients.
2. Driver: broadcast coefficients to the executors.
3. Each executor: compute the loss and gradient for each instance, and sum them up locally.
4. Reduce from executors to get lossSum and gradientSum on the driver.
5. Driver: handle regularization and use L-BFGS/OWL-QN to find the next step.
6. Loop until convergence; the result is the final model coefficients.
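In code, one iteration of this loop is a broadcast followed by an aggregation. Below is a minimal sketch of the pattern, not MLlib's actual implementation: the Instance case class and the dense-array math are simplifications, and the loss formula is numerically naive.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Instance(label: Double, features: Array[Double])

// One iteration: broadcast the current coefficients, then sum the loss and
// gradient contributions of every instance, reducing the sums to the driver.
def lossAndGradient(
    sc: SparkContext,
    data: RDD[Instance],
    coefficients: Array[Double]): (Double, Array[Double]) = {
  val bcCoef = sc.broadcast(coefficients)   // ship coefficients to executors once
  val d = coefficients.length

  data.aggregate((0.0, new Array[Double](d)))(
    // seqOp: fold one instance into the local (lossSum, gradientSum)
    { case ((loss, grad), inst) =>
      val margin = (0 until d).map(i => inst.features(i) * bcCoef.value(i)).sum
      // For logistic loss the per-instance multiplier is sigmoid(margin) - label
      val multiplier = 1.0 / (1.0 + math.exp(-margin)) - inst.label
      var i = 0
      while (i < d) { grad(i) += multiplier * inst.features(i); i += 1 }
      (loss + math.log1p(math.exp(margin)) - inst.label * margin, grad)
    },
    // combOp: merge partial sums coming from different partitions
    { case ((l1, g1), (l2, g2)) =>
      var i = 0
      while (i < d) { g1(i) += g2(i); i += 1 }
      (l1 + l2, g1)
    })
}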

Bottleneck 1

[Same training-loop diagram, highlighting the reduce step] Every executor sends its full dense gradientSum (one entry per feature) to the driver in a single-level reduce, so as the feature dimension grows, aggregation at the driver becomes the bottleneck.

Solution 1

[Diagram: multi-level aggregation] Executors still compute the loss and gradient for each instance and sum them up locally, but instead of reducing everything to the driver in one step, groups of executors first reduce among themselves to produce partial lossSum and gradientSum values; only those pre-aggregated partials are then reduced to obtain the final lossSum and gradientSum.
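Spark exposes this multi-level pattern directly as RDD.treeAggregate: partial results are combined on executors across depth levels before a much smaller final reduce reaches the driver. A sketch of swapping it into the previous example, assuming the seqOp and combOp lambdas from that sketch have been bound to vals of matching types (the depth value is illustrative):

// seqOp and combOp are the two functions from the lossAndGradient sketch
// above, bound to vals; depth = 2 adds one intermediate combine level on
// the executors before anything is sent to the driver.
val (lossSum, gradientSum) =
  data.treeAggregate((0.0, new Array[Double](d)))(seqOp, combOp, depth = 2)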

Bottleneck 2

[Same training-loop diagram, highlighting the driver-side steps] The coefficient vector, and the L-BFGS history derived from it, are held on the single driver, and the full coefficient vector is broadcast to every executor on every iteration. With billions of parameters, it no longer fits on one machine.

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work

Distributed Vector

Logically the driver sees one vector; physically its values are split across the workers. For example, the 12-element vector

[2.5, 6.8, 3.1, 9.0, 0.7, 1.9, 2.6, 8.7, 6.2, 1.1, 5.0, 0.0]

is stored as four partitions of three elements each:

worker0: [2.5, 6.8, 3.1]
worker1: [9.0, 0.7, 1.9]
worker2: [2.6, 8.7, 6.2]
worker3: [1.1, 5.0, 0.0]

Distributed Vector

• Definition:

class DistributedVector(
    val values: RDD[Vector],
    val sizePerPart: Int,
    val numPartitions: Int,
    val size: Long)

• Linear algebra operations:
  – a * x + y
  – a * x + b * y + c * z + …
  – x.dot(y)
  – x.norm
  – …
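With that layout, the listed operations map directly onto the underlying RDD. A hedged sketch, using plain Array[Double] partitions instead of MLlib Vector for brevity, of how a * x + y and x.dot(y) could be implemented (the actual spark-vlbfgs code differs in detail):

import org.apache.spark.rdd.RDD

// Assumes x and y are identically partitioned (same sizePerPart and
// numPartitions), so RDD.zip pairs up the matching partitions.
def axpy(a: Double, x: RDD[Array[Double]], y: RDD[Array[Double]]): RDD[Array[Double]] =
  x.zip(y).map { case (xp, yp) =>
    Array.tabulate(xp.length)(i => a * xp(i) + yp(i))  // per-partition a*x + y
  }

def dot(x: RDD[Array[Double]], y: RDD[Array[Double]]): Double =
  x.zip(y).map { case (xp, yp) =>
    var s = 0.0
    var i = 0
    while (i < xp.length) { s += xp(i) * yp(i); i += 1 }
    s                                                  // partial dot product
  }.sum()                                              // reduce partials to a scalar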

L-BFGS

Coefficients: X(k). Each iteration:

1. Choose the descent direction (two-loop recursion).
2. X(k+1) = alpha * dir(k) + X(k).
3. Calculate G(k+1) and F(k+1).

[Diagram: the workers evaluate the loss F and gradient G; every other step runs on the driver]

Driver cost, for history size m and dimension d — Space: (2m+1)*d, Time: 6md.

Vector-free L-BFGS

Coefficients: X(k). The loop is the same:

1. Choose the descent direction (two-loop recursion).
2. X(k+1) = alpha * dir(k) + X(k).
3. Calculate G(k+1) and F(k+1).

[Diagram: all length-d vector operations now run on the workers, over DistributedVectors]

Driver cost drops from Space: (2m+1)*d, Time: 6md to Space: (2m+1)^2, Time: 8m^2 — the driver manipulates only the matrix of dot products among the 2m+1 history vectors, never the vectors themselves.
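The "vector-free" trick (from the large-scale L-BFGS paper in the references) is that the two-loop recursion only ever combines the 2m+1 basis vectors — the m position differences s_i, the m gradient differences y_i, and the current gradient — linearly, so it can be rewritten entirely over their (2m+1) x (2m+1) matrix of pairwise dot products. A sketch of building that matrix with the dot helper above; the real implementation batches these into far fewer Spark jobs:

// basis holds the 2m+1 distributed history vectors.
def dotProductMatrix(basis: Array[RDD[Array[Double]]]): Array[Array[Double]] = {
  val n = basis.length                    // n = 2m + 1
  val dots = Array.ofDim[Double](n, n)
  for (i <- 0 until n; j <- i until n) {
    dots(i)(j) = dot(basis(i), basis(j))  // one distributed dot product per pair
    dots(j)(i) = dots(i)(j)               // the matrix is symmetric
  }
  dots
}
// The two-loop recursion then runs on this small local matrix, yielding the
// descent direction as scalar coefficients over the same 2m+1 basis vectors.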

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work

[Diagram: one iteration of vector-free logistic regression]

The training data (numInstances x numFeatures) is partitioned into blocks along both rows (partition1, partition2, partition3) and columns, with the column blocks aligned to the coefficient partitions. The coefficients form a DistributedVector (coeff0, coeff1, coeff2).

1. Each block computes partial margins from its slice of the features and the matching coefficient partition (multiplier00 … multiplier22).
2. The partial margins are summed across column blocks for each instance and combined with the labels (label0, label1, label2) to yield one multiplier per instance (multiplier0, multiplier1, multiplier2).
3. The per-instance multipliers are sent back to every block, and each block computes its gradient contribution (grad00 … grad22).
4. The contributions are summed down each column block into the gradient DistributedVector (grad0, grad1, grad2).
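The per-instance math behind this picture is unchanged from the single-machine case; only the summations are blocked. A local (non-distributed) sketch of what each stage computes, with illustrative names:

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Stage 1: one block's partial margins, from its slice of the features and
// the matching coefficient partition (one value per instance row).
def partialMargins(block: Array[Array[Double]], coeffBlock: Array[Double]): Array[Double] =
  block.map(row => row.zip(coeffBlock).map { case (x, w) => x * w }.sum)

// Stage 2: partial margins summed across column blocks for each instance,
// combined with the labels into one multiplier per instance.
def multipliers(margins: Array[Double], labels: Array[Double]): Array[Double] =
  margins.zip(labels).map { case (m, y) => sigmoid(m) - y }

// Stage 3: a block's gradient contribution, summed over its instances; these
// are then summed down each column block into the gradient DistributedVector.
def blockGradient(block: Array[Array[Double]], mult: Array[Double]): Array[Double] =
  block.transpose.map(col => col.zip(mult).map { case (x, u) => x * u }.sum)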

Regularization and OWL-QN

• Regularization:
  – L2
  – L1
  – elastic net

• Implement vector-free OWL-QN for objectives with L1/elastic-net regularization.
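For reference, these options correspond to the standard elastic-net objective (the same parameterization MLlib uses, with regParam λ and elasticNetParam α):

f(w) = loss(w) + λ · [ α · ||w||₁ + (1 − α) / 2 · ||w||₂² ]

With α = 0 the penalty is pure L2 and the objective stays smooth, so plain VL-BFGS applies; any α > 0 introduces the non-smooth L1 term, which is why a vector-free OWL-QN is needed.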

Checkpoint

[Same vector-free L-BFGS loop diagram] Every iteration derives new DistributedVectors from the previous ones, so the RDD lineage grows with each pass; the vectors are periodically checkpointed to truncate it.
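In Spark terms this is ordinary RDD checkpointing applied to the vectors produced inside the loop. A minimal sketch, assuming the DistributedVector class above, an iteration counter, and an illustrative directory and interval:

sc.setCheckpointDir("hdfs:///tmp/vlbfgs-checkpoints")  // directory is illustrative

if (iteration % checkpointInterval == 0) {
  coefficients.values.checkpoint()  // mark the coefficient RDD for checkpointing
  coefficients.values.count()       // force a job so the checkpoint materializes,
                                    // truncating the lineage accumulated so far
}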

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work

Performance (WIP)

[Chart: time per epoch (y-axis, 0–120) for L-BFGS vs. VL-BFGS as the number of parameters grows through 1M, 10M, 50M, 100M, 250M, 500M, 750M, and 1B]

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work

APIs

val dataset: Dataset[_] = spark.read.format("libsvm").load("data/a9a")

val trainer = new VLogisticRegression()
  .setColsPerBlock(100)
  .setRowsPerBlock(10)
  .setColPartitions(3)
  .setRowPartitions(3)
  .setRegParam(0.5)

val model = trainer.fit(dataset)
println(s"Vector-free logistic regression coefficients: ${model.coefficients}")

Adaptive logistic regression

Number of features                        Optimization method
less than 4096                            WLS/IRLS
more than 4096, but less than 10 million  L-BFGS
more than 10 million                      VL-BFGS
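A hedged sketch of what that dispatch could look like — the thresholds come from the table above, but the method name is hypothetical, not MLlib API:

// Pick an optimizer from the feature dimensionality, per the table above.
def chooseSolver(numFeatures: Long): String =
  if (numFeatures < 4096) "wls"            // normal-equation solvers fit in memory
  else if (numFeatures < 10000000L) "l-bfgs"  // driver can still hold dense size-d vectors
  else "vl-bfgs"                           // the coefficients themselves must be distributed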

Whole picture of MLlib

[Diagram: layered architecture]

Application layer: Ads CTR, anti-fraud, data science, NLP, …
Machine learning algorithm layer: LinearRegression, LogisticRegression, MLP, …
Optimization layer: WLS/IRLS, L-BFGS, VL-BFGS, SGD, …
Spark core

APIs

The proposed integration exposes the same solver through the existing LogisticRegression estimator via setSolver:

val dataset: Dataset[_] = spark.read.format("libsvm").load("data/a9a")

val trainer = new LogisticRegression()
  .setColsPerBlock(100)
  .setRowsPerBlock(10)
  .setColPartitions(3)
  .setRowPartitions(3)
  .setRegParam(0.5)
  .setSolver("vl-bfgs")

val model = trainer.fit(dataset)
println(s"Vector-free logistic regression coefficients: ${model.coefficients}")

Outline

• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work

Future work

• Reduce the number of stages per iteration.
• Performance improvements.
• Implement LinearRegression, SoftmaxRegression, and MultilayerPerceptronClassifier based on vector-free L-BFGS/OWL-QN.
• Real-world use case: Ads CTR prediction with billions of parameters; we will share the experiences and lessons we learned: https://github.com/yanboliang/spark-vlbfgs

Key takeaways

• Fully distributed computation and a fully distributed model: training runs successfully without OOM.
• The VL-BFGS API is consistent with Breeze L-BFGS.
• VL-BFGS outputs exactly the same solution as Breeze L-BFGS.
• Does not require special components such as parameter servers.
• A pure library; it can be deployed and used easily.
• Benefits from existing Spark cluster operations and development experience.
• Lets you stay on the same platform as the data size increases.

Reference

• https://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce

• https://github.com/mengxr/spark-vl-bfgs
• https://spark-summit.org/2015/events/large-scale-lasso-and-elastic-net-regularized-generalized-linear-models/

Thank You.

Yanbo Liang