Machine learning at Scale with Apache Spark


Transcript of Machine learning at Scale with Apache Spark

Page 1: Machine learning at Scale with Apache Spark
Page 2: Machine learning at Scale with Apache Spark

Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions

Machine learning at scale with Apache Spark

Page 3: Machine learning at Scale with Apache Spark

Scaling computation

● Analytics tools with poor scalability and integration
● Manual processes
● Slow iterations
● Not suitable for large amounts of data

● We want fast iteration, reliability, integration

● Serial implementation
● Parallel
● GPUs
● Distributed

Page 4: Machine learning at Scale with Apache Spark

Scaling neural networks

Page 5: Machine learning at Scale with Apache Spark

Perceptron

● Basic building block of neural networks

a = f(Σ(x · w) + b)

where x are the inputs, w the weights, b the bias and f the activation function
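A minimal sketch of this formula in Scala (the names are illustrative, and sigmoid is one common choice for the activation f):

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// a = f(Σ(x · w) + b)
def perceptron(inputs: Seq[Double], weights: Seq[Double], bias: Double): Double =
  sigmoid(inputs.zip(weights).map { case (x, w) => x * w }.sum + bias)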

Page 6: Machine learning at Scale with Apache Spark

Artificial neural network

● Network training
  ○ Many "optimal" solutions
  ○ Optimization and training techniques: L-BFGS, backpropagation, batch and online gradient descent, Downpour SGD, Sandblaster L-BFGS, …
  ○ Vanishing gradient, amplifying parameters, ...
  ○ New methods for large networks: deep learning

Page 7: Machine learning at Scale with Apache Spark

XOR

[Diagram: a small feed-forward network trained to compute XOR, with its learned connection weights.]

Output 2.613296075440797E-4 for input Vector(0, 0)
Output 0.9989222606269823 for input Vector(0, 1)
Output 0.9995952194411893 for input Vector(1, 0)
Output 4.0074182099155245E-7 for input Vector(1, 1)
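The same behaviour can be reproduced with a tiny 2-2-1 network. A self-contained sketch — the weights below are illustrative hand-picked values, not the trained weights from the diagram:

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def neuron(in: Seq[Double], w: Seq[Double], b: Double): Double =
  sigmoid(in.zip(w).map { case (x, wi) => x * wi }.sum + b)

def xorNet(x1: Double, x2: Double): Double = {
  val h1 = neuron(Seq(x1, x2), Seq(20.0, 20.0), -10.0)   // ≈ OR
  val h2 = neuron(Seq(x1, x2), Seq(-20.0, -20.0), 30.0)  // ≈ NAND
  neuron(Seq(h1, h2), Seq(20.0, 20.0), -30.0)            // ≈ AND of the two
}

for ((a, b) <- Seq((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)))
  println(s"Output ${xorNet(a, b)} for input Vector($a, $b)")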

Page 8: Machine learning at Scale with Apache Spark

Scaling computation

● Different programming models, different languages, different levels

● Sequential
  ○ R, Matlab, Python, Scala

● Parallel
  ○ Theano, Torch, Caffe, TensorFlow, Deeplearning4j

[Chart: elapsed times for 20 PageRank iterations.]

[3, 4]

Page 9: Machine learning at Scale with Apache Spark

Machine learning

● Linear algebra: vectors, matrices, vector spaces, matrix transformations, eigenvectors/eigenvalues
● Many machine learning algorithms are optimization problems (see the update rule below)
● The goal is to solve them in reasonable (bounded) time
● The goal is not always to find the best possible model (data size and feature engineering vs. algorithm/model complexity)
● The goal is to solve them reliably, at scale, supporting application needs, and to improve

[5]
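To make the optimization view concrete: most of the training methods in this talk are variants of gradient descent, which repeatedly updates the parameters w against the gradient of a loss L with learning rate η (a standard formulation, not specific to these slides):

w(t+1) = w(t) − η · ∇L(w(t))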

Page 10: Machine learning at Scale with Apache Spark

Distributed environment

● Asynchronous and unreliable

● CAP theorem
  ○ Consistency
  ○ Availability
  ○ Partition tolerance

Page 11: Machine learning at Scale with Apache Spark

Consistency, time and order in distributed systems

● A sequential program has a single total order of operations

● No order guarantees in a distributed system
● At-most-once delivery: messages may be lost
● At-least-once delivery: messages may be duplicated, but not lost
● Exactly-once delivery (see the sketch below for one common approximation)
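A common way to approximate exactly-once semantics on top of at-least-once delivery is idempotent handling with de-duplication. A minimal single-node sketch — the names are illustrative, and a real system must persist the seen ids atomically with the message's effects:

final case class Msg(id: Long, payload: String)

class DedupHandler(process: Msg => Unit) {
  private val seen = scala.collection.mutable.Set.empty[Long]
  def handle(m: Msg): Unit =
    if (seen.add(m.id)) process(m)  // redeliveries of an already-seen id are dropped
}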

Page 12: Machine learning at Scale with Apache Spark

Failure in distributed system

● Node failures, network partitions, message loss, split brain, inconsistencies

● Microsoft's data centers: average failure rate of 5.2 devices per day and 40.8 links per day, with a median time to repair of approximately five minutes (and a maximum of one week)

● A new Google cluster over one year: five rack issues (40-80 machines seeing 50 percent packet loss), eight network maintenance events (four of which might cause ~30-minute random connectivity losses), and three router failures (requiring traffic to be pulled immediately for an hour)

● CENIC: 500 isolating network partitions, with medians of 2.7 minutes and 32 minutes and 95th percentiles of 19.9 minutes and 3.7 days for software and hardware problems, respectively

[6]

Page 13: Machine learning at Scale with Apache Spark

Failure in distributed system

● A network partition separated a MongoDB primary from its two secondaries. Two hours later the old primary rejoined and rolled back everything written on the new primary

● A network partition isolated the Redis primary from all secondaries. Every API call caused the billing system to recharge customer credit cards automatically, resulting in 1.1 percent of customers being overbilled over a period of 40 minutes

● A partition caused inconsistency in GitHub's MySQL database. Because foreign key relationships were not consistent, GitHub showed private repositories on the wrong users' dashboards and incorrectly routed some newly created repositories

● For several seconds, Elasticsearch is happy to believe two nodes in the same cluster are both primaries, will accept writes on both of those nodes, and later discard the writes to one side

● RabbitMQ lost ~35% of acknowledged writes under those conditions
● Redis threw away 56% of the writes it told us succeeded
● In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency settings
● MongoDB "strictly consistent" reads see stale versions of documents, but they can also return garbage data from writes that never should have occurred

[6]

Page 14: Machine learning at Scale with Apache Spark

Algorithm parallelization

[Diagram: one dataset split across several parallel computations (see the sketch below).]
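A minimal sketch of the pattern in the diagram — one dataset split into chunks, each processed independently, partial results combined. This uses plain Scala parallel collections rather than Spark (.par is built in up to Scala 2.12; 2.13 needs the scala-parallel-collections module):

val data    = (1 to 1000000).toVector
val f       = (x: Int) => x.toLong * x             // example per-record work
val chunks  = data.grouped(data.size / 4).toSeq    // split the data
val partial = chunks.par.map(c => c.map(f).sum)    // independent computations
val result  = partial.sum                          // combine the partial results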

Page 15: Machine learning at Scale with Apache Spark

Algorithm parallelization

[7]

Page 16: Machine learning at Scale with Apache Spark

Neural network parallelism

[8]

Page 17: Machine learning at Scale with Apache Spark

import tensorflow as tf

# Weights initialised from a small random normal distribution.
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

# A two-layer network: sigmoid hidden layer, linear output logits.
def model(X, w_h, w_o):
    h = tf.nn.sigmoid(tf.matmul(X, w_h))
    return tf.matmul(h, w_o)

X = tf.placeholder("float", [None, 784])   # MNIST images, flattened
Y = tf.placeholder("float", [None, 10])    # one-hot labels

w_h = init_weights([784, 625])
w_o = init_weights([625, 10])

py_x = model(X, w_h, w_o)

cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)
predict_op = tf.argmax(py_x, 1)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

sess.run(train_op, …)
sess.run(predict_op, …)

[9, 10]

Page 18: Machine learning at Scale with Apache Spark

Model parallelism

[11]

[Diagram: a single large model partitioned across Machine 1-4.]

Page 19: Machine learning at Scale with Apache Spark

Data parallelism

[11]

[Diagram: multiple model replicas, each training on its own shard of the data.]

Page 20: Machine learning at Scale with Apache Spark

Parameter server

● Model and data parallelism
● Failures and slow machines
● Additional stochasticity due to asynchrony (relaxed consistency, out-of-date parameters, ordering not guaranteed, …) — see the sketch below

[11]
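A minimal sketch of the downpour-style asynchronous updates described above: workers fetch possibly stale parameters, compute a gradient on their own shard and push it back, with no coordination between workers. All names are illustrative; a real parameter server shards and replicates this state:

class ParameterServer(init: Vector[Double]) {
  private var params = init
  def fetch(): Vector[Double] = synchronized(params)
  def push(grad: Vector[Double], lr: Double): Unit = synchronized {
    params = params.zip(grad).map { case (w, g) => w - lr * g }
  }
}

// Each worker loops over its data shard independently of the others.
def worker(ps: ParameterServer, shard: Seq[Vector[Double]], lr: Double,
           gradient: (Vector[Double], Vector[Double]) => Vector[Double]): Unit =
  shard.foreach { example =>
    val w = ps.fetch()                 // may already be out of date when the push lands
    ps.push(gradient(w, example), lr)  // this asynchrony adds stochasticity to training
  }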

Page 21: Machine learning at Scale with Apache Spark

Examples

“Their network for face detection from YouTube videos comprised millions of neurons and 1 billion connection weights. They trained it on a dataset of 10 million 200x200 pixel RGB images to learn 20,000 object categories. The training simulation ran for three days on a cluster of 1,000 servers totaling 16,000 CPU cores. Each instantiation of the network spanned 170 servers”

Google.

“We demonstrate near-perfect weak scaling on a 16 rack IBM Blue Gene/Q (262144 CPUs, 256 TB memory), achieving an unprecedented scale of 256 million neurosynaptic cores containing 65 billion neurons and 16 trillion synapses”

TrueNorth, part of the IBM SyNAPSE project. [11, 12]

Page 22: Machine learning at Scale with Apache Spark

Examples

[13]

Page 23: Machine learning at Scale with Apache Spark

Architecture

Page 24: Machine learning at Scale with Apache Spark

[Diagram: two data sources → preprocessing → features → training and testing → error %]

Page 25: Machine learning at Scale with Apache Spark

Data processing pipeline

● Whole lifecycle of data

● Data processing
● Data stores
● Integration
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning

Page 26: Machine learning at Scale with Apache Spark

[14]

[Diagrams: common data platform architectures —

CQRS: clients send Commands and Queries to separate databases, with a denormalise/precompute step between the write and read sides.

Batch pipeline: all your data lands in HDFS via Flume and Sqoop, is processed with Hive and Impala, and is orchestrated by Oozie.

Kappa architecture: all your data flows through Kafka into a stream processor (e.g. Spark), which maintains views in SQL/NoSQL stores that clients query.

Lambda architecture: a batch layer and a fast stream layer both feed a serving layer/DB that answers queries.]

Page 27: Machine learning at Scale with Apache Spark

[15, 16]

Page 28: Machine learning at Scale with Apache Spark

Apache Spark

Page 29: Machine learning at Scale with Apache Spark

Apache Spark

● In-memory dataflow distributed data processing framework for streaming and batch

● Distributes computation using a higher-level API
● Load balancing
● Moves computation to data
● Fault tolerant

Page 30: Machine learning at Scale with Apache Spark

Spark distributed programming model

● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations (see the sketch after this list)
  ○ Lazy, form the DAG
  ○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
  ○ Execute the DAG, retrieve the result
  ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators, broadcast variables
● SQL
● Integration
● Streaming
● Machine learning
● Graph processing
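A minimal sketch of the lazy/eager split, assuming an existing SparkContext named sc:

val numbers = sc.parallelize(1 to 1000000)  // RDD[Int]
val evens   = numbers.filter(_ % 2 == 0)    // transformation: only extends the DAG
val doubled = evens.map(_ * 2)              // still nothing has executed
val total   = doubled.reduce(_ + _)         // action: executes the DAG and returns a result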

Page 31: Machine learning at Scale with Apache Spark

Distributed computation

● Spark streaming
● Computing, processing, transforming, analytics

[17]

Page 32: Machine learning at Scale with Apache Spark

[Diagram: RDD lineage — textFile → map → map → reduceByKey → collect]

sc.textFile("counts")
  .map(line => line.split("\t"))
  .map(word => (word(0), word(1).toInt))
  .reduceByKey(_ + _)
  .collect()

[18]


Page 33: Machine learning at Scale with Apache Spark

Graph lineage

● Master and worker failures

val data2a = data2
  .map(x => x.label -> x.features)

val dataa = data
  .map(x => x.label -> x.features)
  .union(data2a)
  .cache()

val data3a = data3
  .map(x => x.label -> x.features)

val datab = dataa
  .join(data3a, 4)
  .cache()
  .mapPartitions(it => it.map(x => x._1 + 1 -> x._2))
  .groupByKey(4)
  .reduceByKey((it1, it2) => it1 ++ it2)
  .collect()

Page 34: Machine learning at Scale with Apache Spark

Optimizations

● Multiple phases
● Catalyst

[19]

Page 35: Machine learning at Scale with Apache Spark

Optimizations

[20]

[Diagram: a Spark master and Spark workers co-located with Cassandra.]

Page 36: Machine learning at Scale with Apache Spark

Optimizations

● CPU and memory bottlenecks, not IO
● Project Tungsten
  ○ Explicit memory management and binary processing
  ○ Cache-aware computation
  ○ Code generation
● Daytona Gray Sort 100TB benchmark won by Apache Spark
  ○ Optimized memory layout, shuffle algorithm, ...

[20]

Page 37: Machine learning at Scale with Apache Spark

MLlib

● Data types
● Basic statistics (see the sketch after this list)
  ○ summary statistics, correlations, stratified sampling, hypothesis testing, streaming significance testing, random data generation
● Classification and regression
  ○ SVMs, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees), isotonic regression, multilayer perceptron classifier, one-vs-rest classifier, survival regression
● Collaborative filtering
  ○ alternating least squares (ALS)
● Clustering
  ○ k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), bisecting k-means, streaming k-means
● Dimensionality reduction
  ○ singular value decomposition (SVD), principal component analysis (PCA)
● Feature extraction and transformation
  ○ TF-IDF, word2vec, normalizers, scaling
● Frequent pattern mining
  ○ FP-growth, association rules, PrefixSpan
● Evaluation metrics
● PMML model export
● Optimization (developer)
  ○ stochastic gradient descent, limited-memory BFGS (L-BFGS)
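As one concrete example from the list above, a minimal basic-statistics sketch (assuming an existing SparkContext sc; the data values are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

val summary = Statistics.colStats(observations)  // MultivariateStatisticalSummary
println(summary.mean)      // column means
println(summary.variance)  // column variances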

Page 38: Machine learning at Scale with Apache Spark

Example application

Page 39: Machine learning at Scale with Apache Spark

Muvr

[21]

Page 40: Machine learning at Scale with Apache Spark
Page 41: Machine learning at Scale with Apache Spark
Page 42: Machine learning at Scale with Apache Spark
Page 43: Machine learning at Scale with Apache Spark
Page 44: Machine learning at Scale with Apache Spark

7 × Dumbbell Alternating Bicep Curl

Page 45: Machine learning at Scale with Apache Spark

Muvr architecture

Page 46: Machine learning at Scale with Apache Spark

Reactive

● Responsive
● Resilient
● Elastic
● Message driven

Page 47: Machine learning at Scale with Apache Spark

Muvr

● Classify finished (and in-progress) exercises
● Gather data for improved classification
● Predict next exercises
● Predict weights, intensity
● Design a schedule of exercises and improvements (personal trainer)
● Monitor exercise quality

Page 48: Machine learning at Scale with Apache Spark

Scaling model training

val sc = new SparkContext("local[4]", "NN")
val data = ...

// Network layout: input layer, two hidden layers, output layer.
val layers = Array[Int](inputSize, 250, 50, outputSize)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
val result = model.transform(data)
result.select("prediction").collect().foreach(println)

val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictionAndLabels))

Page 49: Machine learning at Scale with Apache Spark

Scaling model training

● Deeplearning4j, Neon, TensorFlow on Spark

[Diagram: Model 1, Model 2 and Model 3 trained in parallel; the best model is selected (see the sketch below).]
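A minimal sketch of the pattern in the diagram, assuming a SparkContext sc and a hypothetical trainAndScore(lr) that trains one candidate model and returns its validation score:

val hyperparams = sc.parallelize(Seq(0.1, 0.01, 0.001))
val best = hyperparams
  .map(lr => lr -> trainAndScore(lr))            // hypothetical: train one candidate per value
  .reduce((a, b) => if (a._2 >= b._2) a else b)  // keep the best-scoring (lr, score) pair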

Page 50: Machine learning at Scale with Apache Spark

# Build a small convolutional network with neon: two conv/pool stages,
# a fully connected layer, dropout, and a logistic output layer.
init_norm = Uniform(low=-0.1, high=0.1)
bias_init = Constant(val=1.0)

layers = []
layers.append(Conv(
    fshape=(1, 3, 16),
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Pooling(
    op="max",
    fshape=(2, 1),
    strides=2))
layers.append(Conv(
    fshape=(1, 3, 32),
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Pooling(
    op="max",
    fshape=(2, 1),
    strides=2))
layers.append(Affine(
    nout=100,
    init=init_norm,
    bias=bias_init,
    activation=Rectlin()))
layers.append(Dropout(
    name="do_2",
    keep=0.9))
layers.append(Affine(
    nout=dataset.num_labels,
    init=init_norm,
    bias=bias_init,
    activation=Logistic()))

return Model(layers=layers)

Page 51: Machine learning at Scale with Apache Spark

# Choose the compute backend (GPU variant commented out below).
backend = gen_backend(
    backend='cpu',
    batch_size=self.batch_size,
    rng_seed=self.random_seed,
    stochastic_round=False)
# backend = gen_backend(rng_seed=0, gpu='cudanet')

cost = GeneralizedCost(
    name='cost',
    costfunc=CrossEntropyMulti())

optimizer = GradientDescentMomentum(
    learning_rate=self.lrate,
    momentum_coef=0.9)

model.fit(
    dataset.train(),
    optimizer=optimizer,
    num_epochs=self.max_epochs,
    cost=cost,
    callbacks=callbacks)

Page 52: Machine learning at Scale with Apache Spark

sc \
    .cassandraTable(conf["cassandra"]["data_keyspace"], conf["cassandra"]["data_table"]) \
    .select("user_id", "model_id", "file_name", "time", "x", "y", "z", "exercise") \
    .spanBy("user_id", "model_id") \
    .map(train_model_for_user) \
    .saveToCassandra(conf["cassandra"]["model_keyspace"], conf["cassandra"]["model_table"])

Page 53: Machine learning at Scale with Apache Spark

[22]

Page 54: Machine learning at Scale with Apache Spark

val events = sc.eventTable().cache().toDF()
val lr = new LinearRegression()

val pipeline = new Pipeline().setStages(Array(
  new UserFilter(), new ZScoreNormalizer(),
  new IntensityFeatureExtractor(), lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept, Array(true, false))
  .build()

getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    val trainValidationSplit = new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)

    val model = trainValidationSplit.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))

    val testData = ... // Prepare test data.
    val predictions = model.transform(testData)

    submitResult(user, predictions, config)
  }

Page 55: Machine learning at Scale with Apache Spark

Queries and analytics

val exerciseDeviations = sc.eventTable().cache()
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(_.deviations)

val deviationsFrequency = sqlContext.sql(
  """SELECT planned.exercise, hour(time), COUNT(1)
     FROM exerciseDeviations
     WHERE planned.exercise = 'bench press'
     GROUP BY planned.exercise, hour(time)""")

val deviationsFrequency2 = exerciseDeviationsDF
  .where(exerciseDeviationsDF("planned.exercise")
    === "bench press")
  .groupBy(
    exerciseDeviationsDF("planned.exercise"),
    exerciseDeviationsDF("time"))
  .count()

val deviationsFrequency3 = exerciseDeviations
  .filter(_.planned.exercise == "bench press")
  .groupBy(d => (d.planned.exercise, d.time.getHours))
  .map(d => (d._1, d._2.size))

Page 56: Machine learning at Scale with Apache Spark

Clustering

def toVector(user: User): mllib.linalg.Vector =
  Vectors.dense(
    user.frequency,
    user.performanceIndex,
    user.improvementIndex)

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val users: RDD[User] = events.filterClass[User]

val kmeans = new KMeans()
  .setK(5)
  .set...

val clusters = kmeans.run(users.map(toVector))

Page 57: Machine learning at Scale with Apache Spark

Recommendations

val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()

val ratings = events
  .filterClass[EntireResistanceExerciseSession]
  .flatMap(session =>
    session.sets.flatMap(set =>
      set.sets.map(
        exercise => (session.id.id, exercise.exercise))))
  .groupBy(e => e)
  .map(g =>
    Rating(normalize(g._1._1), normalize(g._1._2),
      normalize(g._2.size)))

val model = new ALS().run(ratings)
val predictions = model.predict(recommend)

[Table: user × exercise rating matrix over bench press, bicep curl and dead lift —
user 1: 5, 2 · user 2: 4, 3 · user 3: 5, 2 · user 4: 3, 1]

Page 58: Machine learning at Scale with Apache Spark

Graph analysis

val events: RDD[(JournalKey, Any)] =

sc.eventTable().cache()

val connections = events.filterClass[Connections]

val vertices: RDD[(VertexId, Long)] =

connections.map(c => (c.id, 1l))

val edges: RDD[Edge[Long]] = connections

.flatMap(c => c.connections

.map(Edge(c.id, _, 1l)))

val graph = Graph(vertices, edges)

val ranks = graph.pageRank(0.0001).vertices

Page 59: Machine learning at Scale with Apache Spark

Conclusions

● Scaling systems, data pipelines and machine learning

● Reactive
  ○ Elasticity
  ○ Resilience
  ○ Responsiveness
  ○ Message driven

Page 60: Machine learning at Scale with Apache Spark

Questions

Page 61: Machine learning at Scale with Apache Spark

Thank you

● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin

Page 62: Machine learning at Scale with Apache Spark

References

[1] http://arxiv.org/abs/1112.6209

[2] SuperComputing 2012; part of the IBM SyNAPSE project

[3] http://www.csie.ntu.edu.tw/~cjlin/talks/twdatasci_cjlin.pdf

[4] http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/

[5] https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html

[6] https://queue.acm.org/detail.cfm?id=2655736

[7] http://fa.bianp.net/blog/2013/isotonic-regression/

[8] http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5

[9] https://github.com/nlintz/TensorFlow-Tutorials/blob/master/3_net.py

[10] https://www.tensorflow.org/

[11] http://static.googleusercontent.com/media/research.google.com/en/us/archive/large_deep_networks_nips2012.pdf

[12] https://www.quora.com/How-big-is-the-largest-feedforward-neural-network-ever-trained-and-what-for

[13] http://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf

[14] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

[15] http://malteschwarzkopf.de/research/assets/google-stack.pdf

[16] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf

[17] https://twitter.com/tsantero/status/695013012525060097

[18] http://www.slideshare.net/LisaHua/spark-overview-37479609

[19] https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

[20] https://kayousterhout.github.io/trace-analysis/

[21] https://github.com/muvr

[22] https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html

Page 63: Machine learning at Scale with Apache Spark

Twitter: @cakesolutions
Tel: 0845 617 1200

Email: [email protected]