A Scalable Implementation of Deep Learning on Spark

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Alexander Ulanov (1)

Joint work with Xiangrui Meng (2), Bert Greevenbosch (3)

With help from Guoqiang Li (4), Andrey Simanovsky (1)

(1) Hewlett-Packard Labs, (2) Databricks, (3) Huawei & Jules Energy, (4) Spark community


Outline

• Artificial neural network basics

• Implementation of Multilayer Perceptron (MLP) in Spark

• Optimization & parallelization

• Experiments

• Future work


Artificial neural network

Basics

• Statistical model that approximates a function of multiple inputs

• Consists of interconnected “neurons” which exchange messages

– A “neuron” produces an output by applying a transformation function to its inputs

• A network with more than 3 layers of neurons is called “deep” and is an instance of deep learning

Layer types & learning

• A layer type is defined by a transformation function

– Affine: $y_j = \sum_i w_{ij} x_i + b_j$, Sigmoid: $y_i = \left(1 + e^{-x_i}\right)^{-1}$, Convolution, Softmax, etc.

• Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers

• Model parameters – weights that “neurons” use for transformations

• Parameters are iteratively estimated with the backpropagation algorithm
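The layer definitions above can be illustrated with a minimal pure-Python sketch of a forward pass through Affine and Sigmoid pairs. It is not the Spark implementation: the function names and the toy weights are hypothetical, and the weight layout (`weights[i][j]` links input `i` to output `j`) is an assumption chosen to match the affine formula.

```python
import math

def affine(weights, bias, x):
    """y_j = sum_i w_ij * x_i + b_j, with weights[i][j] linking input i to output j."""
    return [
        sum(weights[i][j] * x[i] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def sigmoid(x):
    """Element-wise y_i = 1 / (1 + exp(-x_i))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(layers, x):
    """layers: list of (weights, bias) pairs; each pair is Affine followed by Sigmoid."""
    for weights, bias in layers:
        x = sigmoid(affine(weights, bias, x))
    return x

# Toy 2-3-1 network with hand-picked (hypothetical) weights.
layers = [
    ([[0.5, -0.5, 0.0], [0.25, 0.25, 1.0]], [0.0, 0.1, -0.1]),
    ([[1.0], [-1.0], [0.5]], [0.0]),
]
output = mlp_forward(layers, [1.0, 2.0])
print(output)
```

Because every pair ends in a sigmoid, each output value lies strictly between 0 and 1.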

Multilayer perceptron

• Speech recognition (phoneme classification), computer vision

• Introduced in Spark 1.5.0

[Figure: MLP diagram with input $x$, a hidden layer, and output $y$]


Example of MLP in Spark

Handwritten digits recognition

• Dataset MNIST [LeCun et al. 1998]

• 28x28 greyscale images of handwritten digits 0-9

• MLP with 784 inputs, 10 outputs and two hidden layers of

300 and 100 neurons

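Scala:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    import org.apache.spark.sql.DataFrame

    // Load MNIST in LIBSVM format (columns "label" and "features")
    val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
    // 784 inputs, hidden layers of 300 and 100 neurons, 10 outputs
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(784, 300, 100, 10))
      .setBlockSize(128)
    val model = mlp.fit(digits)

[Figure: 784 inputs, 1st hidden layer (300 neurons), 2nd hidden layer (100 neurons), output layer (10 neurons)]

Python:

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # Load MNIST in LIBSVM format (columns "label" and "features")
    digits = sqlContext.read.format("libsvm").load("/data/mnist")
    mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
    model = mlp.fit(digits)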


Pipeline with PCA+MLP in Spark

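Scala:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.sql.DataFrame

    val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
    // Reduce the 784 input features to 20 principal components
    val pca = new PCA()
      .setInputCol("features")
      .setK(20)
      .setOutputCol("features20")
    // Train the MLP on the reduced features
    val mlp = new MultilayerPerceptronClassifier()
      .setFeaturesCol("features20")
      .setLayers(Array(20, 50, 10))
      .setBlockSize(128)
    val pipeline = new Pipeline()
      .setStages(Array(pca, mlp))
    val model = pipeline.fit(digits)

Python:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import MultilayerPerceptronClassifier
    from pyspark.ml.feature import PCA

    digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
    # Reduce the 784 input features to 20 principal components
    pca = PCA(inputCol="features", k=20, outputCol="features20")
    # Train the MLP on the reduced features
    mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                         blockSize=128)
    pipeline = Pipeline(stages=[pca, mlp])
    model = pipeline.fit(digits)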


MLP implementation in Spark

Requirements

• Conform to Spark APIs

• Extensible interface (deep learning API)

• Efficient and scalable (single node & cluster)

Why conform to Spark APIs?

• Spark can call any Java, Python or Scala library, not necessarily designed for Spark

– This results in expensive data movement from the Spark RDD to the library

– It prevents the library from being used in Spark ML Pipelines

Extensible interface

• Our implementation processes each layer as a black box, with backpropagation in general form

– Allows further introduction of new layers and features

• CNN, Autoencoder and RBM are currently under development by the community
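The "layer as a black box" idea can be sketched as follows. This is an illustration in pure Python, not the internal Spark API: the class names `Layer`, `Sigmoid` and `Scale` and the function `backprop` are hypothetical. The point is that generic backpropagation only calls `forward` and `backward`, so a new layer type can be added without changing the training code.

```python
import math

class Layer:
    """Hypothetical black-box layer: the trainer only needs these two methods."""
    def forward(self, x):
        raise NotImplementedError

    def backward(self, x, grad_out):
        """Return (gradient w.r.t. input, list of gradients w.r.t. parameters)."""
        raise NotImplementedError

class Sigmoid(Layer):
    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]

    def backward(self, x, grad_out):
        y = self.forward(x)
        return [g * v * (1.0 - v) for g, v in zip(grad_out, y)], []

class Scale(Layer):
    """Toy parametric layer y = a * x, added without touching backprop()."""
    def __init__(self, a):
        self.a = a

    def forward(self, x):
        return [self.a * v for v in x]

    def backward(self, x, grad_out):
        grad_in = [self.a * g for g in grad_out]
        grad_a = sum(g * v for g, v in zip(grad_out, x))
        return grad_in, [grad_a]

def backprop(layers, x, grad_loss):
    """Generic backpropagation: forward through all layers, then backward in reverse."""
    inputs = []
    for layer in layers:
        inputs.append(x)
        x = layer.forward(x)
    grad, param_grads = grad_loss(x), []
    for layer, layer_in in zip(reversed(layers), reversed(inputs)):
        grad, grads = layer.backward(layer_in, grad)
        param_grads.insert(0, grads)
    return x, grad, param_grads

# Loss = sum of outputs, so d(loss)/d(output) is a vector of ones.
out, grad_x, grads = backprop(
    [Scale(2.0), Sigmoid()], [0.0, 1.0], lambda y: [1.0] * len(y)
)
```

Here `grads[0]` holds the gradient for the `Scale` parameter and `grads[1]` is empty because `Sigmoid` has no parameters.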


Efficiency

Batch processing

• A layer’s affine transformation can be written in vector form: $\mathbf{y} = W^T \mathbf{x} + \mathbf{b}$

– $\mathbf{y}$ – output from the layer, vector of size $n$

– $W$ – the matrix of layer weights, $m \times n$; $\mathbf{b}$ – bias, vector of size $n$

– $\mathbf{x}$ – input to the layer, vector of size $m$

• Vector-matrix multiplications are not as efficient as matrix-matrix multiplications

– Stack $s$ input vectors (into a batch) to perform matrix multiplication: $Y = W^T X + B$

– $X$ is $m \times s$, $Y$ is $n \times s$

– $B$ is $n \times s$, each column contains a copy of $\mathbf{b}$

• We implemented batch processing in matrix form

– Enabled the use of optimized native BLAS libraries

– Memory is reused to limit GC overhead
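The stacking of input vectors into a batch can be checked with a small pure-Python sketch. It only demonstrates that the batched form gives the same numbers as processing one vector at a time; it does not call BLAS and says nothing about speed. The function names and the toy values are hypothetical.

```python
def matmul_t(w, x):
    """Return W^T X for W of shape m x n and X of shape m x s (lists of rows)."""
    m, n, s = len(w), len(w[0]), len(x[0])
    return [
        [sum(w[i][j] * x[i][col] for i in range(m)) for col in range(s)]
        for j in range(n)
    ]

def affine_batch(w, b, x):
    """Y = W^T X + B, where every column of B is a copy of the bias b (size n)."""
    y = matmul_t(w, x)
    return [[y[j][col] + b[j] for col in range(len(y[j]))] for j in range(len(b))]

# m = 3 inputs, n = 2 outputs, s = 4 stacked input vectors (arbitrary values).
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m x n
b = [0.5, -0.5]                           # size n
vectors = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [-1.0, 0.0, 1.0]]

# Stack the s vectors as the columns of X (m x s).
x = [[v[i] for v in vectors] for i in range(3)]
y_batch = affine_batch(w, b, x)  # n x s

# Same computation, one vector at a time (s separate m x 1 inputs).
y_single = [affine_batch(w, b, [[vi] for vi in v]) for v in vectors]
```

Column `c` of `y_batch` matches `y_single[c]`, so a single matrix-matrix product replaces `s` vector-matrix products.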

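[Figure: block diagrams of the vector form $\mathbf{y} = W^T \mathbf{x} + \mathbf{b}$ and the batch form $Y = W^T X + B$]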


[Figure: dgemm performance. Log-scale vertical axis from 1.00E-04 to 1.00E+04. Matrix sizes on the horizontal axis: (1x1)*(1x1), (10x10)*(10x1), (10x10)*(10x10), (100x100)*(100x1), (100x100)*(100x10), (100x100)*(100x100), (1000x1000)*(1000x100), (1000x1000)*(1000x1000), (10000x10000)*(10000x1000), (10000x10000)*(10000x10000). Series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas]

Single node BLAS

BLAS in Spark

• BLAS – Basic Linear Algebra Subprograms

• Hardware-optimized native implementations in C & Fortran

– CPU: MKL, OpenBLAS etc.

– GPU: NVBLAS (F-BLAS interface to CUDA)

• Used in Spark through netlib-java

Experiments

• Huge benefit from native BLAS vs pure Java f2jblas

• GPU is faster (2x) only for large matrices

– When compute time is larger than the time to copy to/from the GPU

• More details:

– https://github.com/avulanov/scala-blas

– “linalg: Matrix Computations in Apache Spark” Reza et al., 2015

CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM

GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores

[Plot axes: seconds (vertical) vs. matrix sizes (horizontal)]


Scalability

Parallelization

• In each iteration $k$, each node $i$:

– 1. Gets parameters $w^k$ from the master

– 2. Computes a gradient $\nabla_i^k F(data_i)$

– 3. Sends the gradient to the master

– 4. The master computes $w^{k+1}$ based on the gradients

• Gradient type

– Batch – process all data on each iteration

– Stochastic – random point

– Mini-batch – random batch

• How many workers to use?

– Fewer workers – less compute power

– More workers – more communication
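The four steps of one iteration can be simulated on a single machine. The sketch below is an assumption-laden illustration, not Spark code: executors are plain Python lists, the model is a linear least-squares fit, the function names are hypothetical, and the master divides the summed gradient by the number of data points before taking a fixed step.

```python
def local_gradient(w, partition):
    """Step 2: gradient of the summed squared error over one executor's partition."""
    grad = [0.0] * len(w)
    for features, label in partition:
        err = sum(wi * xi for wi, xi in zip(w, features)) - label
        for i, xi in enumerate(features):
            grad[i] += 2.0 * err * xi
    return grad

def train(partitions, w, step, iterations):
    n_points = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Steps 1-3: every executor receives w and returns its local gradient.
        grads = [local_gradient(w, p) for p in partitions]
        # Step 4: the master sums the gradients and updates the parameters.
        total = [sum(g[i] for g in grads) for i in range(len(w))]
        w = [wi - step * ti / n_points for wi, ti in zip(w, total)]
    return w

# Toy data generated from y = 2*x1 - 1*x2, split across two "executors".
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
partitions = [data[:2], data[2:]]
w = train(partitions, [0.0, 0.0], step=0.1, iterations=500)
print(w)
```

Summing the per-partition gradients gives the same batch gradient as a single pass over all data, which is why the result does not depend on how the data is partitioned.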

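[Figure: master and executors 1..N holding partitions 1..P. Step 1: the master sends $w^k$ to the executors. Step 2: executor $i$ computes $\nabla_i^k F(data_i)$. Step 3: the executors send their gradients to the master, which computes $w^{k+1}$ from the gradients $\nabla_i^k F$. Step 4: go to step 1]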


Communication and computation trade-off

Parallelization of batch gradient

• There are $d$ data points, $f$ features and $k$ classes

– Assume we want to train logistic regression; it has $fk$ parameters

• Communication: $n$ workers send and receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:

– $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$

• Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ of the data, where each data point needs $fk$ operations

– $t_{cp} \sim \frac{d}{n} \cdot \frac{fk}{p}$

• What is the optimal number of workers?

– $\min_n \left(t_{cm} + t_{cp}\right) \Rightarrow n = \max\left(\frac{dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)}, 1\right)$

– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)}, 1\right)$, if $w$ is the number of model parameters and of floating point operations per data point
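The step from the two cost terms to the optimal number of workers is not spelled out on the slide. One way to reconstruct it, writing $w = fk$ and treating $n$ as a continuous variable (an assumption made only for the derivation), is to set the derivative of the total time to zero:

```latex
\begin{aligned}
t(n) &= t_{cm} + t_{cp}
      = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d\,w}{n\,p},\\
\frac{dt}{dn} &= \frac{2}{n \ln 2}\left(\frac{64w}{b} + c\right)
      - \frac{d\,w}{n^{2}\,p} = 0\\
\Rightarrow\quad n &= \frac{d\,w \ln 2}{p\left(\frac{128w}{b} + 2c\right)}.
\end{aligned}
```

The outer $\max(\cdot, 1)$ then simply states that at least one worker is always used.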


Analysis of the trade-off

Optimal number of workers for batch gradient

• Parallelism in a cluster

– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)}, 1\right)$

• Analysis

– More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster

– More operations, i.e. more features and classes $w = fk$ (or a deep network), means a higher degree

– A smaller overhead $c$ for sending/receiving a message means a higher degree

• Example: MNIST8M handwritten digit recognition dataset

– 8.1M documents, 784 features, 10 classes, logistic regression

– 32GFlops double precision CPU, 1Gbit network, overhead ~ 0.1s

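– $n = \max\left(\frac{8.1\text{M} \cdot 784 \cdot 10 \cdot 0.69}{32\text{G}\left(\frac{128 \cdot 784 \cdot 10}{1\text{G}} + 2 \cdot 0.1\right)}, 1\right) = 6$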
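The worked example can be reproduced with a few lines of Python. This is a sketch of the formula from the slides; the function name `optimal_workers` is hypothetical, and truncating the result to a whole number of workers is an assumption made to match the value of 6 quoted above.

```python
import math

def optimal_workers(d, w, p, b, c):
    """n = max(d*w*ln2 / (p*(128*w/b + 2*c)), 1) for batch gradient descent.

    d: data points, w: parameters (= operations per data point),
    p: FLOPS per worker, b: network bandwidth in bits/s,
    c: per-message software overhead in seconds.
    """
    n = d * w * math.log(2) / (p * (128.0 * w / b + 2.0 * c))
    return max(n, 1.0)

# MNIST8M logistic regression: 8.1M points, 784 * 10 parameters,
# 32 GFLOPS CPU, 1 Gbit/s network, 0.1 s overhead.
n_mnist8m = optimal_workers(d=8.1e6, w=784 * 10, p=32e9, b=1e9, c=0.1)
print(int(n_mnist8m))  # truncated to whole workers
```

With these inputs the untruncated value is between 6 and 7, so the estimate is sensitive to the assumed overhead `c`, which dominates the denominator here.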


[Figure: Spark MLP vs Caffe MLP. Vertical axis 0 to 100, horizontal axis 0 to 6. Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU]

Scalability testing

Setup

• MNIST character recognition 60K samples

• 6-layer MLP (784,2500,2000,1500,1000,500,10)

• 12M parameters

• CPU: Xeon X5650 @ 2.67GHz

• GPU: Tesla M2050 3GB, 575MHz

• Caffe (Deep Learning from Berkeley): 1 node

• Spark: 1 master + 5 workers
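The 12M parameter figure can be sanity-checked from the layer sizes. The sketch below assumes a fully connected network where each layer has one weight per input-output pair plus one bias per output; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)

total = count_parameters([784, 2500, 2000, 1500, 1000, 500, 10])
print(total)
```

Under that assumption the count is just under 12 million, consistent with the rounded figure in the setup.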

Results per iteration

• Single node (both tools double precision)

– 1.6x slower than Caffe CPU (Scala vs C++)

• Scalability

– 5 nodes give a 4.7x speedup, which beats Caffe CPU and is close to Caffe GPU

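[Plot axes: seconds (vertical) vs. number of workers (horizontal); the plot annotates the communication cost]

Optimal number of workers predicted by the model: $n = \max\left(\frac{60\text{K} \cdot 12\text{M} \cdot 0.69}{64\text{G}\left(\frac{128 \cdot 12\text{M}}{950\text{M}} + 2 \cdot 0.1\right)}, 1\right) = 4$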


Conclusions & future work

Conclusions

• Scalable multilayer perceptron is available in Spark 1.5.0

• Extensible internal API for Artificial Neural Networks

– Further contributions are welcome!

• Native BLAS (and GPU) speeds up Spark

• Heuristics for parallelization of batch gradient

Work in progress [SPARK-5575]

• Autoencoder(s)

• Restricted Boltzmann Machines

• Drop-out

• Convolutional neural networks

Future work

• SGD & parameter server
