A Scalable Implementation of Deep Learning on Spark - Alexander Ulanov
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov (1)
Joint work with Xiangrui Meng (2), Bert Greevenbosch (3)
With help from Guoqiang Li (4), Andrey Simanovsky (1)
(1) Hewlett-Packard Labs, (2) Databricks, (3) Huawei & Jules Energy, (4) Spark community
Outline
• Artificial neural network basics
• Implementation of Multilayer Perceptron (MLP) in Spark
• Optimization & parallelization
• Experiments
• Future work
Artificial neural network
Basics
• Statistical model that approximates a function of multiple inputs
• Consists of interconnected “neurons” which exchange messages
– A “neuron” produces an output by applying a transformation function to its inputs
• A network with more than 3 layers of neurons is called “deep”, an instance of deep learning
Layer types & learning
• A layer type is defined by a transformation function
– Affine: $y_j = \sum_i w_{ij} x_i + b_j$, Sigmoid: $y_i = (1 + e^{-x_i})^{-1}$, Convolution, Softmax, etc.
• Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
• Model parameters – weights that “neurons” use for transformations
• Parameters are iteratively estimated with the backpropagation algorithm
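The layer definitions above can be illustrated with a minimal pure-Python sketch of a forward pass through pairs of Affine and Sigmoid layers. The function names and the toy weights are hypothetical and chosen only for illustration; this is not the Spark implementation.

```python
import math

def affine(x, weights, bias):
    """y_j = sum_i w_ij * x_i + b_j, with weights[i][j] = w_ij."""
    return [sum(weights[i][j] * x[i] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def sigmoid(x):
    """y_i = 1 / (1 + exp(-x_i)), applied element-wise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(x, layers):
    """Apply (affine, sigmoid) pairs in order; layers is a list of (weights, bias)."""
    for weights, bias in layers:
        x = sigmoid(affine(x, weights, bias))
    return x

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0], [-1.0]], [0.0]),
]
output = mlp_forward([0.0, 0.0], layers)
```

With an all-zero input both hidden neurons output 0.5, the two output weights cancel, and the network returns 0.5.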
Multilayer perceptron
• Speech recognition (phoneme classification), computer vision
• Introduced in Spark 1.5.0
[Figure: multilayer perceptron with input $x$, a hidden layer, and output $y$]
Example of MLP in Spark
Handwritten digits recognition
• Dataset MNIST [LeCun et al. 1998]
• 28x28 greyscale images of handwritten digits 0-9
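• MLP with 784 inputs, 10 outputs and two hidden layers of 300 and 100 neurons
[Figure: network with 784 inputs, a 1st hidden layer of 300 neurons, a 2nd hidden layer of 100 neurons, and an output layer of 10 neurons]
Scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame

// Load MNIST in LIBSVM format
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
// 784 inputs, hidden layers of 300 and 100 neurons, 10 outputs
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)
Python
from pyspark.ml.classification import MultilayerPerceptronClassifier

digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)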
Pipeline with PCA+MLP in Spark
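Scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.DataFrame

val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
// Reduce the 784 input features to 20 principal components
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
// The MLP input layer matches the 20 PCA features
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)
Python
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import PCA

digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)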
MLP implementation in Spark
Requirements
• Conform to Spark APIs
• Extensible interface (deep learning API)
• Efficient and scalable (single node & cluster)
Why conform to Spark APIs?
• Spark can call any Java, Python or Scala library, not necessarily designed for Spark
– This results in expensive data movement from Spark RDDs to the library
– It prevents the library from being used in Spark ML Pipelines
Extensible interface
• Our implementation processes each layer as a black box, with backpropagation in general form
– Allows further introduction of new layers and features
• CNN, Autoencoder and RBM are currently under development by the community
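The idea of treating each layer as a black box can be sketched as follows. This is a hypothetical pure-Python interface written for illustration, not the internal Spark API: the class and method names (`Layer`, `forward`, `backward`, `backpropagate`) are assumptions. The generic loop only calls the two interface methods, so a new layer type can be added without changing it.

```python
import math

class Layer:
    """Hypothetical layer interface: the network only calls these two methods."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad_output):
        """Return (gradient w.r.t. input, gradient w.r.t. this layer's parameters)."""
        raise NotImplementedError

class Affine(Layer):
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias  # weights[i][j] = w_ij
    def forward(self, x):
        return [sum(self.weights[i][j] * x[i] for i in range(len(x))) + self.bias[j]
                for j in range(len(self.bias))]
    def backward(self, x, grad_output):
        grad_input = [sum(self.weights[i][j] * grad_output[j]
                          for j in range(len(grad_output)))
                      for i in range(len(x))]
        grad_weights = [[x[i] * g for g in grad_output] for i in range(len(x))]
        return grad_input, (grad_weights, list(grad_output))

class Sigmoid(Layer):
    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]
    def backward(self, x, grad_output):
        y = self.forward(x)
        # Sigmoid has no parameters, so its parameter gradient is None.
        return [g * v * (1.0 - v) for g, v in zip(grad_output, y)], None

def backpropagate(layers, x, grad_loss):
    """Generic backpropagation: forward through all layers, then backward in reverse."""
    inputs = []
    for layer in layers:
        inputs.append(x)
        x = layer.forward(x)
    grad, param_grads = grad_loss(x), []
    for layer, layer_input in zip(reversed(layers), reversed(inputs)):
        grad, param_grad = layer.backward(layer_input, grad)
        param_grads.append(param_grad)
    return x, list(reversed(param_grads))

network = [Affine([[0.5], [-0.25]], [0.1]), Sigmoid()]
# The loss is the output itself, so its gradient w.r.t. the output is 1.
output, grads = backpropagate(network, [1.0, 2.0], lambda y: [1.0 for _ in y])
```

Here `grads[0]` holds the weight and bias gradients of the affine layer and `grads[1]` is `None` because the sigmoid layer has no parameters.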
Efficiency
Batch processing
• A layer’s affine transformation can be represented in vector form: $\mathbf{y} = W^T\mathbf{x} + \mathbf{b}$
– $\mathbf{y}$ – output from the layer, vector of size $n$
– $W$ – the matrix of layer weights, $m \times n$; $\mathbf{b}$ – bias, vector of size $n$
– $\mathbf{x}$ – input to the layer, vector of size $m$
• Vector-matrix multiplications are not as efficient as matrix-matrix multiplications
– Stack $s$ input vectors (into a batch) to perform matrix multiplication: $Y = W^T X + B$
– $X$ is $m \times s$, $Y$ is $n \times s$
– $B$ is $n \times s$, each column contains a copy of $\mathbf{b}$
• We implemented batch processing in matrix form
– Enabled the use of optimized native BLAS libraries
– Memory is reused to limit GC overhead
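[Figure: the affine transformation drawn as block matrices, once for a single input vector and once for a batch of stacked input vectors]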
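The batching idea can be checked with a small sketch: stacking input vectors as columns of one matrix gives the same outputs as processing them one at a time. The helper names and the toy sizes are hypothetical, and the plain-Python matrix product stands in for the native BLAS `dgemm` call that an optimized implementation would use.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    """Plain triple-loop matrix product; native BLAS dgemm does this far faster."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def affine_batch(weights, bias, batch):
    """Y = W^T X + B, where X is m x s and each column of B is a copy of the bias."""
    product = matmul(transpose(weights), batch)
    return [[product[i][j] + bias[i] for j in range(len(product[0]))]
            for i in range(len(product))]

# Hypothetical sizes: m = 3 inputs, n = 2 outputs, s = 2 stacked input vectors.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m x n
bias = [0.5, -0.5]                                # size n
x1, x2 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
batch = transpose([x1, x2])                       # m x s, one input per column

batched = affine_batch(weights, bias, batch)      # n x s
one_by_one = [affine_batch(weights, bias, transpose([x])) for x in (x1, x2)]
```

Column `j` of `batched` equals the single-vector result for the `j`-th input, so one matrix-matrix product replaces `s` matrix-vector products.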
Single node BLAS
BLAS in Spark
• BLAS – Basic Linear Algebra Subprograms
• Hardware-optimized native libraries in C & Fortran
– CPU: MKL, OpenBLAS etc.
– GPU: NVBLAS (F-BLAS interface to CUDA)
• Use in Spark through Netlib-java
Experiments
• Huge benefit from native BLAS vs pure Java f2jblas
• GPU is faster (2x) only for large matrices
– When compute takes longer than the copy to/from GPU
• More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark” Reza et al., 2015
CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM
GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores
[Figure: dgemm performance. Time in seconds (log scale, 1.00E-04 to 1.00E+04) versus matrix sizes: (1x1)*(1x1), (10x10)*(10x1), (10x10)*(10x10), (100x100)*(100x1), (100x100)*(100x10), (100x100)*(100x100), (1000x1000)*(1000x100), (1000x1000)*(1000x1000), (10000x10000)*(10000x1000), (10000x10000)*(10000x10000). Series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas.]
Scalability
Parallelization
• Each iteration $k$, each node $i$
– 1. Gets parameters $w^k$ from master
– 2. Computes a gradient $\nabla_i^k F(data_i)$
– 3. Sends the gradient to master
– 4. Master computes $w^{k+1}$ based on the gradients
• Gradient type
– Batch – process all data on each iteration
– Stochastic – random point
– Mini-batch – random batch
• How many workers to use?
– Fewer workers – less compute power
– More workers – more communication
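[Figure: parallelization loop. 1. The master sends $w^k$ to executors 1 to N, which hold partitions 1 to P. 2. Each executor $i$ computes its gradient $\nabla_i^k F(data_i)$. 3. The executors send their gradients to the master. 4. The master computes $w^{k+1}$ from the gradients $\nabla_i^k F$; go to step 1.]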
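The four steps can be simulated on a single machine with a short sketch. Everything here is an illustrative assumption: the model is a one-parameter least-squares fit, the master applies a plain gradient descent update, and the "workers" are simply list elements processed in a loop rather than Spark executors.

```python
def partition_gradient(w, partition):
    """Gradient of the squared error 0.5 * (w * x - y)^2 summed over one partition."""
    return sum((w * x - y) * x for x, y in partition)

def train(partitions, w=0.0, step=0.01, iterations=200):
    """Simulates the master loop; a real cluster would run step 2 on executors in parallel."""
    for _ in range(iterations):
        # 1. every worker gets the current parameters w (here: a shared variable)
        # 2. every worker computes the gradient on its own partition
        gradients = [partition_gradient(w, part) for part in partitions]
        # 3-4. the master aggregates the gradients and computes the next w
        w -= step * sum(gradients)
    return w

# Hypothetical data generated by y = 3x, split into two partitions.
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(partitions)
```

Because every iteration touches all partitions, this is the batch gradient variant; sampling a random subset of each partition in step 2 would give the mini-batch variant.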
Communication and computation trade-off
Parallelization of batch gradient
• There are $d$ data points, $f$ features and $k$ classes
– Assume we want to train logistic regression; it has $fk$ parameters
• Communication: $n$ workers send/receive $fk$ 64-bit parameters through the network with bandwidth $b$ and software overhead $c$. Use all-reduce:
– $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
• Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ data points, each needing $fk$ operations
– $t_{cp} \sim \frac{d}{n}\cdot\frac{fk}{p}$
• What is the optimal number of workers?
– $\min_n (t_{cm} + t_{cp}) \Rightarrow n = \max\left(\frac{dfk\ln 2}{p(128fk/b + 2c)}, 1\right)$
– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p(128w/b + 2c)}, 1\right)$, if $w$ is both the number of model parameters and the number of floating point operations per data point
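The step from the two cost terms to the optimal number of workers is not spelled out on the slide. A short derivation, treating $n$ as a continuous variable (an assumption made here for the calculus), is:

```latex
\begin{align*}
T(n) &= t_{cm} + t_{cp}
      = 2\left(\frac{64fk}{b} + c\right)\log_2 n + \frac{d}{n}\cdot\frac{fk}{p} \\
\frac{dT}{dn} &= \frac{2}{n\ln 2}\left(\frac{64fk}{b} + c\right) - \frac{dfk}{n^2 p} = 0 \\
n &= \frac{dfk\,\ln 2}{p\left(\frac{128fk}{b} + 2c\right)}
\end{align*}
```

The outer $\max(\cdot, 1)$ only guards against results below one worker.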
Analysis of the trade-off
Optimal number of workers for batch gradient
• Parallelism in a cluster
– $n = \max\left(\frac{d \cdot w \cdot \ln 2}{p(128w/b + 2c)}, 1\right)$
• Analysis
– More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster
– More operations, i.e. more features and classes $w = fk$ (or a deep network), means a higher degree
– A small overhead $c$ for sending/receiving a message means a higher degree
• Example: MNIST8M handwritten digit recognition dataset
– 8.1M samples, 784 features, 10 classes, logistic regression
– 32 GFLOPS double-precision CPU, 1 Gbit network, overhead ~ 0.1 s
– $n = \max\left(\frac{8.1M \cdot 784 \cdot 10 \cdot 0.69}{32G(128 \cdot 784 \cdot 10 / 1G + 2 \cdot 0.1)}, 1\right) = 6$
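The MNIST8M example can be reproduced with a few lines of Python. The function name `optimal_workers` is hypothetical, and rounding the result down to a whole number of workers is an assumption made here because it matches the value of 6 given above.

```python
import math

def optimal_workers(d, w, p, b, c):
    """n = max(d * w * ln2 / (p * (128 * w / b + 2 * c)), 1).

    d: data points, w: parameters (= operations per data point),
    p: FLOPS per worker, b: bandwidth in bits/s, c: software overhead in seconds.
    """
    return max(d * w * math.log(2) / (p * (128 * w / b + 2 * c)), 1.0)

# MNIST8M with logistic regression, numbers from the example above.
n = optimal_workers(d=8.1e6, w=784 * 10, p=32e9, b=1e9, c=0.1)
workers = math.floor(n)  # rounding down is an assumption that reproduces n = 6
```

With these inputs the software overhead term $2c$ dominates the denominator, so the result is far more sensitive to $c$ than to the network bandwidth.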
Scalability testing
Setup
• MNIST character recognition, 60K samples
• 6-layer MLP (784,2500,2000,1500,1000,500,10)
• 12M parameters
• CPU: Xeon X5650 @ 2.67GHz
• GPU: Tesla M2050 3GB, 575MHz
• Caffe (Deep Learning from Berkeley): 1 node
• Spark: 1 master + 5 workers
Results per iteration
• Single node (both tools double precision)
– 1.6x slower than Caffe CPU (Scala vs C++)
• Scalability
– 5 nodes give 4.7x speedup, beats Caffe, close to GPU
[Figure: Spark MLP vs Caffe MLP. Seconds (0 to 100) versus number of workers (0 to 6). Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU. Annotation: communication cost.]
$n = \max\left(\frac{60K \cdot 12M \cdot 0.69}{64G(128 \cdot 12M / 950M + 2 \cdot 0.1)}, 1\right) = 4$
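The 12M parameter figure can be checked from the layer sizes. The sketch below assumes a fully connected network in which every non-input neuron has one bias; the function name is hypothetical.

```python
def count_parameters(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    weights = sum(m * n for m, n in zip(layers, layers[1:]))
    biases = sum(layers[1:])
    return weights, biases

weights, biases = count_parameters([784, 2500, 2000, 1500, 1000, 500, 10])
```

This gives 11,965,000 weights and 7,510 biases, which rounds to the 12M used in the formula.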
Conclusions & future work
Conclusions
• Scalable multilayer perceptron is available in Spark 1.5.0
• Extensible internal API for Artificial Neural Networks
– Further contributions are welcome!
• Native BLAS (and GPU) speeds up Spark
• Heuristics for parallelization of batch gradient
Work in progress [SPARK-5575]
• Autoencoder(s)
• Restricted Boltzmann Machines
• Drop-out
• Convolutional neural networks
Future work
• SGD & parameter server
Thank you