Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University, at MLconf SF
Transcript:
Marianas Labs
Alex Smola, CMU & Marianas Labs
github.com/dmlc
Fast, Cheap and Deep: Scaling machine learning
SFW
This Talk in 3 Slides (the systems view)
Parameter Server

[Diagram, shown three times: worker nodes read data (local or cloud), run many processing tasks in parallel, write results back, and update local state (or a copy of it) via the server nodes.]
Parameter Server: Multicore

[Same diagram on a single machine: read data (local or cloud), process in parallel, write back; local state (or copy) updated via the parameter server.]
Parameter Server: GPUs (for Deep Learning)

[Same diagram with GPU workers: read data (local or cloud), write back; local state (or copy) updated via the parameter server.]
Details
• Parameter Server Basics: Logistic Regression (Classification)
• Memory Subsystem: Matrix Factorization (Recommender)
• Large Distributed State: Factorization Machines (CTR)
• GPUs: MXNET and applications
• Models
Estimate Click Through Rate

p(y | x, w), where y := click and x := (ad, query)
Sparse Models for Advertising: Click Through Rate (CTR)
• Linear function class: f(x) = ⟨w, x⟩
• Logistic regression: p(y | x, w) = 1 / (1 + exp(−y ⟨w, x⟩))
• Optimization problem: minimize_w −log p(f | X, Y), i.e.
  minimize_w Σ_{i=1}^{m} log(1 + exp(−y_i ⟨w, x_i⟩)) + λ ‖w‖₁
• Solve distributed over many machines (typically 1 TB to 1 PB of data)
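As a concrete sketch of this objective, here is a minimal numpy version (my own illustration, not the talk's distributed code):

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """L1-regularized logistic loss: sum_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||_1."""
    margins = -y * (X @ w)
    # log(1 + exp(.)) computed stably via logaddexp
    loss = np.sum(np.logaddexp(0.0, margins))
    return loss + lam * np.sum(np.abs(w))

def logistic_gradient(w, X, y):
    """Gradient of the smooth (loss) part of the objective."""
    margins = -y * (X @ w)
    p = 1.0 / (1.0 + np.exp(-margins))  # sigmoid(-y <w, x>)
    return X.T @ (-y * p)
```

The l1 penalty is handled separately by the proximal step described on the next slide.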
Optimization Algorithm
• Compute gradient on data
• The l1 norm is nonsmooth, hence use a proximal operator:
  w ← argmin_w ‖w‖₁ + (λ/2) ‖w − (w_t − η g_t)‖²
• Updates for l1 are very simple: w_i ← sgn(w_i) max(0, |w_i| − ε)
• All steps decompose by coordinates; solve in parallel (and asynchronously)
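The l1 proximal step is just coordinate-wise soft-thresholding; a minimal sketch (function names are mine):

```python
import numpy as np

def prox_l1(w, threshold):
    """Soft-thresholding, applied coordinate-wise:
    sgn(w_i) * max(0, |w_i| - threshold)."""
    return np.sign(w) * np.maximum(0.0, np.abs(w) - threshold)

def proximal_gradient_step(w, grad, lr, lam):
    """One proximal gradient step: gradient step on the smooth part,
    then shrink toward zero to honor the l1 penalty."""
    return prox_l1(w - lr * grad, lr * lam)
```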
Parameter Server Template
• Compute gradient on a (subset of) data on each client
• Send gradient from client to server asynchronously: push(key_list, value_list, timestamp)
• Proximal gradient update on server, per coordinate
• Server returns parameters: pull(key_list, value_list, timestamp)

Smola & Narayanamurthy, VLDB 2010; Gonzalez et al., WSDM 2012; Dean et al., NIPS 2012; Shervashidze et al., WWW 2013. Used at Google, Baidu, Facebook, Amazon, Yahoo, Microsoft.
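A schematic of the client side of this template might look as follows; the push/pull signatures mirror the slide, while `server`, the shard objects, and `compute_gradient` are hypothetical stand-ins rather than the actual parameter server API:

```python
# Sketch of a worker's loop in the parameter-server template.
# The server handles proximal updates; workers only compute gradients.

def worker_loop(server, data_shards, compute_gradient, num_iters):
    for t in range(num_iters):
        for shard in data_shards:
            keys = shard.feature_ids()                # sparse keys touched by this shard
            weights = server.pull(keys, timestamp=t)  # fetch current parameters
            grads = compute_gradient(shard, weights)  # local gradient on this data subset
            server.push(keys, grads, timestamp=t)     # send asynchronously; no barrier
```

Because pushes are asynchronous, workers never block on each other; consistency is controlled by the timestamp (bounded delay, as discussed below).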
Each key segment is then duplicated onto the k anticlockwise neighbor server nodes for fault tolerance. If k = 1, then the segment with the mark in the example will be duplicated at Server 3. A new node is first inserted randomly (via a hash function) into the ring, and then takes over the key segments from its clockwise neighbors. Conversely, if a node is removed or fails, its segments are served by its nearest anticlockwise neighbors, which already own a duplicate copy if k > 0. To recover a failed node, we simply insert a node back into the failed node's previous position and request the segment data from its anticlockwise neighbors.

4.5 Node Join and Leave
5 Evaluation

5.1 Sparse Logistic Regression

Sparse logistic regression is a linear binary classifier which combines a logit loss with a sparse regularizer:

min_{w ∈ R^p} Σ_{i=1}^{n} log(1 + exp(−y_i ⟨x_i, w⟩)) + λ ‖w‖₁,

where the regularizer ‖w‖₁ has the desirable property of controlling the number of non-zero entries in the optimal solution w*, but its non-smoothness makes the objective hard to solve.
We compared the parameter server with two special-purpose systems developed by an Internet company. For privacy reasons we call them System-A and System-B. The former uses a variant of the well-known L-BFGS [21, 5], while the latter runs a variant of a block proximal gradient method [24], which updates a block of parameters at each iteration according to first-order and diagonal second-order gradients. Both systems use a sequential consistency model, but are well optimized in both computation and communication.

We re-implemented the algorithm used by System-B on the parameter server, with two modifications. One is that we relax the consistency model to bounded delay. The other is a KKT filter to avoid sending gradients that may not affect the parameters.
Specifically, let g_k be the global (first-order) gradient on feature k at iteration t. By the update rule, the corresponding parameter w_k will not change at this iteration if w_k = 0 and −λ ≤ g_k ≤ λ, so it is not necessary for workers to send g_k. But a worker does not know the global g_k without communication; instead, we let worker i approximate g_k from its local gradient g_k^i by g̃_k = c_k g_k^i / c_k^i, where c_k is the global number of nonzero entries on feature k and c_k^i is the local count. Both are constants that can be obtained before iterating. The worker then skips sending g_k at this iteration if

w_k = 0 and −λ − δ ≤ g̃_k ≤ λ + δ,

where δ ∈ [0, λ] is a user-defined constant.
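The skip rule can be sketched as a small per-feature predicate (hypothetical helper names; a real implementation works on whole sparse vectors at once):

```python
def approx_global_gradient(local_grad_k, global_count_k, local_count_k):
    """g~_k = c_k * g^i_k / c^i_k: scale the local gradient by the ratio of
    global to local nonzero counts for feature k."""
    return global_count_k * local_grad_k / local_count_k

def should_skip(w_k, g_tilde_k, lam, delta):
    """KKT filter: skip sending the gradient for feature k when w_k is zero
    and the approximate global gradient cannot move it off zero.
    Assumes 0 <= delta <= lam."""
    return w_k == 0.0 and -lam - delta <= g_tilde_k <= lam + delta
```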
Method | Consistency | LOC
System-A: L-BFGS | Sequential | 10,000
System-B: Block PG | Sequential | 30,000
Parameter Server: Block PG + KKT Filter | Bounded Delay | 300

Table 3: The three systems compared.

These three systems are compared in Table 3. Notably, both System-A and System-B consist of more than 10K lines of code, while the parameter server uses fewer than 300.
To demonstrate the efficiency of the parameter server, we collected a computational advertising dataset with 170 billion examples and 65 billion unique features. The raw text data is 636 TB, or 141 TB compressed. We ran these systems on 1000 machines, each with 16 cores and 192 GB of memory, connected by 10 Gb Ethernet. For the parameter server, we used 800 machines to form the worker group; each worker caches around 1 billion parameters. The remaining 200 machines form the server group, where each machine runs 10 (virtual) server nodes.
[Figure 8: Convergence of sparse logistic regression: objective value versus time (hours) for System-A, System-B, and the parameter server; the goal is to achieve a small objective value in less time.]
We ran all three systems to reach the same objective value; less time is better. Both System-B and the parameter server use 500 blocks. In addition, the parameter server fixes τ = 4 for the bounded delay, which means each worker can execute 4 blocks in parallel.
Solving It at Scale
• 2014 (Li et al., OSDI'14): 500 TB data, 10^11 variables; local file system stores the files; 1000 servers (corporate cloud); 1 h runtime, 140 MB/s learning
• 2015 (online solver): 1.1 TB (Criteo), 8·10^8 variables, 4·10^9 samples; S3 stores the files (no preprocessing), better IO library; 5 machines (c4.8xlarge); 1000 s runtime, 220 MB/s learning
Details
• Parameter Server Basics: Logistic Regression (Classification)
• Memory Subsystem: Matrix Factorization (Recommender)
• Large Distributed State: Factorization Machines (CTR)
• GPUs: MXNET and applications
• Models
Recommender Systems
• Users u, movies m (or projects)
• Function class: r_um = ⟨v_u, w_m⟩ + b_u + b_m
• Loss function for recommendation (Yelp, Netflix): Σ_{u∼m} (⟨v_u, w_m⟩ + b_u + b_m − y_um)²
Recommender Systems
• Regularized objective:
  Σ_{u∼m} (⟨v_u, w_m⟩ + b_u + b_m + b₀ − r_um)² + (λ/2) [‖U‖²_Frob + ‖V‖²_Frob]
• Update operations:
  v_u ← (1 − η_t λ) v_u − η_t w_m (⟨v_u, w_m⟩ + b_u + b_m + b₀ − r_um)
  w_m ← (1 − η_t λ) w_m − η_t v_u (⟨v_u, w_m⟩ + b_u + b_m + b₀ − r_um)
• Very simple SGD algorithm (random pairs)
• This should be cheap …
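One SGD step on a sampled (user, movie, rating) triple, following these updates; this is a plain numpy sketch, not the cache-optimized implementation discussed next:

```python
import numpy as np

def sgd_step(v_u, w_m, b_u, b_m, b0, r_um, lr, lam):
    """One SGD update for matrix factorization with biases.
    Returns updated copies of the two factors and the two biases."""
    err = float(v_u @ w_m) + b_u + b_m + b0 - r_um   # prediction residual
    v_u_new = (1 - lr * lam) * v_u - lr * err * w_m
    w_m_new = (1 - lr * lam) * w_m - lr * err * v_u
    b_u_new = b_u - lr * err
    b_m_new = b_m - lr * err
    return v_u_new, w_m_new, b_u_new, b_m_new
```

Note that each step touches only one row of U and one row of V, which is what makes the memory access pattern, not the arithmetic, the bottleneck.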
memory subsystem
This should be cheap …
• O(md) burst reads and O(m) random reads
• Netflix dataset: m = 100 million ratings, d = 2048 dimensions, 30 steps
• Runtime should be > 4500 s:
  • 60 GB/s memory bandwidth → 3300 s
  • 100 ns random reads → 1200 s
• We get 560 s. Why?

Liu, Wang, Smola, RecSys 2015
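Assuming 4-byte floats, the back-of-the-envelope numbers above can be reproduced; the access-count multipliers (8 streamed vectors and 4 random accesses per rating) are my reading of the slide, not something it states explicitly:

```python
# Rough cost model for SGD matrix factorization on Netflix.
m = 100_000_000        # ratings
d = 2048               # embedding dimension
steps = 30             # epochs
bandwidth = 60e9       # bytes/s memory bandwidth
t_random = 100e-9      # seconds per random read

# Streaming cost: read + write both the user and the movie factor
# (assumed 8 streams of d floats per rating) at full bandwidth.
burst_seconds = 8 * m * d * 4 * steps / bandwidth      # about 3277 s

# Random-access cost: assumed 4 random accesses per rating.
random_seconds = 4 * m * steps * t_random              # 1200 s

total = burst_seconds + random_seconds                 # about 4500 s
```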
Power Law in Collaborative Filtering

[Log-log plot, Netflix dataset: movie item counts (10^0 to 10^4) versus number of ratings (10^0 to 10^6); the distribution follows a power law.]
Key Ideas
• Stratify ratings by users (only one cache miss per user when reading out of core)
• Keep frequent movies in cache (stratify by blocks of movie popularity)
• Avoid false sharing between sockets (a key cached on the wrong CPU causes a miss)
Key Ideas: GraphChi Partitioning [diagram]

Key Ideas: SC-SGD Partitioning [diagram]
Speed (c4.8xlarge)

[Plots: seconds versus number of dimensions (0 to 2000) for C-SGD, Fast SGLD, GraphChi, GraphLab, and BIDMach. Left: Netflix, 100M ratings, 15 iterations. Right: Yahoo, 250M ratings, 30 iterations (g2.8xlarge).]
Convergence
• GraphChi blocks (users, movies) into random groups
• Poor mixing
• Slow convergence

[Plot: test RMSE versus runtime in seconds (30 epochs total) for C-SGD, GraphChi SGD, and Fast SGLD at k = 2048.]
Details
• Parameter Server Basics: Logistic Regression (Classification)
• Memory Subsystem: Matrix Factorization (Recommender)
• Large Distributed State: Factorization Machines (CTR)
• GPUs: MXNET and applications
• Models
A Linear Model is not enough
p(y|x,w)
Factorization Machines
• Linear model: f(x) = ⟨w, x⟩
• Polynomial expansion (Rendle, 2012):
  f(x) = ⟨w, x⟩ + Σ_{i<j} x_i x_j tr(V_i^(2) ⊗ V_j^(2)) + Σ_{i<j<k} x_i x_j x_k tr(V_i^(3) ⊗ V_j^(3) ⊗ V_k^(3)) + …
  (the interaction terms are memory hogs, too large for an individual machine)
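For the second-order term, with tr(V_i ⊗ V_j) = ⟨V_i, V_j⟩, the standard O(nk) evaluation trick from Rendle's FM work avoids the explicit sum over pairs; a sketch:

```python
import numpy as np

def fm_predict(w, V, x):
    """Degree-2 factorization machine: <w, x> + sum_{i<j} <V_i, V_j> x_i x_j,
    evaluated in O(n*k) via 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ]."""
    linear = w @ x
    s = V.T @ x                                          # per-factor sums: sum_i V_if x_i
    pairwise = 0.5 * (s @ s - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise
```

The same rearrangement also yields the cheap per-coordinate gradients used further below (the [Vx]_j term).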
Prefetching to the Rescue
• Most keys are infrequent (power law distribution)
• Prefetch the embedding vectors for a minibatch from the parameter server
• Compute gradients and push to server
• Variable dimensionality embedding
• Enforcing sparsity (ANOVA style)
• Adaptive gradient normalization
• Frequency adaptive regularization (CF style)
feature occurs in a minibatch. The consequence is that parameters associated with frequent features are less over-regularized. The penalty can be written as

Ω[w, V] = (1/2) Σ_i [λ_i w_i² + μ_i ‖V_i‖₂²]  with λ_i ∝ μ_i.  (11)
Lemma 2 (Dynamic Regularization). Assume that we solve a factorization machines problem with the following updates in a minibatch setting: for each feature i,

I_i ← I{[x_j]_i ≠ 0 for some (x_j, y_j) ∈ B}  (12)
w_i ← w_i − (η_t / b) Σ_{(x_j,y_j) ∈ B} ∂_{w_i} l(f(x_j), y_j) − η_t λ w_i I_i  (13)
V_i ← V_i − (η_t / b) Σ_{(x_j,y_j) ∈ B} ∂_{V_i} l(f(x_j), y_j) − η_t μ V_i I_i  (14)

Here B denotes the minibatch, b its size, and I_i indicates whether feature i appears in the minibatch. Then the effective regularization is given by

λ_i = λ ρ_i and μ_i = μ ρ_i, where ρ_i = 1 − (1 − n_i/m)^b ≈ n_i b / m.

Proof. The probability that a particular feature occurs in a random minibatch is ρ_i = 1 − (1 − n_i/m)^b. Observe that while the amount of regularization across minibatches is not independent, it is additive (and exchangeable). Hence the expected amounts of regularization are ρ_i λ and ρ_i μ, respectively. Expanding the Taylor series in ρ_i yields the approximation.
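A quick numeric check of ρ_i = 1 − (1 − n_i/m)^b and its first-order approximation n_i b / m, with made-up counts for a rare feature:

```python
def rho(n_i, m, b):
    """Probability that feature i (with n_i nonzeros among m examples)
    appears in a random minibatch of size b."""
    return 1.0 - (1.0 - n_i / m) ** b

# For a rare feature the first-order Taylor term n_i * b / m is very accurate.
n_i, m, b = 100, 1_000_000, 32
exact = rho(n_i, m, b)
approx = n_i * b / m
```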
Note that for b = 1 we obtain the conventional frequency-dependent regularization, whereas in the batch setting b = m we obtain the Frobenius regularization. That is, choosing the minibatch size conveniently allows us to interpolate between both extremes efficiently.
3.4 Putting All Things Together

We obtain the following model and optimization problem:

minimize_{w,V} (1/|X|) Σ_{(x,y)} l(f(x), y) + λ₁ ‖w‖₁ + (1/2) Σ_i [λ_i w_i² + μ_i ‖V_i‖₂²]  (15)
subject to V_i = 0 if w_i = 0, and V_ij = 0 for j > k_i,

where the constants λ₁, λ_i, μ_i and k_i are as discussed above. This model differs in two key ways from standard FM models: we add frequency adaptive regularization to better model a possibly nonuniform sparsity pattern in the data, and we use sparsity regularization and memory adaptive constraints to control the model size, which benefits both the statistical model and system performance.
4. DISTRIBUTED OPTIMIZATION

Minimizing (15) is challenging since the potentially large model brings large communication cost. We focus on asynchronous optimization, which hides synchronization cost by communicating and computing in parallel. It has been shown that asynchronous block subspace descent can effectively solve a nonconvex objective with a nonsmooth l1 regularizer [8]. However, it requires expensive data preprocessing. In this paper we consider asynchronous stochastic gradient descent (SGD) [8] for optimization.

[Figure 1: An illustration of asynchronous SGD. Workers W1 and W2 first pull weights from the servers at time 1. Next, W1 pushes its gradient back, increasing the timestamp at the servers. Worker W3 pulls the weights immediately after that. Then W2 and W3 push their gradients back. The delays for W1, W2, and W3 are 0, 1, and 1.]
4.1 Asynchronous Stochastic Gradient Descent

For simplicity we assume there is a single (virtual) server node, which maintains the model w and V (the parameter server framework provides a clean abstraction for multiple servers). Multiple workers run SGD independently of each other. Each worker repeatedly reads data, pulls the recent model from the server, computes a gradient, and pushes the gradient back. Due to network delay, a gradient received by the server may not have been computed on the most recent model. An example is illustrated in Figure 1.

A worker calculates the gradient of the loss in the standard way [14]. We first compute the partial gradients of f:

∂_{w_i} f(x, w, V) = x_i  (16)
∂_{V_ij} f(x, w, V) = x_i [V x]_j − x_i² V_ij  (17)
Note that the term V x can be precomputed. Invoking the chain rule yields the gradient of l. Denote by w(t) and V(t) the model stored at the server at time t, and assume that at time t the server receives a gradient pushed from one worker:

g_θ(t) ← ∂_θ l(f(x, w(t−τ), V(t−τ)), y)  (18)

where θ can be either w or V, and τ is the delay, indicating that the gradient was computed on the model at time t−τ.

As mentioned in Section 3.3, we use frequency adaptive regularization for w and V, and the sparsity-inducing l1 norm for w. In addition, we use AdaGrad [4] to better model possibly nonuniform sparsity in the data. In other words, with μ the scalar in (14) and scalars η_V and β_V, the server updates V_ij by

n_ij ← n_ij + [g_{V_ij}(t)]²
V_ij(t+1) ← V_ij(t) − η_V / (β_V + √n_ij) · (g_{V_ij}(t) + μ V_ij(t)),  (19)

where n_ij is initialized to 0. Updating w differs slightly from V due to the nonsmooth l1 regularizer. We adopted FTRL [10], which solves a "smoothed" proximal operator based on AdaGrad. Similarly denote by η_w and β_w the …
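Update (19) for a single coordinate, as a sketch (variable names are mine):

```python
import math

def adagrad_update(v_ij, n_ij, g_ij, eta_V, beta_V, mu):
    """One AdaGrad step for V_ij following (19): accumulate the squared
    gradient, then scale the (gradient + regularizer) by the adaptive rate."""
    n_ij = n_ij + g_ij ** 2
    step = eta_V / (beta_V + math.sqrt(n_ij))
    v_ij = v_ij - step * (g_ij + mu * v_ij)
    return v_ij, n_ij
```

Coordinates with large accumulated gradients take smaller steps, which is what makes the method robust to the highly nonuniform feature frequencies in CTR data.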
Better Models

[Figure 2: Using different adaptive memory constraints ("no mem adaption", "frequency", "frequency + l1 shrink") while varying the embedding dimension k. Left: Criteo2; right: CTR2. (a) Number of non-zero entries in V (model size). (b) Runtime for one iteration (sec). (c) Relative test logloss compared to logistic regression (k = 0 and 0 relative loss).]
A more interesting observation is that these memory adaptive constraints do not affect the test accuracy. On the contrary, we even see a slight improvement when the dimension k is greater than 8 for CTR2. The reason could be that model capacity control is of great importance when the dimension k is large, and these memory adaptive constraints provide additional capacity control beyond the l2 and l1 regularizers.
5.3 Fixed-point Compression
We evaluate lossy fixed-point compression for data communication. By default, both model and gradient entries are represented as 32-bit floats. In this experiment we compress these values to lower-precision integers. More specifically, given a bin size b and a number of bits n, we represent x by the following n-bit integer:

z := ⌊(x/b) · 2ⁿ⌋ + σ,  (23)

where σ is a Bernoulli random variable chosen so as to ensure that E[z] = 2ⁿ x / b.
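Equation (23) is stochastic rounding to an n-bit fixed-point grid; a minimal sketch (the `rng` hook is mine, added so the Bernoulli draw is testable):

```python
import random

def compress(x, b, n_bits, rng=random.random):
    """Stochastically round x to an n-bit fixed-point integer with bin size b,
    so that E[z] = x / b * 2**n_bits, following eq. (23)."""
    scaled = x / b * (2 ** n_bits)
    z = int(scaled // 1)            # floor
    frac = scaled - z
    if rng() < frac:                # Bernoulli(frac) correction keeps z unbiased
        z += 1
    return z

def decompress(z, b, n_bits):
    """Map the integer back to an approximate float value."""
    return z * b / (2 ** n_bits)
```

Unbiasedness matters here because the compressed values are gradients: rounding noise averages out over SGD steps, while systematic truncation would not.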
Faster Solver (small Criteo)

[Figure 3: Compressing model and gradient using fixed-point compression, where 4-byte means the default 32-bit floating-point format. (a) Total data sent by workers in one iteration (GB) for Criteo2 and CTR2; the compression rates from 4-byte to 1-byte are 4.2x and 2.9x, respectively. (b) Relative test logloss (%) compared to no fixed-point compression.]
We implemented fixed-point compression as a user-defined filter in the parameter server framework. Since multiple numbers are communicated in each round, we choose b to be the absolute maximum of these numbers. In addition, we used the key caching and lossless data compression (via LZ4) filters.

The results for n = 8, 16, 24 are shown in Figure 3. As expected, fixed-point compression linearly reduces the network traffic volume, since the traffic is dominated by communicating the model and gradient. A more interesting observation is that we obtained a 4.2x compression rate going from 32-bit floating point to 8-bit fixed point on Criteo2. The reason is that the latter improves the compression rate of the subsequent lossless LZ4 compression.

We observed different effects on accuracy for the two datasets: CTR2 is robust to the numeric precision, while Criteo2 shows a 6% increase in logloss when using only a 1-byte representation. A medium compression rate, however, even improves model accuracy. This might be because the lossy compression acts as a regularizer on the objective function.
5.4 Comparison with LibFM

[Figure 4: Comparison with LibFM on a single machine: test logloss versus time (sec) for LibFM, DiFacto with 1 worker, and DiFacto with 10 workers. (a) Criteo1. (b) CTR1. The data preprocessing time for LibFM is omitted.]
To the best of our knowledge, there is no publicly released distributed FM solver. Hence we compare DiFacto only to the popular single-machine package LibFM developed by Rendle [14]. We report results for Criteo1 and CTR1 on a single machine, since LibFM fails on the two larger datasets. We perform a grid search over the hyperparameters similar to the one for DiFacto. As LibFM uses only a single thread, we run DiFacto with 1 worker and 1 server in sequential execution order. We also report the performance using 10 workers and 10 servers on a single machine for reference.

The results are shown in Figure 4. As can be seen, DiFacto converges significantly faster than LibFM; it needs half as many iterations to reach the best model. This is because the adaptive learning rate in DiFacto better models the data sparsity, and the adaptive regularization and constraints further accelerate convergence. In particular, the latter yields a lower test logloss on the CTR1 dataset, where the number of features exceeds the number of examples, requiring improved capacity control. Also note that DiFacto with a single worker is twice as slow as LibFM per iteration, because the communication overhead between worker and server cannot be ignored in sequential execution. More importantly, DiFacto does not require any data preprocessing to map arbitrary 64-bit integer and string feature indices, used in both Criteo1 and CTR2, to contiguous integer indices. The cost of this preprocessing step, required by LibFM but not shown in Figure 4, even exceeds the cost of training (1,400 seconds for Criteo1). Nevertheless, DiFacto with a single worker still outperforms LibFM thanks to faster convergence. In addition, it is 10 times faster than LibFM when using 10 workers on the same machine.

(Slide note: LibFM died on large models.)

[Figure 5: The speedup from 1 machine to 16 machines, where each machine runs 10 workers and 10 servers, for Criteo2 and CTR2. The difference in test logloss is within 0.5% and is omitted from the figure.]
5.5 Scalability

Finally, we study the scalability of DiFacto by varying the number of physical machines used in training. We run 10 workers and 10 servers on each machine and increase the number of machines from 1 to 16. Both the time per iteration and the logloss on the test dataset are recorded and shown in Figure 5.

We observe an 8x speedup from 1 machine to 16 machines on both Criteo2 and CTR2. The reason for this satisfactory performance is twofold. First, asynchronous SGD eliminates the need for synchronization between workers and is tolerant of stragglers. Even though we used dedicated machines for the job, they still share network bandwidth with others; in particular, we observed large variation in read speed when streaming data from Amazon's S3 service, despite using the IO-optimized c4.8xlarge series of machines. Second, DiFacto uses several filters which effectively reduce the amount of network traffic. Even though CTR2 produces 10 times more network traffic than Criteo2, the two show similar speedup.

There is a longstanding suspicion that the convergence of asynchronous SGD slows down as the number of workers increases. Nonetheless, we did not observe a substantial difference in model accuracy: the relative difference of the test logloss is below 0.5% when increasing the number of workers from 10 to 160. The reason might be that the datasets we used are highly sparse and the features are not strongly correlated, so inconsistency due to concurrent updates by multiple workers may not have a major effect.
6. CONCLUSION

In this paper we presented DiFacto, a high-performance distributed factorization machine. DiFacto uses a refined factorization machine model with adaptive memory constraints and frequency adaptive regularization, which perform fine-grained control of model capacity based on both data and model statistics. DiFacto trains the model with asynchronous stochastic gradient descent. We analyzed its theoretical convergence and implemented it in the parameter server framework. We evaluated DiFacto thoroughly on two real computational advertising datasets with up to billions of examples and features. We showed that DiFacto produces models that are hundreds of times more compact while retaining similar generalization accuracy. We also demonstrated that DiFacto converges faster than the state-of-the-art package LibFM. Furthermore, DiFacto can solve terabyte-scale problems on modest machines within a few hours.
References
[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In World Wide Web Conference, Rio de Janeiro, 2013.
[2] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007. URL http://doi.acm.org/10.1145/1345448.1345465.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010.
[5] Criteo Labs. Criteo terabyte click logs, 2014. http://labs.criteo.com/downloads/download-terabyte-click-logs.
[6] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning, 2013.
[7] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B. Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.
[8] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014.
[9] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. arXiv preprint arXiv:1506.08272, 2015.
[10] B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In International Conference on Artificial Intelligence and Statistics, pages 525–533, 2011.
[11] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, Todd
Multiple Machines (Li, Wang, Liu, Smola, WSDM'16)
80x LibFM speed on 16 machines
Details
• Parameter Server Basics: Logistic Regression (Classification)
• Memory Subsystem: Matrix Factorization (Recommender)
• Large Distributed State: Factorization Machines (CTR)
• GPUs: MXNET and applications
• Data & Models
MXNET - Distributed Deep Learning
• Several deep learning frameworks
  • Torch7: MATLAB-like tensor computation in Lua
  • Theano: symbolic programming in Python
  • Caffe: configuration in Protobuf
• Problem: you need to learn another system for each different programming flavor
• MXNET
  • provides programming interfaces in a consistent way
  • multiple languages: C++/Python/R/Julia/…
  • fast, memory efficient, and distributed training
Programming Interfaces
• Tensor algebra interface
  • NumPy-compatible
  • GPU/CPU/ARM (mobile)
• Symbolic expression
  • easy deep network design
  • automatic differentiation
• Mixed programming
  • CPU/GPU
  • symbolic + local code
(code examples shown in Python and Julia)
from data import ilsvrc12_iterator
import mxnet as mx

## define alexnet
input_data = mx.symbol.Variable(name="data")
# stage 1
conv1 = mx.symbol.Convolution(data=input_data, kernel=(11, 11), stride=(4, 4), num_filter=96)
relu1 = mx.symbol.Activation(data=conv1, act_type="relu")
pool1 = mx.symbol.Pooling(data=relu1, pool_type="max", kernel=(3, 3), stride=(2, 2))
lrn1 = mx.symbol.LRN(data=pool1, alpha=0.0001, beta=0.75, knorm=1, nsize=5)
# stage 2
conv2 = mx.symbol.Convolution(data=lrn1, kernel=(5, 5), pad=(2, 2), num_filter=256)
relu2 = mx.symbol.Activation(data=conv2, act_type="relu")
pool2 = mx.symbol.Pooling(data=relu2, kernel=(3, 3), stride=(2, 2), pool_type="max")
lrn2 = mx.symbol.LRN(data=pool2, alpha=0.0001, beta=0.75, knorm=1, nsize=5)
# stage 3
conv3 = mx.symbol.Convolution(data=lrn2, kernel=(3, 3), pad=(1, 1), num_filter=384)
relu3 = mx.symbol.Activation(data=conv3, act_type="relu")
conv4 = mx.symbol.Convolution(data=relu3, kernel=(3, 3), pad=(1, 1), num_filter=384)
relu4 = mx.symbol.Activation(data=conv4, act_type="relu")
conv5 = mx.symbol.Convolution(data=relu4, kernel=(3, 3), pad=(1, 1), num_filter=256)
relu5 = mx.symbol.Activation(data=conv5, act_type="relu")
pool3 = mx.symbol.Pooling(data=relu5, kernel=(3, 3), stride=(2, 2), pool_type="max")
# stage 4
flatten = mx.symbol.Flatten(data=pool3)
fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=4096)
relu6 = mx.symbol.Activation(data=fc1, act_type="relu")
dropout1 = mx.symbol.Dropout(data=relu6, p=0.5)
# stage 5
fc2 = mx.symbol.FullyConnected(data=dropout1, num_hidden=4096)
relu7 = mx.symbol.Activation(data=fc2, act_type="relu")
dropout2 = mx.symbol.Dropout(data=relu7, p=0.5)
# stage 6
fc3 = mx.symbol.FullyConnected(data=dropout2, num_hidden=1000)
softmax = mx.symbol.SoftmaxOutput(data=fc3, name='softmax')

## data
batch_size = 256
train, val = ilsvrc12_iterator(batch_size=batch_size, input_shape=(3, 224, 224))

## train
num_gpus = 4
gpus = [mx.gpu(i) for i in range(num_gpus)]
model = mx.model.FeedForward(
    ctx=gpus, symbol=softmax, num_round=20,
    learning_rate=0.01, momentum=0.9, wd=0.00001)
model.fit(X=train, eval_data=val,
          batch_end_callback=mx.callback.Speedometer(batch_size=batch_size))

(6 stages of network definition; multi-GPU training)
Engine
• All operations are issued into a backend C++ engine, so the binding language's performance does not matter
• The engine builds the read/write dependency graph
  • lazy evaluation
  • parallel execution
  • sophisticated memory allocation
Distributed Deep Learning

Plays well with S3, spot instances & containers
Results
• Imagenet dataset (1M images, 1K classes)
• Google Inception network (88 lines of code)

[Plots: images/sec versus number of GPUs (1, 2, 4) for CXXNET vs. MXNET: 40% faster due to parallelization. GPU memory usage (GB) for the naive, inplace, sharing, and both allocation strategies: 50% GPU memory reduction.]
Distributed Results

[Plots: per-machine bandwidth (Mbps) and speedup across 10 machines versus the g2.2xlarge network limit; accuracy versus time (hours) for a single GPU and 4 x dual GPUs. Convergence shown on 4 x dual GTX 980.]
Amazon g2.8xlarge
• 12 instances (48 GPUs) @ $0.50/h spot
• Minibatch size 512
• BSP with 1 delay between machines
• 2 GB/s bandwidth between machines (awful)
  10.113.170.187, 10.157.109.227, 10.169.170.55, 10.136.52.151, 10.45.64.250, 10.166.137.100, 10.97.167.10, 10.97.187.157, 10.61.128.107, 10.171.105.160, 10.203.143.220, 10.45.71.20 (all over the place in the availability zone)
• Compressing to 1 byte per coordinate helps a bit but adds latency due to an extra pass (need to fix)
• 37x speedup on 48 GPUs
• Imagenet'12 dataset trained in 4 h, i.e. $24 (with AlexNet; GoogleNet is even better for the network)
Mobile, Too
• 10 hours on 10 GTX 980
• Deployed on Android
Plans
• This month (went live 2 months ago):
  • 700 stars
  • 27 authors with >1K commits
  • ranks 5th among trending C++ projects on GitHub
• Build a larger community
• Many more applications & platforms
  • image/video analysis, voice recognition
  • language models, machine translation
  • Intel drivers
Side Effects of Using MXNET

Zichao was using Theano
Tensorflow vs. MXNET

           | Languages            | MultiGPU | Distributed | Mobile | Runtime Engine
Tensorflow | Python               | Yes      | No          | Yes    | ?
MXNET      | Python, R, Julia, Go | Yes      | Yes         | Yes    | Yes

(GoogleNet on E5-1650 / GTX 980)

       | Tensorflow         | Torch7 | Caffe  | MXNET
Time   | 940 ms             | 172 ms | 170 ms | 180 ms
Memory | all (OOM after 24) | 2.1 GB | 2.2 GB | 1.6 GB
More Numbers (this morning)

                         | GoogleNet, batch 32 (24) | AlexNet, batch 128 | VGG, batch 32 (24)
Torch7                   | 172 ms, 2.1 GB           | 135 ms, 1.6 GB     | 459 ms, 3.4 GB
Caffe                    | 170 ms, 2.2 GB           | 165 ms, 1.34 GB    | 448 ms, 3.4 GB
MXNET (new and improved) | 172 ms, 1.5 GB           | 147 ms, 1.38 GB    | 456 ms, 3.6 GB
Tensorflow               | 940 ms, -                | 433 ms, -          | 980 ms, -
Details
• Parameter Server Basics: Logistic Regression (Classification)
• Memory Subsystem: Matrix Factorization (Recommender)
• Large Distributed State: Factorization Machines (CTR)
• GPUs: MXNET and applications
• Models
Models
• Fully connected networks (generic data): 70s and 80s … trivial to scale, see e.g. CAFFE
• Convolutional networks (1D, 2D grids): 90s for images, 2014 for text … see e.g. CUDNN
• Recurrent networks (time series, text): 90s, LSTM, latent variable autoregressive
• Memory
  • attention model (Cho et al., 2014)
  • memory model (Weston et al., 2014)
  • memory rewriting, NTM (Graves et al., 2014)
Memory Networks (Sukhbaatar, Szlam, Fergus, Weston, 2015)
• attention layer
• state layer
• query
Memory Networks (Sukhbaatar, Szlam, Fergus, Weston, 2015)
• Query weirdly defined (via a bilinear embedding matrix)
• Works on facts / text only (ingest a sequence and weigh it)
• 'Ask Me Anything' by Socher et al., 2015 (uses GRUs rather than LSTMs)
Memory Networks for Images
http://arxiv.org/abs/1511.02274 (Yang et al., 2015)
Results: best results on
• DAQUAR
• DAQUAR-REDUCED
• COCO-QA
Summary
• Parameter Server Basics: Logistic Regression
• Large Distributed State: Factorization Machines
• Memory Subsystem: Matrix Factorization
• GPUs: MXNET and applications
• Models
github.com/dmlc
Many thanks to
• CMU & Marianas: Mu Li, Ziqi Liu, Yu-Xiang Wang, Zichao Zhang
• Microsoft: Xiaodong He, Jianfeng Gao, Li Deng
• MXNET: Tianqi Chen (UW), Bing Xu, Naiyan Wang, Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang, Chuntao Hong