Anima Anandkumar, Principal Scientist, Amazon Web Services, Endowed Professor, Caltech at MLconf SF...
![Page 1: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/1.jpg)
Learning at Scale:
Deep, Distributed and Multi-dimensional
Anima Anandkumar
Amazon AI & Caltech
![Page 2: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/2.jpg)
Deep Learning
Significantly improves many applications across multiple domains: image understanding, speech recognition, natural language processing, autonomy, …
[Figure: “deep learning” trend in the past 10 years.]
![Page 3: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/3.jpg)
Image Classification
Layer 1 Layer 2 Output
Multilevel feature extraction, from raw pixels to semantic meaning
Exploit spatial information with convolution layers
![Page 4: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/4.jpg)
Image Classification
State-of-the-art networks have tens to hundreds of layers
• Hard to define the network
  • the definition of the Inception network has >1k lines of code in Caffe
• A single image requires billions of floating-point operations
  • Intel i7: ~500 GFLOPS
  • Nvidia Titan X: ~5 TFLOPS
• Memory consumption is linear in the number of layers
![Page 5: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/5.jpg)
Outline
1 Introduction
2 Distributed Deep Learning Using MXNet
3 Learning in Multiple Dimensions
4 Conclusion
![Page 6: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/6.jpg)
3. MXNet
image credit - wikipedia
• Imperative and Declarative Programming
• Language Support
• Backend and Automatic Parallelization
![Page 7: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/7.jpg)
Writing Parallel Programs is Painful
Each forward-backward-update involves O(num_layer) tensor computations and communications, where num_layer is often 100 to 1,000.
Dependency graph for a 2-layer neural network with 2 GPUs (operations listed in execution order):
data = next_batch()
data[gpu0].copyfrom(data[0:50])
data[gpu1].copyfrom(data[50:100])
fc1[gpu0] = FullcForward(data[gpu0], fc1_weight[gpu0])
fc1[gpu1] = FullcForward(data[gpu1], fc1_weight[gpu1])
fc2[gpu0] = FullcForward(fc1[gpu0], fc2_weight[gpu0])
fc2[gpu1] = FullcForward(fc1[gpu1], fc2_weight[gpu1])
fc2_ograd[gpu0] = LossGrad(fc2[gpu0], label[0:50])
fc2_ograd[gpu1] = LossGrad(fc2[gpu1], label[50:100])
fc1_ograd[gpu0], fc2_wgrad[gpu0] = FullcBackward(fc2_ograd[gpu0], fc2_weight[gpu0])
fc1_ograd[gpu1], fc2_wgrad[gpu1] = FullcBackward(fc2_ograd[gpu1], fc2_weight[gpu1])
_, fc1_wgrad[gpu0] = FullcBackward(fc1_ograd[gpu0], fc1_weight[gpu0])
_, fc1_wgrad[gpu1] = FullcBackward(fc1_ograd[gpu1], fc1_weight[gpu1])
fc2_wgrad[cpu] = fc2_wgrad[gpu0] + fc2_wgrad[gpu1]
fc2_weight[cpu] -= lr * fc2_wgrad[cpu]
fc2_weight[cpu].copyto(fc2_weight[gpu0], fc2_weight[gpu1])
fc1_wgrad[cpu] = fc1_wgrad[gpu0] + fc1_wgrad[gpu1]
fc1_weight[cpu] -= lr * fc1_wgrad[cpu]
fc1_weight[cpu].copyto(fc1_weight[gpu0], fc1_weight[gpu1])
![Page 8: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/8.jpg)
Auto Parallelization
Write serial programs, run in parallel
>>> import mxnet as mx
>>> A = mx.nd.ones((2,2)) * 2
>>> C = A + 2
>>> B = A + 1
>>> D = B * C
>>> D.wait_to_read()
[Figure: dependency graph. A = 2 feeds both C = A + 2 and B = A + 1, which together feed D = B ⨉ C.]
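The idea behind this slide can be illustrated with a tiny dependency-level scheduler. This is only a sketch in plain Python, not MXNet's engine: the function `parallel_levels` and the `deps` dictionary are hypothetical names introduced here. Operations that land in the same level do not depend on each other, so a runtime is free to execute them concurrently.

```python
from collections import defaultdict

def parallel_levels(deps):
    """Group operations into levels; operations in one level are mutually
    independent and could be executed concurrently."""
    level = {}

    def depth(op):
        # Depth = 1 + deepest input; operations without inputs get depth 0.
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=-1)
        return level[op]

    groups = defaultdict(list)
    for op in deps:
        groups[depth(op)].append(op)
    return [sorted(groups[i]) for i in sorted(groups)]

# Inputs read by each operation in the slide's serial program.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
levels = parallel_levels(deps)
print(levels)
```

B and C share a level because both only read A, which is exactly the parallelism the serial program leaves implicit.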
![Page 9: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/9.jpg)
Data Parallelism
[Figure: workers exchange parameters and gradients with a key-value store; each worker holds a partition of the examples.]
1. Read a data partition
2. Pull the parameters
3. Compute the gradient
4. Push the gradient
5. Update the parameters
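The five steps above can be sketched with a toy, in-process key-value store. This is an illustration under stated assumptions, not the MXNet KVStore API: the `KVStore` class, its methods, the scalar model y = w * x, and the learning rate are all hypothetical choices made to keep the example self-contained.

```python
class KVStore:
    """Toy stand-in for a parameter server (hypothetical, single process)."""

    def __init__(self, lr):
        self.lr = lr
        self.params = {}
        self.pending = {}

    def init(self, key, value):
        self.params[key] = value
        self.pending[key] = []

    def pull(self, key):
        return self.params[key]

    def push(self, key, grad):
        self.pending[key].append(grad)

    def update(self, key):
        # Step 5: average the pushed gradients and take one SGD step.
        grads = self.pending[key]
        self.params[key] -= self.lr * sum(grads) / len(grads)
        self.pending[key] = []

def gradient(w, partition):
    # Gradient of the mean of 0.5 * (w * x - y)^2 over one data partition.
    return sum((w * x - y) * x for x, y in partition) / len(partition)

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # generated with w = 3
partitions = [data[:2], data[2:]]                     # 1. one partition per worker
store = KVStore(lr=0.05)
store.init("w", 0.0)
for _ in range(200):
    for part in partitions:
        w = store.pull("w")                           # 2. pull the parameters
        store.push("w", gradient(w, part))            # 3-4. compute and push
    store.update("w")                                 # 5. synchronized update
w_final = store.pull("w")
print(w_final)
```

Both workers pull the same parameter value before the update is applied, which mirrors synchronized updating.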
![Page 10: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/10.jpg)
Scale to Multiple GPU Machines
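[Figure: four GPUs and a CPU attached to a PCIe switch; machines are connected through a network switch.]
Link bandwidths:
• 4 PCIe 3.0 x16: 63 GB/s
• PCIe 3.0 x16: 15.75 GB/s
• 10 Gbit Ethernet: 1.25 GB/s
Hierarchical parameter server
[Figure labels: workers, level-1 servers, level-2 servers; GPUs, CPUs.]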
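To see why the hierarchy matters, the link bandwidths on this slide can be turned into rough transfer times. The sketch below is back-of-the-envelope arithmetic only; the payload of 60 million float32 values is an assumption for illustration and does not come from the slides, and latency and protocol overhead are ignored.

```python
# Bandwidths from the slide, in GB/s.
LINK_GB_PER_S = {
    "4x PCIe 3.0 x16": 63.0,
    "PCIe 3.0 x16": 15.75,
    "10 Gbit Ethernet": 1.25,
}

def transfer_seconds(num_bytes, gb_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (gb_per_s * 1e9)

# Assumption: 60 million float32 values (not a figure from the slides).
payload_bytes = 60_000_000 * 4
times = {name: transfer_seconds(payload_bytes, bw)
         for name, bw in LINK_GB_PER_S.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

The Ethernet link is roughly 50 times slower than the aggregate PCIe bandwidth inside a machine, so aggregating within a machine before sending over the network reduces the traffic on the slowest link.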
![Page 11: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/11.jpg)
Experiment Setup
✧ Dataset
  ✓ 1.2 million images with 1000 classes
✧ ResNet 152-layer model
✧ EC2 P2.16xlarge
  [Figure: GPUs 0-15 connected to the CPU through PCIe switches.]
✧ Minibatch SGD
✧ Synchronized updating
![Page 12: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/12.jpg)
Scalability over Multiple Machines
[Figure: time (sec) per batch (0 to 1) against the number of GPUs (0 to 128). Curves: communication cost and batch size/GPU = 2, 4, 8, 16. Annotation: 115x.]
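If the 115x annotation is read as the speedup at 128 GPUs relative to a single GPU (an assumption; the transcript does not state the baseline), the corresponding scaling efficiency follows from one division. The helper name `scaling_efficiency` is hypothetical.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / num_gpus

# Assumption: 115x is measured at 128 GPUs against a 1-GPU baseline.
efficiency = scaling_efficiency(115, 128)
print(f"{efficiency:.1%}")
```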
![Page 13: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/13.jpg)
[Figure: timeline from before 2012 to 2017. Labels: mxnet, imperative, symbolic, gluon.]
![Page 14: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/14.jpg)
Back-end System
✧ Optimization
  ✓ Memory optimization
  ✓ Operator fusion
✧ Scheduling
  ✓ Auto-parallelization
[Figure: front-end programs become computation graphs (a, b, 1, +, ⨉, c and fullc, softmax, weight, bias) that are handed to the back-end.]
Front-end (imperative):
import mxnet as mx
a = mx.nd.zeros((100, 50))
b = mx.nd.ones((100, 50))
c = a * b
c += 1
Front-end (symbolic):
import mxnet as mx
net = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=net, num_hidden=128)
net = mx.symbol.SoftmaxOutput(data=net)
texec = mx.module.Module(net)
texec.forward(data=c)
texec.backward()
![Page 15: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/15.jpg)
In summary
✦ Symbolic
  ❖ efficient & portable
  ❖ but hard to use
✦ Imperative
  ❖ flexible
  ❖ may be slow
✦ Gluon
  ❖ imperative for developing
  ❖ symbolic for deploying
![Page 16: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/16.jpg)
Outline
1 Introduction
2 Distributed Deep Learning Using MXNet
3 Learning in Multiple Dimensions
4 Conclusion
![Page 17: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/17.jpg)
Tensors: Beyond 2D world
Modern data is inherently multi-dimensional
![Page 18: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/18.jpg)
Tensors: Beyond 2D world
Modern data is inherently multi-dimensional
Input Hidden 1 Hidden 2 Output
![Page 19: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/19.jpg)
Tensor Contraction
Extends the notion of matrix product
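Matrix product: $Mv = \sum_{j} v_j M_j$, where $M_j$ is the $j$-th column of $M$.
Tensor contraction: $T(u, v, \cdot) = \sum_{i,j} u_i v_j T_{i,j,:}$
[Figure: the matrix product drawn as a sum of weighted columns, and the tensor contraction drawn as a sum of weighted fibers.]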
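Both formulas can be checked numerically. The sketch below uses NumPy (an assumption; the slides themselves use MXNet) with arbitrary small shapes, and compares `numpy.einsum` against the explicit sums.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix product as a weighted sum of columns: Mv = sum_j v_j M_j.
M = rng.standard_normal((3, 4))
v = rng.standard_normal(4)
Mv = sum(v[j] * M[:, j] for j in range(4))

# Tensor contraction: T(u, v, .) = sum_{i,j} u_i v_j T[i, j, :].
T = rng.standard_normal((3, 4, 5))
u = rng.standard_normal(3)
Tuv = np.einsum("i,j,ijk->k", u, v, T)
Tuv_loop = sum(u[i] * v[j] * T[i, j, :] for i in range(3) for j in range(4))

print(np.allclose(Mv, M @ v), np.allclose(Tuv, Tuv_loop))
```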
![Page 20: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/20.jpg)
Employing Tensor Contractions in AlexNet
Replace fully connected layer with tensor contraction layer
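A minimal sketch of what such a layer computes, assuming the usual formulation in which each non-batch mode of the activation tensor is contracted with its own small factor matrix. The shapes, ranks, and the function name `tensor_contraction_layer` are hypothetical and chosen for illustration; this is NumPy, not the MXNet operator from the talk.

```python
import numpy as np

def tensor_contraction_layer(x, factors):
    """Contract every non-batch mode of x with a factor matrix.

    x: (batch, d1, d2, d3); factors[k]: (r_k, d_k).
    Returns an array of shape (batch, r1, r2, r3)."""
    w1, w2, w3 = factors
    return np.einsum("bijk,pi,qj,rk->bpqr", x, w1, w2, w3)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 6, 6))                 # hypothetical activation
factors = [rng.standard_normal((r, d)) for r, d in [(4, 8), (3, 6), (3, 6)]]
y = tensor_contraction_layer(x, factors)

# Parameter counts: factor matrices versus a dense map between flattened shapes.
tcl_params = sum(f.size for f in factors)             # 4*8 + 3*6 + 3*6
fc_params = (8 * 6 * 6) * (4 * 3 * 3)
print(y.shape, tcl_params, fc_params)
```

The layer is equivalent to a fully connected layer whose weight matrix is constrained to be a Kronecker product of the factors, which is where the space savings come from.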
![Page 21: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/21.jpg)
Enabling Tensor Contraction Layer in MXNet
![Page 22: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/22.jpg)
Performance of the TCL
• Trained end-to-end
• On ImageNet with VGG:
  • 65.9% space savings
  • performance drop of only 0.6%
• On ImageNet with AlexNet:
  • 56.6% space savings
  • performance improvement of 0.5%
![Page 23: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/23.jpg)
Low-rank tensor regression
Tensor Regression Networks, J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello and A. Anandkumar, arXiv pre-publication
![Page 24: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/24.jpg)
Performance and rank
![Page 25: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/25.jpg)
Speeding up Tensor Contractions
1 Tensor contractions are a core primitive of multilinear algebra.
2 BLAS 3: Unbounded compute intensity (no. of ops per I/O)
Consider single-index contractions: $C = AB$
e.g. $C_{mnp} = A_{mnk} B_{kp}$
[Figure: A (4 × 2 × 2), shown as slices A(:,1,:) and A(:,2,:), contracted with B (2 × 1) to give C (4 × 2 × 1).]
![Page 26: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/26.jpg)
Speeding up Tensor Contraction
Explicit permutation dominates, especially for small tensors.
Consider $C_{mnp} = A_{km} B_{pkn}$.
1 $A_{km} \to A_{mk}$
2 $B_{pkn} \to B_{kpn}$
3 $C_{mnp} \to C_{mpn}$
4 $C_{m(pn)} = A_{mk} B_{k(pn)}$
5 $C_{mpn} \to C_{mnp}$
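The five steps can be mimicked in NumPy to make the copies visible. This is a sketch under assumptions: NumPy arrays are row-major while BLAS is column-major, so the memory layouts differ from the slide, and the output is assumed to be overwritten (β = 0), so step 3 reduces to choosing the layout in which C is produced. The sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N, P = 3, 4, 5, 6
A = rng.standard_normal((K, M))        # A_km
B = rng.standard_normal((P, K, N))     # B_pkn

A_mk = np.ascontiguousarray(A.T)                          # 1. copy A_km -> A_mk
B_kpn = np.ascontiguousarray(B.transpose(1, 0, 2))        # 2. copy B_pkn -> B_kpn
# 3. C is produced directly in the mpn layout (nothing to permute when beta = 0).
C_m_pn = A_mk @ B_kpn.reshape(K, P * N)                   # 4. one flattened GEMM
C_mpn = C_m_pn.reshape(M, P, N)
C_mnp = np.ascontiguousarray(C_mpn.transpose(0, 2, 1))    # 5. copy C_mpn -> C_mnp

C_ref = np.einsum("km,pkn->mnp", A, B)
print(np.allclose(C_mnp, C_ref))
```

Only step 4 does arithmetic; every other step is a memory copy, which is why the copies dominate for small tensors.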
[Figure: (Top) CPU. (Bottom) GPU. The fraction of time (0 to 1) spent in copies/transpositions as a function of n (100 to 500). Lines are shown with 1, 2, 3, and 6 transpositions.]
![Page 27: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/27.jpg)
Existing Primitives
GEMM
Suboptimal for many small matrices.
Pointer-to-Pointer BatchedGEMM
Available in MKL 11.3β and cuBLAS 4.1
C[p] = α op(A[p]) op(B[p]) + β C[p]
cublas<T>gemmBatched(cublasHandle_t handle,
cublasOperation_t transA, cublasOperation_t transB,
int M, int N, int K,
const T* alpha,
const T** A, int ldA,
const T** B, int ldB,
const T* beta,
T** C, int ldC,
int batchCount)
![Page 28: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/28.jpg)
Tensor Contraction with Extended BLAS Primitives
$C_{mn[p]} = A_{mk} B_{kn[p]}$
cublasDgemmStridedBatched(handle,
CUBLAS_OP_N, CUBLAS_OP_N,
M, N, K,
&alpha,
A, ldA1, 0,
B, ldB1, ldB2,
&beta,
C, ldC1, ldC2,
P)
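The role of the stride arguments in the call above can be shown with a reference emulation on flat column-major buffers. This is a plain-Python sketch of the semantics only (no transposes, no GPU); the function `gemm_strided_batched` and the sizes are hypothetical and written for this example, not taken from cuBLAS.

```python
import numpy as np

def gemm_strided_batched(m, n, k, alpha, A, lda, stride_a,
                         B, ldb, stride_b, beta, C, ldc, stride_c, batch_count):
    """Reference semantics: C[p] = alpha * A[p] @ B[p] + beta * C[p],
    where matrix p starts at offset p * stride in each flat buffer."""
    for p in range(batch_count):
        for j in range(n):
            for i in range(m):
                acc = 0.0
                for l in range(k):
                    acc += A[p * stride_a + i + l * lda] * B[p * stride_b + l + j * ldb]
                idx = p * stride_c + i + j * ldc
                C[idx] = alpha * acc + beta * C[idx]
    return C

M, N, K, P = 2, 3, 4, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K))        # A_mk, shared by every batch entry
B = rng.standard_normal((K, N, P))     # B_knp
a_flat = A.ravel(order="F")
b_flat = B.ravel(order="F")            # k fastest: ldB1 = K, ldB2 = K * N
c_flat = np.zeros(M * N * P)

# Stride 0 on A reuses the same matrix for every p, as in the slide's call.
gemm_strided_batched(M, N, K, 1.0, a_flat, M, 0,
                     b_flat, K, K * N, 0.0, c_flat, M, M * N, P)
C = c_flat.reshape((M, N, P), order="F")
C_ref = np.einsum("mk,knp->mnp", A, B)
print(np.allclose(C, C_ref))
```

No pointer arrays and no copies are needed: the batch dimension is expressed purely through the strides.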
![Page 29: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/29.jpg)
Tensor Contraction with Extended BLAS Primitives
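$C_{mnp} = A_{**} \times B_{***}$, with $C_{mnp} \equiv C[m + n \cdot \mathrm{ldC1} + p \cdot \mathrm{ldC2}]$

| Case | Contraction | Kernel1 | Kernel2 |
|---|---|---|---|
| 1.1 | $A_{mk} B_{knp}$ | $C_{m(np)} = A_{mk} B_{k(np)}$ | $C_{mn[p]} = A_{mk} B_{kn[p]}$ |
| 1.2 | $A_{mk} B_{kpn}$ | $C_{mn[p]} = A_{mk} B_{k[p]n}$ | $C_{m[n]p} = A_{mk} B_{kp[n]}$ |
| 1.3 | $A_{mk} B_{nkp}$ | $C_{mn[p]} = A_{mk} B_{nk[p]}^{\top}$ | |
| 1.4 | $A_{mk} B_{pkn}$ | $C_{m[n]p} = A_{mk} B_{pk[n]}^{\top}$ | |
| 1.5 | $A_{mk} B_{npk}$ | $C_{m(np)} = A_{mk} B_{(np)k}^{\top}$ | $C_{mn[p]} = A_{mk} B_{n[p]k}^{\top}$ |
| 1.6 | $A_{mk} B_{pnk}$ | $C_{m[n]p} = A_{mk} B_{p[n]k}^{\top}$ | |
| 2.1 | $A_{km} B_{knp}$ | $C_{m(np)} = A_{km}^{\top} B_{k(np)}$ | $C_{mn[p]} = A_{km}^{\top} B_{kn[p]}$ |
| 2.2 | $A_{km} B_{kpn}$ | $C_{mn[p]} = A_{km}^{\top} B_{k[p]n}$ | $C_{m[n]p} = A_{km}^{\top} B_{kp[n]}$ |
| 2.3 | $A_{km} B_{nkp}$ | $C_{mn[p]} = A_{km}^{\top} B_{nk[p]}^{\top}$ | |
| 2.4 | $A_{km} B_{pkn}$ | $C_{m[n]p} = A_{km}^{\top} B_{pk[n]}^{\top}$ | |
| 2.5 | $A_{km} B_{npk}$ | $C_{m(np)} = A_{km}^{\top} B_{(np)k}^{\top}$ | $C_{mn[p]} = A_{km}^{\top} B_{n[p]k}^{\top}$ |
| 2.6 | $A_{km} B_{pnk}$ | $C_{m[n]p} = A_{km}^{\top} B_{p[n]k}^{\top}$ | |
| 3.1 | $A_{nk} B_{kmp}$ | $C_{mn[p]} = B_{km[p]}^{\top} A_{nk}^{\top}$ | |
| 3.2 | $A_{nk} B_{kpm}$ | $C_{mn[p]} = B_{k[p]m}^{\top} A_{nk}^{\top}$ | |
| 3.3 | $A_{nk} B_{mkp}$ | $C_{mn[p]} = B_{mk[p]} A_{nk}^{\top}$ | |
| 3.4 | $A_{nk} B_{pkm}$ | | |
| 3.5 | $A_{nk} B_{mpk}$ | $C_{mn[p]} = B_{m[p]k} A_{nk}^{\top}$ | |
| 3.6 | $A_{nk} B_{pmk}$ | | |
| 4.1 | $A_{kn} B_{kmp}$ | $C_{mn[p]} = B_{km[p]}^{\top} A_{kn}$ | |
| 4.2 | $A_{kn} B_{kpm}$ | $C_{mn[p]} = B_{k[p]m}^{\top} A_{kn}$ | |
| 4.3 | $A_{kn} B_{mkp}$ | $C_{mn[p]} = B_{mk[p]} A_{kn}$ | |
| 4.4 | $A_{kn} B_{pkm}$ | | |
| 4.5 | $A_{kn} B_{mpk}$ | $C_{mn[p]} = B_{m[p]k} A_{kn}$ | |
| 4.6 | $A_{kn} B_{pmk}$ | | |
| 5.1 | $A_{pk} B_{kmn}$ | $C_{(mn)p} = B_{k(mn)}^{\top} A_{pk}^{\top}$ | $C_{m[n]p} = B_{km[n]}^{\top} A_{pk}^{\top}$ |
| 5.2 | $A_{pk} B_{knm}$ | $C_{m[n]p} = B_{k[n]m}^{\top} A_{pk}^{\top}$ | |
| 5.3 | $A_{pk} B_{mkn}$ | $C_{m[n]p} = B_{mk[n]} A_{pk}^{\top}$ | |
| 5.4 | $A_{pk} B_{nkm}$ | | |
| 5.5 | $A_{pk} B_{mnk}$ | $C_{(mn)p} = B_{(mn)k} A_{pk}^{\top}$ | $C_{m[n]p} = B_{m[n]k} A_{pk}^{\top}$ |
| 5.6 | $A_{pk} B_{nmk}$ | | |
| 6.1 | $A_{kp} B_{kmn}$ | $C_{(mn)p} = B_{k(mn)}^{\top} A_{kp}$ | $C_{m[n]p} = B_{km[n]}^{\top} A_{kp}$ |
| 6.2 | $A_{kp} B_{knm}$ | $C_{m[n]p} = B_{k[n]m}^{\top} A_{kp}$ | |
| 6.3 | $A_{kp} B_{mkn}$ | $C_{m[n]p} = B_{mk[n]} A_{kp}$ | |
| 6.4 | $A_{kp} B_{nkm}$ | | |
| 6.5 | $A_{kp} B_{mnk}$ | $C_{(mn)p} = B_{(mn)k} A_{kp}$ | $C_{m[n]p} = B_{m[n]k} A_{kp}$ |
| 6.6 | $A_{kp} B_{nmk}$ | | |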
![Page 30: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/30.jpg)
A new primitive: StridedBatchedGEMM
Performance on par with pure GEMM (P100 and beyond).
![Page 31: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/31.jpg)
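Applications: Tucker Decomposition
$T_{mnp} = G_{ijk} A_{mi} B_{nj} C_{pk}$
[Figure: tensor T (mnp) written as a core G (ijk) multiplied by factors A (mi), B (nj) and C (pk).]
Main steps in the algorithm:
$Y_{mjk} = T_{mnp} B^{t}_{nj} C^{t}_{pk}$
$Y_{ink} = T_{mnp} A^{t+1}_{mi} C^{t}_{pk}$
$Y_{ijp} = T_{mnp} B^{t+1}_{nj} A^{t+1}_{mi}$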
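The three contractions above can be placed in a complete iteration. The sketch below assumes the standard higher-order orthogonal iteration, in which each factor is refreshed from the leading left singular vectors of the unfolded Y; the slide shows only the contractions, so the SVD update, the HOSVD initialization, and the function names are assumptions. It uses NumPy rather than batched BLAS kernels.

```python
import numpy as np

def leading_left_singular_vectors(Y, mode, rank):
    """Unfold Y along `mode` and return its top-`rank` left singular vectors."""
    unfolded = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, :rank]

def hooi(T, ranks, num_iters=10):
    # Assumed initialization: truncated SVD of each unfolding of T.
    A, B, C = (leading_left_singular_vectors(T, mode, r)
               for mode, r in enumerate(ranks))
    for _ in range(num_iters):
        Y = np.einsum("mnp,nj,pk->mjk", T, B, C)   # Y_mjk = T_mnp B_nj C_pk
        A = leading_left_singular_vectors(Y, 0, ranks[0])
        Y = np.einsum("mnp,mi,pk->ink", T, A, C)   # Y_ink = T_mnp A_mi C_pk
        B = leading_left_singular_vectors(Y, 1, ranks[1])
        Y = np.einsum("mnp,nj,mi->ijp", T, B, A)   # Y_ijp = T_mnp B_nj A_mi
        C = leading_left_singular_vectors(Y, 2, ranks[2])
    G = np.einsum("mnp,mi,nj,pk->ijk", T, A, B, C)  # core tensor
    return G, A, B, C

# Build a tensor with exact Tucker rank (2, 3, 2) and recover it.
rng = np.random.default_rng(0)
G0 = rng.standard_normal((2, 3, 2))
A0, B0, C0 = (rng.standard_normal(s) for s in [(6, 2), (7, 3), (8, 2)])
T = np.einsum("ijk,mi,nj,pk->mnp", G0, A0, B0, C0)
G, A, B, C = hooi(T, ranks=(2, 3, 2))
T_hat = np.einsum("ijk,mi,nj,pk->mnp", G, A, B, C)
print(np.allclose(T, T_hat))
```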
Performance on Tucker decomposition:
[Figure: time (sec), log scale from 10^-2 to 10^6, against n (20 to 120) for TensorToolbox, BTAS, Cyclops, CPU Batched and GPU Batched.]
![Page 32: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/32.jpg)
Tensor Sketches
Randomized dimensionality reduction through sketching.
◮ Complexity independent of tensor order: exponential gain!
[Figure: a tensor T is compressed into a sketch s using random signs (+1, -1).]
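One concrete way to sketch a rank-1 tensor without ever forming it is to count-sketch each factor and combine the sketches by circular convolution, computed with FFTs. The slide does not name the construction, so this choice, the hash and sign arrays, and the sizes below are assumptions made for illustration in NumPy.

```python
import numpy as np

def count_sketch(x, h, s, b):
    """Count sketch: add s[i] * x[i] into bucket h[i] of a length-b vector."""
    out = np.zeros(b)
    np.add.at(out, h, s * x)
    return out

rng = np.random.default_rng(0)
n, b = 50, 16
u, v = rng.standard_normal(n), rng.standard_normal(n)
h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)            # random buckets
s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)  # random signs

# Sketch of the outer product of u and v, computed from the two small sketches.
fast = np.fft.ifft(np.fft.fft(count_sketch(u, h1, s1, b)) *
                   np.fft.fft(count_sketch(v, h2, s2, b))).real

# Direct count sketch of the full n x n outer product, for comparison.
H = (h1[:, None] + h2[None, :]) % b
S = s1[:, None] * s2[None, :]
direct = count_sketch(np.outer(u, v).ravel(), H.ravel(), S.ravel(), b)
print(np.allclose(fast, direct))
```

The fast path touches only vectors of length n and b; adding a third factor adds one more sketch and one more FFT product rather than another dimension to materialize.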
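Applications
• Tensor Decomposition via Sketching
• Visual Question and Answering
[Figure: VQA pipeline. A CNN encodes the image (C × W × H) and an RNN encodes the question "What is the mustache made of?" (length L). Block labels: MCT, average pooling, FC, ReLU, BatchNorm, FC, softmax. Answer: "Banana".]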
![Page 33: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/33.jpg)
MCT in Visual Question & Answering
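[Figure: VQA pipeline. A CNN encodes the image (C × W × H) and an RNN encodes the question "What is the mustache made of?" (length L). Block labels: MCT, average pooling, FC, ReLU, BatchNorm, FC, softmax. Answer: "Banana".]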
![Page 34: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/34.jpg)
Multimodal Tensor Pooling
[Figure: multimodal tensor pooling. Labels: image feature (C × W × H), spatial sketch (d1 × d2 × d3), 3D FFT; text feature (L), count sketch (d4), 1D FFT; 3D IFFT (optional); output d1 × d2 × d3.]
![Page 35: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/35.jpg)
Tensor Decompositions
![Page 36: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/36.jpg)
Extracting Topics from Documents
[Figure: documents containing words such as "police", "witness" and "campus" are described by topics (crime, sports, education) and by per-document topic proportions.]
A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, Y. K. Liu. “Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation,” NIPS 2012.
![Page 37: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/37.jpg)
Tensor Methods for Topic Modeling
[Figure: topic-word matrix with one column per topic (crime, sports, education) over words such as campus, police, witness.]
Topic-word matrix P[word = i | topic = j]
Linearly independent columns
Moment Tensor: Co-occurrence of Word Triplets
[Figure: the moment tensor written as a sum of three rank-1 terms, one per topic.]
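A small numerical sketch of the moment tensor, assuming the simplified model in which each document has a single topic and its words are drawn independently from that topic (for LDA the moments need additional correction terms, which are not shown here). The topic-word matrix and topic weights below are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical topic-word matrix: rows are words, columns are topics,
# entry [i, j] = P[word = i | topic = j]; each column sums to 1.
topic_word = np.array([[0.1, 0.6, 0.2],
                       [0.7, 0.1, 0.2],
                       [0.2, 0.3, 0.6]])
topic_weights = np.array([0.5, 0.3, 0.2])   # hypothetical P[topic = j]

# Third-order moment: probability of each word triplet,
# M3 = sum_j P[topic = j] * a_j (x) a_j (x) a_j, with a_j the j-th column.
M3 = np.einsum("t,it,jt,kt->ijk",
               topic_weights, topic_word, topic_word, topic_word)
print(M3.shape, M3.sum())
```

Each topic contributes one rank-1 term, so recovering the columns of the topic-word matrix amounts to decomposing M3; linear independence of the columns is what makes that decomposition identifiable.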
![Page 38: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/38.jpg)
Tensors vs. Variational Inference
Criterion: Perplexity = exp[−likelihood].
Learning Topics from PubMed on Spark, 8mil articles
[Figure: running time (0 to 10 ×10^4) and perplexity (10^3 to 10^5) for Tensor and Variational.]
Learning network communities from social network data
Facebook n ∼ 20k, Yelp n ∼ 40k, DBLP-sub n ∼ 1e5, DBLP n ∼ 1e6.
[Figure: running time (10^2 to 10^6) and error (10^-2 to 10^1) on FB, YP, DBLPsub and DBLP.]
F. Huang, U. N. Niranjan, M. Hakeem, A. Anandkumar, “Online tensor methods for training latent variable models,” JMLR 2014.
![Page 39: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/39.jpg)
Tensors vs. Variational Inference
[Same plots as the previous slide, annotated: orders of magnitude faster and more accurate.]
![Page 40: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/40.jpg)
Outline
1 Introduction
2 Distributed Deep Learning Using MXNet
3 Learning in Multiple Dimensions
4 Conclusion
![Page 41: Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor, CalTech at MLconf SF 2017](https://reader031.fdocuments.us/reader031/viewer/2022022415/5a65ee057f8b9a6f448b4a7b/html5/thumbnails/41.jpg)
Conclusion
Distributed Deep Learning at Scale
MXNet has many attractive features
◮ Flexible programming
◮ Portable
◮ Highly efficient
Easy to deploy large-scale DL on AWS cloud
◮ Deep Learning AMI
◮ CloudFormation templates
Tensors are the future of ML
Tensor contractions: space savings in deep architectures.
New primitives speed up tensor contractions: extended BLAS
[Figures: the tensor contraction T(u, v, ·) and a tensor written as a sum of rank-1 terms.]