Sayan Pathak - Microsoft Cognitive Toolkit · PDF file · 2017-05-16Cognitive...


With many contributors:

A. Agarwal, E. Akchurin, E. Barsoum, C. Basoglu, G. Chen, S. Cyphers, W. Darling, J. Droppo, K. Deng, A. Eversole, B. Guenter, P. He, M. Hillebrand, X. Huang, Z. Huang, R. Hoens, V. Ivanov, A. Kamenev, N. Karampatziakis, P. Kranen, O. Kuchaiev, W. Manousek, C. Marschner, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Radmilac, A. Reznichenko, P. Parthasarathi, S. Pathak, B. Peng, A. Reznichenko, W. Richert, F. Seide, M. Seltzer, M. Slaney, A. Stolcke, T. Will, H. Wang, Z. Wang, W. Xiong, K. Yao, D. Yu, C. Zhang, Y. Zhang, G. Zweig

Sayan Pathak, Principal ML Scientist

Chris Basoglu, Partner Dev Manager


Outline

• CNTK overview

• Key features
  • Symbolic loop
  • Batch scheduling
  • Data parallel training

• Conclusions


Deep learning at Microsoft

• Microsoft Cognitive Services

• Skype Translator

• Cortana

• Bing

• HoloLens

• Microsoft Research


Microsoft Cognitive Services


ImageNet: Microsoft 2015 ResNet

ImageNet Classification top-5 error (%)

Competition   Entry         Top-5 error (%)
ILSVRC2010    NEC America   28.2
ILSVRC2011    Xerox         25.8
ILSVRC2012    AlexNet       16.4
ILSVRC2013    Clarifai      11.7
ILSVRC2014    VGG           7.3
ILSVRC2014    GoogLeNet     6.7
ILSVRC2015    ResNet        3.5

Microsoft took first place with all 5 of its entries in 2015: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.


Microsoft Translator http://translate.it


You can follow along to this presentation on your own device, in the language of your choice.

Download the Microsoft Translator app for Android, iOS, or Windows

or

Visit translate.it/<ENTER CODE HERE>

Type in this unique conversation code below to join this conversation

<ENTER CODE HERE>

Microsoft Translator is powered by machine learning. Any voice or text information you provide will be used to improve Microsoft products and services.


Bing / Bing Ads


Microsoft’s historic speech breakthrough

• Microsoft 2016 research system for conversational speech recognition

• 5.9% word-error rate

• enabled by CNTK’s multi-server scalability

[W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig: “Achieving Human Parity in Conversational Speech Recognition,” https://arxiv.org/abs/1610.05256]


Microsoft Customer Support Agent


Microsoft Cognitive Toolkit (CNTK)

• Microsoft’s open-source deep-learning toolkit

• https://github.com/Microsoft/CNTK

• Created by Microsoft Speech researchers in 2012

• On GitHub since Jan 2016 under MIT license

• Python support since Oct 2016 (beta), rebranded as “Cognitive Toolkit”

• External contributions, e.g. from MIT, Stanford, and NVIDIA


Microsoft Cognitive Toolkit (CNTK)

• Over 80% of Microsoft’s internal DL workload runs on CNTK

• 1st-class on Linux and Windows, docker support

• Python, C++, C#, Java

• Internal == External


CNTK – The Fastest Toolkit

                   Caffe       CNTK         MXNet         TensorFlow    Torch
FCN5 (1024)        55.329ms    51.038ms     60.448ms      62.044ms      52.154ms
AlexNet (256)      36.815ms    27.215ms     28.994ms      103.960ms     37.462ms
ResNet (32)        143.987ms   81.470ms     84.545ms      181.404ms     90.935ms
LSTM (256)         -           43.581ms     288.142ms     -             1130.606ms
  (v7 benchmark)               (44.917ms)   (284.898ms)   (223.547ms)   (906.958ms)

Benchmarking by HKBU, Version 8: http://dlbench.comp.hkbu.edu.hk/

Single Tesla K80 GPU, CUDA: 8.0, CUDNN: v5.1

Caffe: 1.0rc5 (39f28e4), CNTK: 2.0 Beta10 (1ae666d), MXNet: 0.93 (32dc3a2), TensorFlow: 1.0 (4ac9c09), Torch: 7 (748f5e3)


“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”

[Figure: speed comparison (samples/second), higher = better, for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); vertical axis from 0 to 80000. Annotations: “Theano only supports 1 GPU”; “Achieved with 1-bit gradient quantization algorithm”. Note: December 2015.]


Superior performance


Scalability


What is new in CNTK 2.0?

https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/

Microsoft has now released a major upgrade of the software and rebranded it as part of the Microsoft Cognitive Toolkit. This release is a major improvement over the initial release.

There are two major changes from the first release that you will see when you begin to look at the new release. First is that CNTK now has a very nice Python API and, second, the documentation and examples are excellent.

Installing the software from the binary builds is very easy on both Ubuntu Linux and Windows.


CNTK Other Advantages

• Python and C++ API
  • Mostly implemented in C++
  • Low level + high level Python API

• Extensibility
  • User functions and learners in pure Python

• Readers
  • Distributed, highly efficient built-in data readers

• Internal == External


• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

The Microsoft Cognitive Toolkit (CNTK)


MNIST Handwritten Digits (OCR)

• Data set of handwritten digits with
  • 60,000 training images
  • 10,000 test images

• Each image is: 28 x 28 pixels

[Figure: grid of sample Handwritten Digits with their Corresponding Labels: 1 5 4 3 / 5 3 5 3 / 5 9 0 6]


Multi-layer perceptron

Input: 28 x 28 pix image, flattened to 784 pixels (x)

Deep Model

Layer   Input (i)   Output (O)      Activation (a)   Weights
D       784         400             relu             784 x 400 + 400 bias
D       400         200             relu             400 x 200 + 200 bias
D       200         10 (10 nodes)   None             200 x 10 + 10 bias

Output scores: z0 z1 z2 z3 z4 z5 z6 z7 z8 z9

softmax: $\frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}$

Resulting probabilities: 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01

https://github.com/Microsoft/CNTK/tree/master/Tutorials
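
As a quick check of the layer sizes on this slide, the sketch below counts the trainable parameters of the three dense layers and applies the softmax formula to ten output scores. It is plain Python, not the CNTK API; the helper names `dense_param_count` and `softmax` and the example scores are illustrative assumptions.

```python
import math

def dense_param_count(n_in, n_out):
    """Weights (n_in x n_out) plus one bias per output node."""
    return n_in * n_out + n_out

# Layer sizes from the slide: 784 -> 400 -> 200 -> 10
layer_sizes = [(784, 400), (400, 200), (200, 10)]
params_per_layer = [dense_param_count(i, o) for i, o in layer_sizes]
total_params = sum(params_per_layer)

def softmax(z):
    """exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Made-up scores z0..z9 (not taken from the slide).
scores = [0.5, 0.5, 0.7, 1.3, 0.8, 0.6, 0.5, 0.5, 1.0, -1.5]
probs = softmax(scores)

print(params_per_layer, total_params)
print([round(p, 2) for p in probs])
```

The largest score (index 3 here) receives the largest probability, and the probabilities sum to 1.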


Loss function

Input: 28 x 28 pix (p) → Model(w, b) → Predicted Probabilities (p)

Predicted Probabilities (p): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01

Label One-hot encoded (Y): 0 0 0 1 0 0 0 0 0 0

Cross entropy error: $ce = -\sum_{j=0}^{9} y_j \log p_j$

[Figure: sample handwritten digits with labels 1 5 4 3 / 5 3 5 3 / 5 9 0 6]
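
The cross-entropy formula can be evaluated directly on the numbers shown on this slide. The sketch below is plain Python; the function name `cross_entropy` is chosen here for illustration and is not a CNTK call.

```python
import math

def cross_entropy(y_onehot, p):
    """ce = -sum_j y_j * log(p_j); terms with y_j == 0 contribute nothing."""
    return -sum(y * math.log(q) for y, q in zip(y_onehot, p) if y != 0)

# Values copied from the slide: the one-hot label marks digit 3.
label = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
predicted = [0.08, 0.08, 0.10, 0.17, 0.11, 0.09, 0.08, 0.08, 0.13, 0.01]

ce = cross_entropy(label, predicted)
print(round(ce, 3))  # -log(0.17), about 1.772
```

Because the label is one-hot, only the predicted probability of the true class enters the loss.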


CNTK Model

Example: 2-hidden layer feed-forward NN

$h_1 = \sigma(W_1 x + b_1)$    h1 = sigmoid(x @ W1 + b1)

$h_2 = \sigma(W_2 h_1 + b_2)$    h2 = sigmoid(h1 @ W2 + b2)

$P = \mathrm{softmax}(W_{out} h_2 + b_{out})$    P = softmax(h2 @ Wout + bout)

with input $x \in \mathbb{R}^M$ and one-hot label $y \in \mathbb{R}^J$

and cross-entropy training criterion

$ce = y^T \log P$    ce = cross_entropy(P, y)
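
To make the four model lines concrete, here is a minimal forward pass in plain Python (no CNTK). The helper `affine`, the tiny layer sizes, and all weight values are made up for illustration. One assumption to note: `cross_entropy` below returns the negative log-likelihood (a loss to minimize), which is the sign convention of the earlier loss-function slide.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def affine(x, W, b):
    """Row vector times matrix plus bias: x @ W + b, with W indexed [input][output]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def forward(x, params):
    W1, b1, W2, b2, Wout, bout = params
    h1 = sigmoid(affine(x, W1, b1))
    h2 = sigmoid(affine(h1, W2, b2))
    return softmax(affine(h2, Wout, bout))

def cross_entropy(P, y):
    """Negative log-likelihood of the one-hot label y under P."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, P) if yj)

# Tiny made-up example: M = 2 inputs, 3 and 2 hidden units, J = 2 classes.
params = (
    [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]], [0.0, 0.1, 0.0],   # W1, b1
    [[0.2, -0.3], [0.1, 0.1], [-0.2, 0.4]], [0.0, 0.0],      # W2, b2
    [[0.5, -0.5], [-0.3, 0.3]], [0.0, 0.0],                  # Wout, bout
)
P = forward([1.0, 2.0], params)
ce = cross_entropy(P, [0, 1])
print(P, ce)
```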


CNTK Model

[Figure: computation graph of the 2-hidden layer model above. Inputs x and y; parameters W1, b1, W2, b2, Wout, bout; operation nodes +, σ, +, σ, +, softmax, cross_entropy; intermediate values h1, h2, P; output ce.]


CNTK Model

[Figure: the same computation graph, from inputs x and y to output ce]

• Nodes: functions (primitives)
  • Can be composed into reusable composites

• Edges: values
  • Incl. tensors, sparse

• Automatic differentiation
  • ∂F / ∂in = ∂F / ∂out ∙ ∂out / ∂in

• Deferred computation → execution engine

• Editable, clonable

LEGO-like composability allows CNTK to support a wide range of networks & applications
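
The chain rule on this slide, ∂F/∂in = ∂F/∂out ∙ ∂out/∂in, is what a graph-based toolkit applies node by node. The following is a minimal reverse-mode sketch in plain Python for scalar values. The `Node` class and the `add`, `mul`, `sigmoid`, and `backward` helpers are hypothetical names for this illustration and say nothing about how CNTK implements differentiation internally.

```python
import math

class Node:
    """A value in the graph plus the local derivatives w.r.t. its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input_node, d_out / d_in)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def backward(out):
    """Propagate dF/d_in = dF/d_out * d_out/d_in from the output to the leaves."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(out)
    out.grad = 1.0
    # Reverse topological order: a node's gradient is complete before it is used.
    for n in reversed(order):
        for p, local in n.parents:
            p.grad += n.grad * local

x, w, b = Node(2.0), Node(-0.5), Node(0.25)
h = sigmoid(add(mul(x, w), b))   # h = sigmoid(x * w + b)
backward(h)
print(h.value, w.grad, b.grad)
```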


Authoring networks as functions

• “model function”
  • features → predictions
  • defines the model structure & parameter initialization
  • holds parameters that will be learned by training

• “criterion function”
  • (features, labels) → (training loss, additional metrics)
  • defines training and evaluation criteria on top of the model function
  • provides gradients w.r.t. training criteria

[Figure: the computation graph of the 2-hidden layer model, from inputs x and y to output ce]
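
The split between a model function and a criterion function can be illustrated with ordinary Python closures. This is only a sketch of the idea for a one-layer binary classifier: `make_model` and `make_criterion` are hypothetical helpers, the parameter values are made up, and gradient computation is left out.

```python
import math

def make_model(w, b):
    """Model function: features -> prediction. It holds the parameters w and b."""
    params = {"w": list(w), "b": b}
    def model(features):
        z = sum(wi * xi for wi, xi in zip(params["w"], features)) + params["b"]
        return 1.0 / (1.0 + math.exp(-z))   # probability of class 1
    model.params = params
    return model

def make_criterion(model):
    """Criterion function: (features, label) -> (training loss, additional metric)."""
    def criterion(features, label):
        p = model(features)
        loss = -math.log(p if label == 1 else 1.0 - p)
        error = 0.0 if (p >= 0.5) == (label == 1) else 1.0
        return loss, error
    return criterion

model = make_model(w=[0.5, -1.0], b=0.1)
criterion = make_criterion(model)
loss, error = criterion([2.0, 0.5], 1)
print(loss, error)
```

The criterion is built on top of the model, so updating `model.params` changes both the predictions and the loss without redefining either function.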


Authoring networks as functions

• CNTK model: neural networks are functions
  • pure functions
  • with “special powers”:
    • can compute a gradient w.r.t. any of its nodes
    • external deity can update model parameters

• user specifies network as function objects:
  • formula as a Python function (low level, e.g. LSTM)
  • function composition of smaller sub-networks (layering)
  • higher-order functions (equiv. of scan, fold, unfold)
  • model parameters held by function objects

• “compiled” into the static execution graph under the hood
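
Function composition and higher-order functions can be shown with plain Python stand-ins. The names `sequential`, `fold`, and `recurrence` below only mirror the ideas listed on this slide; they are assumptions for this sketch and are not the CNTK layers library.

```python
from functools import reduce

def sequential(*layers):
    """Compose layers left to right: sequential(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

def scale(factor):
    """A toy 'layer' that holds one parameter."""
    return lambda x: factor * x

def relu(x):
    return max(0.0, x)

def fold(step, initial_state):
    """Fold a step function over a sequence, returning the final state."""
    return lambda sequence: reduce(step, sequence, initial_state)

def recurrence(step, initial_state):
    """Like fold, but return the state after every element (a scan)."""
    def run(sequence):
        states, state = [], initial_state
        for item in sequence:
            state = step(state, item)
            states.append(state)
        return states
    return run

net = sequential(scale(2.0), relu, scale(-1.0))
running_sum = recurrence(lambda state, x: state + x, 0.0)
total = fold(lambda state, x: state + x, 0.0)

print(net(3.0), running_sum([1.0, 2.0, 3.0]), total([1.0, 2.0, 3.0]))
```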


Layers lib: full list of layers/blocks

• layers/blocks.py:
  • LSTM(), GRU(), RNNUnit()
  • Stabilizer(), identity
  • ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[]

• layers/layers.py:
  • Dense(), Embedding()
  • Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution()
  • MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling()
  • BatchNormalization(), LayerNormalization()
  • Dropout(), Activation()
  • Label()

• layers/higher_order_layers.py:
  • Sequential(), For(), operator >>, (function tuples)
  • ResNetBlock(), SequentialClique()

• layers/sequence.py:
  • Delay(), PastValueWindow()
  • Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom()

• models/models.py:
  • AttentionModel()


• Symbolic loops over sequences with dynamic scheduling

• Turn graph into parallel program through minibatching

• Unique parallel training algorithms (1-bit SGD, Block Momentum)

CNTK unique features


Symbolic loops over sequential data

extend our example to a recurrent network (RNN)

$h_1(t) = \sigma(W_1 x(t) + H_1 h_1(t-1) + b_1)$    h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)

$h_2(t) = \sigma(W_2 h_1(t) + H_2 h_2(t-1) + b_2)$    h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)

$P(t) = \mathrm{softmax}(W_{out} h_2(t) + b_{out})$    P = softmax(h2 @ Wout + bout)

$ce(t) = y^T(t) \log P(t)$    ce = cross_entropy(P, y)

no explicit notion of time


Symbolic loops over sequential data

extend our example to a recurrent network (RNN)

$h_1(t) = \sigma(W_1 x(t) + H_1 h_1(t-1) + b_1)$    h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)

$h_2(t) = \sigma(W_2 h_1(t) + H_2 h_2(t-1) + b_2)$    h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)

$P(t) = \mathrm{softmax}(W_{out} h_2(t) + b_{out})$    P = softmax(h2 @ Wout + bout)

$ce(t) = y^T(t) \log P(t)$    ce = cross_entropy(P, y)

$\sum_{\text{corpus}} ce(t) = \max$
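
For comparison with the symbolic form, the sketch below writes out the explicit time loop for the first recurrent layer in the scalar case. It is plain Python; the function name `run_recurrence`, the parameter values, and the initial state of 0 are assumptions made for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def run_recurrence(xs, W1, H1, b1, initial=0.0):
    """h1(t) = sigmoid(W1 * x(t) + H1 * h1(t-1) + b1), scalar case."""
    h_prev = initial          # stands in for past_value(h1) at the first step
    outputs = []
    for x_t in xs:            # the explicit time loop that the symbolic form hides
        h_t = sigmoid(W1 * x_t + H1 * h_prev + b1)
        outputs.append(h_t)
        h_prev = h_t          # becomes past_value(h1) for the next step
    return outputs

hs = run_recurrence([1.0, 0.0, -1.0], W1=0.5, H1=0.8, b1=0.0)
print(hs)
```

In the symbolic version the user writes only the per-step formula, and the loop over t is left to the toolkit.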


Symbolic loops over sequential data

[Figure: computation graph of the recurrent network above. Each hidden layer (h1, h2) feeds back into its own input through a delay node (z^-1) and a product with H1 or H2; inputs x and y, output ce.]

• CNTK automatically unrolls cycles → deferred computation

• Efficient and composable


Batch-scheduling of variable-length sequences

• minibatches containing sequences of different lengths are automatically packed and padded

• CNTK handles the special cases:
  • past_value operation correctly resets state and gradient at sequence boundaries
  • non-recurrent operations just pretend there is no padding (“garbage-in/garbage-out”)
  • sequence reductions

[Figure: sequences 1 to 7 packed into rows of parallel sequences and padded to equal length; vertical axis: parallel sequences; horizontal axis: time steps computed in parallel]
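
Packing and padding can be sketched in a few lines of plain Python. The greedy "shortest row first" strategy, the function name `pack_sequences`, and the sequence lengths are assumptions for illustration; the slides do not describe CNTK's actual scheduling policy.

```python
def pack_sequences(sequences, num_rows, pad=None):
    """Greedily place each sequence in the currently shortest row, then pad."""
    rows = [[] for _ in range(num_rows)]
    starts = [[] for _ in range(num_rows)]   # positions where a new sequence begins
    for seq in sorted(sequences, key=len, reverse=True):
        r = min(range(num_rows), key=lambda i: len(rows[i]))
        starts[r].append(len(rows[r]))
        rows[r].extend(seq)
    width = max(len(row) for row in rows)
    # mask: 1 for real time steps, 0 for padding
    mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    padded = [row + [pad] * (width - len(row)) for row in rows]
    return padded, mask, starts

# Seven made-up sequences of different lengths, packed into three parallel rows.
seqs = [[i] * n for i, n in enumerate([7, 3, 4, 6, 2, 3, 5], start=1)]
padded, mask, starts = pack_sequences(seqs, num_rows=3)
for row in padded:
    print(row)
```

The `starts` positions are where a recurrent state would have to be reset, and the `mask` marks which cells are padding that must not contribute to the loss.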


Batch-scheduling of variable-length sequences

• minibatches containing sequences of different lengths are automatically packed and padded

• speed-up is automatic:

[Figure: Speed comparison on RNNs (horizontal axis from 0 to 25). Naïve, single sequence: 1; Optimized, multi sequence: >20]


Data-parallel training

• Data-parallelism: distribute minibatch over workers, all-reduce partial gradients

• ring algorithm: O(2 (K-1)/K M), which is O(1) w.r.t. K

[Figure: node 1, node 2, node 3]
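
The communication cost of the ring algorithm can be checked with a small single-process simulation. The sketch below assumes that K is the number of workers and M the number of gradient values per worker (the slide does not define the symbols), and that M is divisible by K. `ring_all_reduce` is a hypothetical helper, not a CNTK function.

```python
def ring_all_reduce(grads):
    """Sum equally sized gradient lists across workers with a ring all-reduce.

    Returns the per-worker results and the number of values each worker sent.
    """
    k = len(grads)
    m = len(grads[0])
    assert m % k == 0, "sketch assumes M is divisible by K"
    c = m // k                                # chunk length
    data = [list(g) for g in grads]
    sent = 0                                  # values sent by one worker

    def chunk(idx):
        return range(idx * c, (idx + 1) * c)

    # Phase 1, reduce-scatter: after K-1 steps worker i owns the full sum
    # of chunk (i + 1) mod K.
    for step in range(k - 1):
        outgoing = [((i - step) % k, None) for i in range(k)]
        outgoing = [(idx, [data[i][j] for j in chunk(idx)])
                    for i, (idx, _) in enumerate(outgoing)]
        for i, (idx, values) in enumerate(outgoing):
            dst = (i + 1) % k
            for j, v in zip(chunk(idx), values):
                data[dst][j] += v
        sent += c

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(k - 1):
        outgoing = [((i + 1 - step) % k, None) for i in range(k)]
        outgoing = [(idx, [data[i][j] for j in chunk(idx)])
                    for i, (idx, _) in enumerate(outgoing)]
        for i, (idx, values) in enumerate(outgoing):
            dst = (i + 1) % k
            for j, v in zip(chunk(idx), values):
                data[dst][j] = v
        sent += c

    return data, sent

grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced, sent = ring_all_reduce(grads)
print(reduced[0], sent)
```

With K = 3 and M = 6, each worker sends 2 (K-1)/K M = 8 values, and this amount stays below 2M no matter how many workers join the ring.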


Data-parallel training

How to reduce communication cost:

communicate less each time

• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs”, Interspeech 2014]
  • Quantize gradients to 1 bit per value
  • Trick: carry over quantization error to next minibatch

[Figure: node 1, node 2, and node 3 exchange gradients that are 1-bit quantized with residual]
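
The error-carrying trick can be sketched in plain Python. This is an illustration rather than CNTK's implementation: the function name `one_bit_quantize`, the threshold at zero, and the use of the two bin means as reconstruction values are assumptions made for the sketch.

```python
def one_bit_quantize(gradient, residual):
    """Quantize to 1 bit per value, carrying the quantization error forward."""
    # Add the error left over from the previous minibatch before quantizing.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [c >= 0.0 for c in corrected]
    # Two reconstruction values: the mean of each bin (an assumption here).
    pos = [c for c in corrected if c >= 0.0]
    neg = [c for c in corrected if c < 0.0]
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0
    decoded = [pos_value if b else neg_value for b in bits]
    # Whatever was lost in quantization is kept for the next minibatch.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, (pos_value, neg_value), decoded, new_residual

residual = [0.0] * 4
sent_total = [0.0] * 4
true_total = [0.0] * 4
minibatch_grads = ([0.3, -0.1, 0.05, -0.4], [0.2, -0.2, 0.1, -0.3], [0.1, 0.0, -0.05, -0.5])
for grad in minibatch_grads:
    bits, levels, decoded, residual = one_bit_quantize(grad, residual)
    sent_total = [s + d for s, d in zip(sent_total, decoded)]
    true_total = [t + g for t, g in zip(true_total, grad)]

# What was sent plus the residual still held back equals the true gradient sum.
gap = [t - (s + r) for t, s, r in zip(true_total, sent_total, residual)]
print(gap)
```

Only the bits and the two reconstruction values would need to be communicated per minibatch; the residual stays local to each worker.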


Data-parallel training

How to reduce communication cost:

communicate less each time

• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent...Distributed Training of Speech DNNs”, Interspeech 2014]
  • quantize gradients to 1 bit per value
  • trick: carry over quantization error to next minibatch

communicate less often

• Automatic MB sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “On Parallelizability of Stochastic Gradient Descent...”, ICASSP 2014]

• Block momentum [K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training…,” ICASSP 2016]
  • Very recent, very effective parallelization method
  • Combines model averaging with error-residual idea
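
As a rough illustration of "communicate less often", the sketch below shows one synchronization step in the spirit of block momentum: workers train independently on a block of data, their models are averaged, and the resulting change is smoothed with a momentum term. This is a simplified reading, not the published algorithm or CNTK's code; the function name `block_momentum_step`, the update rule as written, and the values of `eta` and `zeta` are assumptions.

```python
def block_momentum_step(w_global, worker_models, delta_prev, eta=0.75, zeta=1.0):
    """One synchronization step after every worker has trained on its data block."""
    n = len(worker_models)
    # Plain model averaging over the workers.
    w_avg = [sum(m[i] for m in worker_models) / n for i in range(len(w_global))]
    # Block-level "gradient": how far the average moved from the last global model.
    g = [a - w for a, w in zip(w_avg, w_global)]
    # Momentum-filtered update (eta: block momentum, zeta: block learning rate).
    delta = [eta * d + zeta * gi for d, gi in zip(delta_prev, g)]
    w_new = [w + d for w, d in zip(w_global, delta)]
    return w_new, delta

w = [1.0, 2.0]
delta = [0.0, 0.0]
# First block: two workers return slightly different models.
w, delta = block_momentum_step(w, [[1.2, 1.8], [1.4, 1.6]], delta)
first = list(w)
# Second block: both workers agree, and the momentum term adds to the move.
w, delta = block_momentum_step(w, [[1.5, 1.5], [1.5, 1.5]], delta)
print(first, w)
```

Workers exchange models only once per block instead of once per minibatch, which is where the reduction in communication comes from.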


Benchmark result of parallel training on CNTK

1bit/BMUF Speedup Factors in LSTM Training

Series         4 GPUs   8 GPUs   16 GPUs   32 GPUs   64 GPUs
1bit-average   2.9      5.4      8.0       -         -
1bit-peak      3.3      6.7      10.8      -         -
BMUF-average   3.7      6.9      13.8      25.5      43.7
BMUF-peak      4.1      8.1      14.1      27.3      54.0

• Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana

• About 16 and 20 days to train DNN and LSTM on 1 GPU, respectively

Credit: Yongqiang Wang, Kai Chen, Qiang Huo


Results

• Achievement
  • Almost linear speedup without degradation of model quality
  • Verified for training DNN, CNN, LSTM up to 64 GPUs for speech recognition, image classification, OCR, and click prediction tasks

• Released in CNTK as a critical differentiator

• Used for enterprise scale production data loads

• Production tools in other companies such as iFLYTEK and Alibaba


Where to begin?

On GitHub: https://github.com/Microsoft/CNTK/wiki

Tutorials: https://www.cntk.ai/pythondocs/tutorials.html (latest release), https://github.com/Microsoft/CNTK/tree/master/Tutorials (latest)

Azure Notebooks: Try for free pre-hosted https://notebooks.azure.com/cntk/libraries/tutorials

Seek help on Stack Overflow: http://stackoverflow.com/search?q=cntk (please add cntk tag)
