Deep Learning - LMU Munich
2016/06/01
Deep Learning
Presenter: Dr. Denis Krompaß
Siemens Corporate Technology – Machine Intelligence
[email protected]
Slides: Dr. Denis Krompaß and Dr. Sigurd Spieckermann
Deep Learning vs. Classic Data Modeling

- RULE-BASED SYSTEMS: INPUT → HAND-DESIGNED PROGRAM → OUTPUT
- CLASSIC MACHINE LEARNING: INPUT → HAND-DESIGNED FEATURES → MAPPING FROM FEATURES → OUTPUT
- REPRESENTATION LEARNING: INPUT → FEATURES → MAPPING FROM FEATURES → OUTPUT
- DEEP LEARNING: INPUT → FEATURES → ADDITIONAL LAYERS OF MORE ABSTRACT FEATURES → MAPPING FROM FEATURES → OUTPUT

In the latter two, the feature extraction stages are LEARNED FROM DATA.

SOURCE: http://www.deeplearningbook.org/contents/intro.html
Deep Learning
Hierarchical Feature Extraction

This illustration only shows the idea! In reality the learned features are abstract and hard to interpret most of the time.

SOURCE: http://www.eidolonspeak.com/Artificial_Intelligence/SOA_P3_Fig4.png
Deep Learning
Hierarchical Feature Extraction

SOURCE: Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1701-1708).
(Classic) neural networks are an important building block of Deep Learning, but there is more to it.

NEURAL NETWORKS HAVE BEEN AROUND FOR DECADES!
What’s new?
OPTIMIZATION & LEARNING
- Optimization algorithms: adaptive learning rates (e.g. ADAM), Evolution Strategies, Synthetic Gradients, Asynchronous Training, …
- Reparameterization: Batch Normalization, Weight Normalization, …
- Regularization: Dropout, DropConnect, DropPath, …

MODEL ARCHITECTURES
- Building blocks: spatial/temporal pooling, attention mechanism, variational layers, dilated convolution, variable-length sequence modeling, macro modules (e.g. Residual Units), factorized layers, …
- Architectures: neural computers and memories, general purpose image feature extractors (VGG, GoogLeNet), end-to-end models, Generative Adversarial Networks, …

SOFTWARE
- Theano*, Keras, Blocks
- TensorFlow, Keras, Sonnet, TensorFlow Fold
- Torch7, Caffe, …

GENERAL
- GPUs, hardware accessibility (Cloud), distributed learning, data

* deprecated
Enabler: Tools
It has never been that easy to build deep learning models!
Enabler: Data
Deep Learning requires tons of labeled data if the problem is really complex.

# Labeled examples     Example problems solved in the world
1 – 10                 Not worth a try.
10 – 100               Toy datasets.
100 – 1,000            Toy datasets.
1,000 – 10,000         Hand-written digit recognition.
10,000 – 100,000       Text generation.
100,000 – 1,000,000    Question answering, chat bots.
> 1,000,000            Multi-language text translation. Object recognition in images/videos.
Enabler: Computing Power for Everyone
GPUs
H = X·W,  X ∈ ℝ^(n×m), W ∈ ℝ^(m×k)

Matrix products are highly parallelizable. Distributed training enables us to train very large deep learning models on tons of data.
Deep Learning Research
Companies People
Jürgen Schmidhuber
Geoffrey Hinton
Yann LeCun
Yoshua Bengio
Andrew Ng
Lecture Overview
• Part I – Deep Learning Model Architecture Design
• Part II – Training Deep Learning Models
• Part III – Deep Learning and Artificial (General) Intelligence
Deep Learning
Part I
Deep Learning Model Architecture Design
Part I – Deep Learning Model Architecture

Basic Building Blocks
- The fully connected layer – Using brute force.
- Convolutional neural network layers – Exploiting neighborhood relations.
- Recurrent neural network layers – Exploiting sequential relations.

Thinking in Macro Structures
- Mixing things up – Generating purpose modules.
- LSTMs and Gating – Simple memory management.
- Attention – Dynamic context driven information selection.
- Inception – Dynamic receptive field expansion.
- Residual Units – Building ultra deep structures.

End-to-End Model Design
- Example for design choices.
- Real examples.
Deep Learning
Basic Building Blocks
Neural Network Basics
Linear Regression

[Diagram: INPUTS and a bias unit (1) connected directly to a single OUTPUT]
Neural Network Basics
Logistic Regression

[Diagram: INPUTS and a bias unit (1) connected to a single OUTPUT through a logistic function]
Neural Network Basics
Multi-Layer Perceptron

[Diagram: INPUTS → HIDDEN → OUTPUT, each layer with a bias unit (1)]

Hidden layer:  h^(1) = φ(x W^(1) + b^(1))
Output layer:  ŷ = φ(h^(1) W^(2) + b^(2))

With activation function φ(z), e.g. tanh(z), relu(z), …
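The two layer equations above can be sketched in NumPy (a minimal forward pass; the layer sizes, the tanh hidden activation, and the identity output are illustrative assumptions, not taken from the slide):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, phi=np.tanh):
    """One-hidden-layer MLP: h = phi(x W1 + b1), y_hat = h W2 + b2."""
    h = phi(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2     # output layer (identity activation, e.g. for regression)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                    # one example with 4 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # hidden layer: 4 -> 3
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # output layer: 3 -> 1
y_hat = mlp_forward(x, W1, b1, W2, b2)         # shape (1, 1)
```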
Neural Network Basics
Activation Functions

[Diagram: INPUTS → HIDDEN → OUTPUT, each layer with a bias unit (1)]

Output activation function by task:

Activation function    Task
identity(h)            Regression
logistic(h)            Binary classification
softmax(h)             Multi-class classification

Hidden activation functions: identity(h), tanh(h), relu(h), …

Nice overview on activation functions:
https://en.wikipedia.org/wiki/Activation_function
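The table's output activations can be written out directly in NumPy (a minimal sketch; the max-shift in softmax is a standard numerical-stability trick, not something the slide specifies):

```python
import numpy as np

def logistic(h):
    """Maps any real value into (0, 1) -- binary classification output."""
    return 1.0 / (1.0 + np.exp(-h))

def relu(h):
    """Rectified linear unit -- a common hidden activation."""
    return np.maximum(0.0, h)

def softmax(h):
    """Normalizes a score vector into a probability distribution -- multi-class output."""
    e = np.exp(h - np.max(h, axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([1.0, 2.0, 3.0]))  # three class probabilities summing to 1
```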
Basic Building Blocks
The Fully Connected Layer

[Diagram: INPUTS → HIDDEN with a bias unit (1); a single example is a row vector, n examples stack into a matrix]

Passing one example:
h^(1) = φ(x W^(1) + b^(1)),  x ∈ ℝ^(1×m), W^(1) ∈ ℝ^(m×k), b^(1) ∈ ℝ^(1×k)

Passing n examples:
H^(1) = φ(X W^(1) + b^(1)),  X ∈ ℝ^(n×m), W^(1) ∈ ℝ^(m×k), b^(1) ∈ ℝ^(1×k)
Basic Building Blocks
The Fully Connected Layer – Stacking

[Diagram: INPUTS → HIDDEN → HIDDEN → …, each layer with a bias unit (1)]

h^(1) = φ(x W^(1) + b^(1))
h^(2) = φ(h^(1) W^(2) + b^(2))
…
h^(l) = φ(h^(l-1) W^(l) + b^(l))

Example: W^(1) ∈ ℝ^(3×4), W^(2) ∈ ℝ^(4×3)
Basic Building Blocks
The Fully Connected Layer – Using Brute Force

Brute force layer:
- Exploits no assumptions about the inputs.
  → No weight sharing.
- Simply combines all inputs with each other.
  → Expensive! Often responsible for the largest amount of parameters in a deep learning model.
- Use with care since it can quickly over-parameterize the model.
  → Can lead to degenerated solutions.

Examples:
- Two consecutive fully connected layers with 1,000 hidden neurons each: 1,000,000 free parameters!
- An RGB image of shape 100 x 100 x 3: 3,000,000 free parameters for a fully connected layer with 100 hidden units!
Basic Building Blocks
Convolutional Layer – Convolution of Filters

[Diagram: a small FILTER slides over the INPUT IMAGE (height × width), producing a FEATURE MAP]
Basic Building Blocks
(Valid) Convolution of Filter

[Diagram: the FILTER is placed on one patch of the INPUT IMAGE; each valid placement produces one entry of the FEATURE MAP]

h_{u,z} = Σ_{i,j} w_{i,j} · x_{u+i, z+j} + b
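The sum above, evaluated at every valid filter position, can be sketched as a naive single-channel NumPy loop (for illustration only; any framework would use a far faster implementation, and like most deep learning libraries this computes cross-correlation rather than a flipped-kernel convolution):

```python
import numpy as np

def valid_conv2d(x, w, b=0.0):
    """Valid 'convolution': h[u, z] = sum_{i, j} w[i, j] * x[u + i, z + j] + b."""
    H, W = x.shape
    fh, fw = w.shape
    out = np.empty((H - fh + 1, W - fw + 1))   # feature map shrinks by filter size - 1
    for u in range(out.shape[0]):
        for z in range(out.shape[1]):
            out[u, z] = np.sum(w * x[u:u + fh, z:z + fw]) + b
    return out

x = np.arange(16.0).reshape(4, 4)
h = valid_conv2d(x, np.ones((3, 3)))   # 4x4 input, 3x3 filter -> 2x2 feature map
```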
Basic Building Blocks
Convolutional Layer – Single Filter

[Diagram: one 3 x 3 filter convolved over a single-channel input image yields one feature map]

Layer weights: 1 × 3 × 3 × 1
Basic Building Blocks
Convolutional Layer – Multiple Filters

[Diagram: k = 4 filters (3 x 3) convolved over a single-channel input image yield k feature maps]

Layer weights: 1 × 3 × 3 × 4
Basic Building Blocks
Convolutional Layer – Multi-Channel Input

[Diagram: one 3 x 3 filter spanning all k = 3 input channels yields a single feature map]

Layer weights: 3 × 3 × 3 × 1
Basic Building Blocks
Convolutional Layer – Multi-Channel Input

[Diagram: four 3 x 3 filters, each spanning all 3 input channels, yield four feature maps]

Layer weights: 3 × 3 × 3 × 4
Basic Building Blocks
Convolutional Layer – Stacking

[Diagram: t filters (3 x 3) applied to the k feature maps of the previous layer yield t feature maps]

Layer weights: k × 3 × 3 × t
Basic Building Blocks
Convolutional Layer – Receptive Field Expansion

T A A A T C T G G T C

Convolutional filter (1 x 3): a single node in the feature map is the output of the filter that moves over patches of the input.

Feature map after applying a 1 x 3 filter to the zero-padded sequence:
0 T A A A T C T G G T C 0
Convolutional Layer – Receptive Field Expansion

0 T A A A T C T G G T C 0  (zero padding)

[Diagram: stacking 1 x 3 convolutions; with each additional layer a node sees a wider window of the input]

Expansion of the receptive field for a 1 x 3 filter after i layers: 2i + 1
Convolutional Layer – Receptive Field Expansion with Pooling Layer

0 T A A A T C T G G T C 0  (zero padding)

[Diagram: a pooling layer (width 2, stride 2) halves the resolution, so subsequent filters cover a larger part of the input]
Convolutional Layer – Receptive Field Expansion with Dilation

0 T A A A T C T G G T C 0  (zero padding)

[Diagram: stacked dilated 1 x 3 convolutions, doubling the dilation at every layer]

Expansion of the receptive field for a 1 x 3 filter after i layers: 2^(i+1) − 1

Paper: https://arxiv.org/pdf/1511.07122.pdf
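The two growth formulas for stacked 1 x 3 filters (linear without dilation, exponential with dilation doubling per layer) can be checked with a couple of one-liners (the layer index i starts at 1, matching the slides' formulas):

```python
def rf_stacked(i):
    """Receptive field after i stacked 1x3 convolutions: each layer adds 2 -> 2i + 1."""
    return 2 * i + 1

def rf_dilated(i):
    """Receptive field after i stacked 1x3 convolutions with dilations 1, 2, 4, ...:
    the window doubles per layer -> 2^(i+1) - 1."""
    return 2 ** (i + 1) - 1

linear = [rf_stacked(i) for i in (1, 2, 3)]   # 3, 5, 7
dilated = [rf_dilated(i) for i in (1, 2, 3)]  # 3, 7, 15
```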
Basic Building Blocks
Convolutional Layer – Exploiting Neighborhood Relations

Convolutional layer:
- Exploits neighborhood relations of the inputs (e.g. spatial).
- Applies small fully connected layers to small patches of the input.
  → Very efficient! Weight sharing.
  → Number of free parameters: #input channels × filter height × filter width × #filters
- The receptive field can be increased by stacking multiple layers.
- Should only be used if there is a notion of neighborhood in the input: text, images, sensor time-series, videos, …

Example: for an RGB image of shape 100 x 100 x 3, a convolutional layer with 100 hidden units (filters) of filter size 3 x 3 has only 2,700 free parameters!
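Plugging the RGB example into the slide's parameter formula shows the contrast with the 3,000,000 parameters of the fully connected layer from earlier:

```python
def conv_params(in_channels, filter_h, filter_w, n_filters):
    """#input channels x filter height x filter width x #filters (biases ignored)."""
    return in_channels * filter_h * filter_w * n_filters

rgb_conv = conv_params(3, 3, 3, 100)   # 2,700 parameters, independent of image size
```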
Basic Building Blocks
Recurrent Neural Network Layer – The RNN Cell

[Diagram: RNN cell: INPUT_t and the previous STATE h_{t-1} each pass through a fully connected layer (FC), are added (+), and go through an activation function Φ to produce the new STATE h_t]

FC = fully connected layer, + = addition, Φ = activation function

h_t = φ(U h_{t-1} + W x_t + b)
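The recurrence above as a NumPy step function, applied along a short sequence (a minimal sketch; the state/input sizes, the tanh activation, and the zero initial state are illustrative assumptions):

```python
import numpy as np

def rnn_step(h_prev, x_t, U, W, b, phi=np.tanh):
    """One RNN step: h_t = phi(U h_{t-1} + W x_t + b)."""
    return phi(U @ h_prev + W @ x_t + b)

rng = np.random.default_rng(0)
k, m = 3, 4                                   # state size, input size
U = rng.normal(size=(k, k))                   # state-to-state weights
W = rng.normal(size=(k, m))                   # input-to-state weights
b = np.zeros(k)
h = np.zeros(k)                               # initial state h_0 = 0
for x_t in rng.normal(size=(5, m)):           # sweep a sequence of 5 steps
    h = rnn_step(h, x_t, U, W, b)             # same weights reused at every step
```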
Basic Building Blocks
Recurrent Neural Network Layer – Unfolded

[Diagram: the RNN cell unfolded over time: STATE_0 (initialized to 0) is combined with INPUT_0 to produce STATE_1, which is combined with INPUT_1, and so on up to INPUT_T; the same FC weights are reused at every step]

FC = fully connected layer, + = addition, Φ = activation function

SEQUENTIAL INPUT → SEQUENTIAL OUTPUT
Basic Building Blocks
Vanilla Recurrent Neural Network (unfolded)

[Diagram: the unfolded RNN from the previous slide, extended with a fully connected output layer producing OUTPUT_t at every step]

FC = fully connected layer, + = addition, Φ = activation function

SEQUENTIAL INPUT → SEQUENTIAL OUTPUT

Output layer: y_t = φ(V h_t + b)
Basic Building Blocks
Recurrent Neural Network Layer – Stacking

[Diagram: two stacked RNN layers: the states h_{0,t} of the first layer serve as the input sequence of the second layer, which computes its own states h_{1,t}]
Basic Building Blocks
Recurrent Neural Network Layer – Exploiting Sequential Relations

RNN layer:
- Exploits sequential dependencies (the next prediction might depend on things that were observed earlier).
- Applies the same fully connected layer (via parameter sharing) to each step in the input data and combines it with collected information from the past (the hidden state).
  → Directly learns sequential (e.g. temporal) dependencies.
- Stacking can help to learn deeper hierarchical state representations.
- Should only be used if sequential sweeping of the data makes sense: text, sensor time-series, (videos, images), …
- The vanilla RNN is not able to capture long-term dependencies!
- Use with care since it can also quickly over-parameterize the model.
  → Can lead to degenerated solutions.

e.g. videos of frames of shape 100 x 100 x 3
Deep Learning
Thinking in Macro Structures
Thinking in Macro Structures
Remember the Important Things – And Move On

Important things to remember for each building block: purpose, weaknesses, general usage, tweaks.

- FC (Fully Connected Layer): in case no assumptions on the input data can be exploited (treat all inputs as independent).
- CNN (Convolutional Layer): good for exploiting spatial/sequential dependencies in the data.
- RNN (Recurrent Neural Network Layer): good for modeling sequential data with no long-term dependencies.

With these three basic building blocks, we are already able to do amazing stuff!
Thinking in Macro Structures
Mixing Things Up – Generating Purpose Modules

- Given the basic building blocks introduced in the last section, we can construct modules that address certain sub-tasks within the model that might be beneficial for reaching the actual target goal, e.g. gating, attention, hierarchical feature extraction, …
- These modules can further be combined to form even larger modules serving a more complex purpose: LSTMs, Residual Units, Fractal Nets, neural memory management, …
- Finally, all things are further mixed up to form an architecture with many internal mechanisms that enables the model to learn very complex tasks end-to-end: text translation, caption generation, Neural Computer, …
Thinking in Macro Structures
Controlling the Information Flow – Gating

[Diagram: a CONTEXT vector passes through a fully connected layer with sigmoid activation to produce a GATE, which is multiplied element-wise with the TARGET INPUT to give the PASSED INPUT]

σ = sigmoid function
Note: gate and input have equal shapes.
Thinking in Macro Structures
Controlling the Information Flow – Gating in Recurrent Neural Network Cells

[Diagram: an RNN cell extended with an input gate computed from INPUT_t and STATE h_{t-1}]

Goal: control how much information from the current input impacts the hidden state representation.

Note: this made-up example cell shows only the principle, but would not work in practice, since we would also need to control the information flow from the previous state representation to the next (forget gate).

CELL with input gate:
Input gate:                i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
New state representation:  h_t = i_t · φ(W_RNN x_t + U_RNN h_{t-1} + b_RNN)
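In code, the input-gated cell differs from the plain RNN step only by the element-wise multiplication with the gate (a hypothetical minimal implementation of the slide's made-up example cell; sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_rnn_step(h_prev, x_t, Wi, Ui, bi, W, U, b, phi=np.tanh):
    """Input-gated RNN step:
    i_t = sigmoid(Wi x_t + Ui h_{t-1} + bi)
    h_t = i_t * phi(W x_t + U h_{t-1} + b)"""
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # gate in (0, 1), same shape as the state
    return i_t * phi(W @ x_t + U @ h_prev + b)   # element-wise gating of the candidate state

rng = np.random.default_rng(1)
k, m = 3, 4
Wi, Ui, bi = rng.normal(size=(k, m)), rng.normal(size=(k, k)), np.zeros(k)
W, U, b = rng.normal(size=(k, m)), rng.normal(size=(k, k)), np.zeros(k)
h = gated_rnn_step(np.zeros(k), rng.normal(size=m), Wi, Ui, bi, W, U, b)
```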
Thinking in Macro Structures
Remember the Important Things – And Move On

Gate: good for controlling information flow in a network.
Thinking in Macro Structures
Learning Long-Term Dependencies – The LSTM Cell

Forget gate:  f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
Input gate:   i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
Cell state:   c_t = f_t · c_{t-1} + i_t · φ(W_c x_t + U_c h_{t-1} + b_c)
Output gate:  o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
New state:    h_t = o_t · φ(c_t)
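Collected into one NumPy step function, the gate equations look as follows (a minimal sketch; weight shapes and the tanh choice for φ are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, U, b, phi=np.tanh):
    """One LSTM step. W, U, b each hold the weights of the four parts:
    forget gate, input gate, cell candidate, output gate."""
    (Wf, Wi, Wc, Wo), (Uf, Ui, Uc, Uo), (bf, bi, bc, bo) = W, U, b
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)                    # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)                    # input gate
    c_t = f_t * c_prev + i_t * phi(Wc @ x_t + Uc @ h_prev + bc)   # cell state update
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)                    # output gate
    h_t = o_t * phi(c_t)                                          # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
k, m = 3, 4
W = [rng.normal(size=(k, m)) for _ in range(4)]
U = [rng.normal(size=(k, k)) for _ in range(4)]
b = [np.zeros(k) for _ in range(4)]
h, c = np.zeros(k), np.zeros(k)
for x_t in rng.normal(size=(5, m)):      # sweep a short sequence
    h, c = lstm_step(h, c, x_t, W, U, b)
```

Note how the cell state c_t is only ever scaled and added to, which is what lets gradients flow over long time spans.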
Thinking in Macro Structures
Learning Long-Term Dependencies – The LSTM Cell

[Diagram: the LSTM cell. INPUT_t and STATE h_{t-1} feed three gates (forget, input, output) and an inner RNN cell; the forget gate scales CELL STATE c_{t-1}, the input gate scales the candidate, their sum forms CELL STATE c_t, and the output gate produces the new STATE h_t. The gate equations are the same as on the previous slide.]
Thinking in Macro Structures
Learning Long-Term Dependencies – The LSTM Cell

[Diagram: two LSTM cells unrolled over time. The cell at step t-1 receives INPUT_{t-1}, CELL STATE c_{t-2}, and STATE h_{t-2}, and passes CELL STATE c_{t-1} and STATE h_{t-1} on to the cell at step t, which processes INPUT_t.]
Thinking in Macro Structures
Remember the Important Things – And Move On.

LSTM: good for modeling long-term dependencies in sequential data.

PS: The same applies to Gated Recurrent Units (GRUs).
Very good blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Thinking in Macro Structures
Learning to Focus on the Important Things – Attention

Image: http://www.newsworks.org.uk/Topics-themes/87726

[Diagram: a chain of LSTM cells (initial state 0) processing INPUT 0 through INPUT 4.]
Thinking in Macro Structures
Learning to Focus on the Important Things – Attention

[Diagram: ATTENTION. Each INPUT 1…k is scored by some function ("whatever") to produce a weight α_1…α_n; the CONTEXT is the weighted sum of the inputs, with the weights normalized so that Σ_i α_i = 1.]

"Whatever" here means any function that maps some input to a scalar. It is often a multi-layer neural network that is learned together with the rest of the model.
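A small sketch of this weighted sum in NumPy, with a single random linear layer standing in for the learned scoring network (an assumption for illustration, not the lecture's model):

```python
import numpy as np

def attention_pool(inputs, score_fn):
    """Weighted sum of inputs with weights alpha_i that sum to 1 (softmax over scores)."""
    scores = np.array([score_fn(x) for x in inputs])    # one scalar per input
    scores -= scores.max()                              # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()      # softmax: alphas sum to 1
    context = np.sum(alphas[:, None] * inputs, axis=0)  # weighted sum of inputs
    return context, alphas

rng = np.random.default_rng(1)
inputs = rng.normal(size=(5, 8))   # five input vectors of dimension 8
w = rng.normal(size=8)             # stand-in scorer: a single linear layer
context, alphas = attention_pool(inputs, lambda x: float(w @ x))
print(alphas.sum())  # 1.0 up to floating point
```

In a real model the scoring function's parameters receive gradients through the weighted sum, so the network learns which inputs to emphasize.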
Thinking in Macro Structures
Learning to Focus on the Important Things – Attention

Goal: Filter out unimportant words for the target task.

[Diagram: an LSTM chain (initial state 0) reads the words "A", "Cat", …, "Mat"; an attention network scores the states h_0, h_1, …, h_t to produce weights α_1…α_n with Σ_i α_i = 1, and the weighted sum of the inputs forms the context.]

Expectation: The attention network learns to measure the difference between the previous and the current state representation. A low difference means nothing new or important happened, and hence a low weight α.
Thinking in Macro Structures
Remember the Important Things – And Move On.

Attention: good for learning a context-sensitive selection process.

Interactive explanation: http://distill.pub/2016/augmented-rnns/
Thinking in Macro Structures
Dynamic Receptive Fields – The Inception Architecture

[Diagram: the INPUT feeds four parallel branches — a 1×1 convolution; a 1×1 followed by a 3×3 convolution; a 1×1 followed by a 5×5 convolution; and a 3×3 pooling followed by a 1×1 convolution — whose outputs are concatenated.]

• Provides the model with a choice of various filter sizes.
• Allows the model to combine different filter sizes.
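The structural point — equal spatial dimensions allow channel-wise concatenation — can be sketched in NumPy. As a simplifying assumption, every branch is stubbed as a 1×1 convolution (in a real module the 3×3/5×5 convolutions and the pooling would use 'same' padding to preserve H and W):

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution is just a matmul over the channel axis."""
    return x @ w  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

rng = np.random.default_rng(2)
x = rng.normal(size=(28, 28, 16))  # input feature map

# Stand-ins for the four branches; with 'same' padding each branch keeps the
# spatial size, so equal H and W allow concatenation along the channel axis.
branches = [conv1x1(x, rng.normal(size=(16, c))) for c in (8, 12, 4, 8)]

out = np.concatenate(branches, axis=-1)  # concatenate along the channel axis
print(out.shape)  # (28, 28, 32)
```

The concatenated output carries features at several receptive-field sizes side by side, which the next layer can weight freely.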
Thinking in Macro Structures
Dynamic Receptive Fields – The Inception Architecture

[Diagram: several INCEPTION modules stacked on top of the INPUT.]

• Allows the model to explicitly learn its "own" receptive-field expansion.
• Allows the model to learn different levels of receptive-field expansion at the same time:
Ø Might result in a more diverse set of hierarchical features being available in each layer.
Thinking in Macro Structures
Remember the Important Things – And Move On.

Inception: good for learning complex and dynamic receptive-field expansion.

Paper: https://arxiv.org/abs/1409.4842
Thinking in Macro Structures
Remember the Important Things – And Move On.

Residual Units: good for learning very deep networks.

Paper: https://arxiv.org/pdf/1603.05027.pdf
Deep Learning
End-to-End Model Design
End-to-End Model Design
Design Choices – Video Classification Example

[Diagram: each Frame 0…T is encoded by a CNN (hierarchical feature extraction and input compression); the CNN outputs feed a chain of LSTM cells (capturing temporal dependencies); an attention network weights the states (removing unimportant frames); and a final FC layer performs the classification.]

[Example for design choices]
End-to-End Model Design
Real Examples – Deep Face

Image: Hachim El Khiyari, Harry Wechsler. Face Recognition across Time Lapse Using Convolutional Neural Networks. Journal of Information Security, 2016.
End-to-End Model Design
Real Example – Multi-Lingual Neural Machine Translation

Yonghui Wu, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://arxiv.org/abs/1609.08144. 2016.
End-to-End Model Design
Real Examples – WaveNet

Van den Oord et al. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf
Deep Learning
Part II: Deep Learning Model Training
Part II – Training Deep Learning Models

Loss Function Design
• Basic loss functions
• Multi-task learning

Optimization
• Optimization in deep learning
• Work-horse: stochastic gradient descent
• Adaptive learning rates

Regularization
• Weight decay
• Early stopping
• Dropout
• Batch normalization

Distributed Training
• Not covered, but a link to a good overview is included.
Deep Learning
Loss Function Design
Loss Function Design
Regression

Mean Squared Loss

f_loss^R(Y, X, θ) = (1/n) · Σ_i^n (y_i − f_θ(x_i))²

The network output can be anything:
Ø Use no activation function in the output layer!

| Example ID | Target Value (y_i) | Prediction (f_θ(x_i)) | Example Error |
|---|---|---|---|
| 1 | 4.2 | 4.1 | 0.01 |
| 2 | 2.4 | 0.4 | 4 |
| 3 | -2.9 | -1.4 | 2.25 |
| … | … | … | … |
| n | 0 | 1.0 | 1.0 |
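The table rows can be checked directly; a minimal NumPy sketch of the loss:

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared loss over n examples."""
    return np.mean((y - y_hat) ** 2)

# targets and predictions from the table above
y     = np.array([4.2, 2.4, -2.9, 0.0])
y_hat = np.array([4.1, 0.4, -1.4, 1.0])
print((y - y_hat) ** 2)    # per-example errors: 0.01, 4, 2.25, 1.0
print(mse_loss(y, y_hat))  # their mean: 1.815
```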
Loss Function Design
Binary Classification

Binary Cross Entropy (also called Log Loss)

f_loss^BC(Y, X, θ) = −(1/n) · Σ_i^n [ y_i · log(f_θ(x_i)) + (1 − y_i) · log(1 − f_θ(x_i)) ]

The network output needs to be between 0 and 1:
Ø Use the sigmoid activation function in the output layer!
Ø Note: Sometimes there are optimized functions available that operate on the raw outputs (logits).

| Example ID | Target Value (y_i) | Prediction (f_θ(x_i)) | Example Error |
|---|---|---|---|
| 1 | 0 | 0.211 | 0.237 |
| 2 | 1 | 0.981 | 0.019 |
| 3 | 0 | 0.723 | 1.284 |
| … | … | … | … |
| n | 0 | 0.134 | 0.144 |
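A sketch that reproduces the per-example errors of the table:

```python
import numpy as np

def bce_per_example(y, p):
    """Per-example binary cross entropy (log loss)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 1, 0, 0])
p = np.array([0.211, 0.981, 0.723, 0.134])
print(np.round(bce_per_example(y, p), 3))  # approx. 0.237, 0.019, 1.284, 0.144
```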
Loss Function Design
Multi-Class Classification

Cross Entropy (essentially the same as perplexity in NLP)

f_loss^MCC(Y, X, θ) = −(1/n) · Σ_i^n Σ_j^c y_{i,j} · log(f_θ(x_i)_j)

The network output needs to represent a probability distribution over c classes, i.e. Σ_j^c f_θ(x_i)_j = 1:
Ø Use the softmax activation function in the output layer!
Ø Note: Sometimes there are optimized functions available that operate on the raw outputs (logits).

| Example ID | Target Value (y_i) | Prediction (f_θ(x_i)) | Example Error |
|---|---|---|---|
| 1 | [0, 0, 1] | [0.2, 0.2, 0.6] | 0.511 |
| 2 | [1, 0, 0] | [0.3, 0.5, 0.2] | 1.20 |
| 3 | [0, 1, 0] | [0.1, 0.7, 0.3] | 0.357 |
| … | … | … | … |
| n | [0, 0, 1] | [0.0, 0.01, 0.99] | 0.01 |
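With one-hot targets, the inner sum reduces to the negative log-probability of the true class, which makes the per-example error easy to compute:

```python
import numpy as np

def cross_entropy_per_example(Y, P):
    """Cross entropy for one-hot targets Y and predicted distributions P:
    the negative log-probability the model assigns to the true class."""
    true_class = Y.argmax(axis=1)
    return -np.log(P[np.arange(len(P)), true_class])

Y = np.array([[0, 0, 1], [1, 0, 0], [0, 0, 1]])  # one-hot targets
P = np.array([[0.2, 0.2, 0.6],                   # predicted distributions
              [0.3, 0.5, 0.2],
              [0.0, 0.01, 0.99]])
print(np.round(cross_entropy_per_example(Y, P), 3))  # approx. 0.511, 1.204, 0.010
```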
Loss Function Design
Multi-Label Classification

Multi-label classification loss function (just the sum of the log loss for each class)

f_loss^MLC(Y, X, θ) = −(1/n) · Σ_i^n Σ_j^c [ y_{i,j} · log(f_θ(x_i)_j) + (1 − y_{i,j}) · log(1 − f_θ(x_i)_j) ]

Each network output needs to be between 0 and 1:
Ø Use the sigmoid activation function on each network output!
Ø Note: Sometimes there are optimized functions available that operate on the raw outputs (logits).

| Example ID | Target Value (y_i) | Prediction (f_θ(x_i)) | Example Error |
|---|---|---|---|
| 1 | [0, 0, 1] | [0.2, 0.4, 0.6] | 1.245 |
| 2 | [1, 0, 1] | [0.3, 0.9, 0.2] | 5.116 |
| 3 | [0, 1, 0] | [0.1, 0.7, 0.1] | 0.567 |
| … | … | … | … |
| n | [1, 1, 1] | [0.8, 0.9, 0.99] | 0.339 |
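Each label contributes an independent binary log loss; summing over labels reproduces the table:

```python
import numpy as np

def multilabel_loss_per_example(Y, P):
    """Sum of binary log losses over the c labels of each example."""
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P), axis=1)

Y = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 1, 1]])
P = np.array([[0.2, 0.4, 0.6], [0.3, 0.9, 0.2], [0.1, 0.7, 0.1], [0.8, 0.9, 0.99]])
print(np.round(multilabel_loss_per_example(Y, P), 3))  # approx. 1.245, 5.116, 0.567, 0.339
```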
Loss Function Design
Multi-Task Learning

Additive Cost Function

f_loss^MT(Y_0, …, Y_K, X_0, …, X_K, θ) = Σ_k^K λ_k · f_loss^k(Y_k, X_k, θ)

Each network output has associated input and target data and an associated loss metric:
Ø Use a proper output activation for each of the K output layers!
Ø The weighting λ_k of each task in the cost function is derived from prior knowledge/assumptions or by trial and error.
Ø Note that we could learn multiple tasks from the same data. This is represented by copies of the corresponding data in the formula above; when implementing it, we would of course not copy the data.

Examples:
• Auxiliary heads for counteracting vanishing gradients (GoogLeNet, https://arxiv.org/abs/1409.4842)
• Artistic style transfer (Neural Artistic Style Transfer, https://arxiv.org/abs/1508.06576)
• Instance segmentation (Mask R-CNN, https://arxiv.org/abs/1703.06870)
• …
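The additive cost itself is just a weighted sum. In this sketch the task names, loss values, and weights are made up for illustration:

```python
# Hypothetical per-task losses, already computed on their own (Y_k, X_k) batches.
task_losses = {"classification": 0.62, "auxiliary_head": 0.97, "style": 1.40}
lambdas     = {"classification": 1.0,  "auxiliary_head": 0.3,  "style": 0.1}

total = sum(lambdas[k] * task_losses[k] for k in task_losses)
print(round(total, 3))  # 1.051
```

Since the total is a plain sum, automatic differentiation propagates each task's gradient into the shared parameters θ, scaled by its λ_k.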
Deep Learning
Optimization
Optimization
Learning the Right Parameters in Deep Learning

• Neural networks are composed of differentiable building blocks.
• Training a neural network means minimizing some non-convex, differentiable loss function using iterative gradient-based optimization methods.
• The simplest and most widely used optimization algorithm is gradient descent.
Optimization
Gradient Descent

θ_t = θ_{t-1} − η · g_t, with g_t = ∇_θ f_loss(Y, X, θ_{t-1})

the gradient with respect to the model parameters θ.

We update the parameters a little bit in the direction where the error gets smaller.

[Plot: a loss curve with a negative gradient to the left of the minimum and a positive gradient to the right; in both cases the update moves θ_t toward the minimum.]

You can think of the gradient as the local slope with respect to each parameter θ_i at step t.
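The update rule on a one-dimensional toy loss, as a sketch:

```python
# Gradient descent on f(theta) = (theta - 3)^2, whose gradient is
# g = 2 * (theta - 3); the minimum is at theta = 3.
theta, eta = 0.0, 0.1
for t in range(100):
    g = 2 * (theta - 3)      # gradient at the current parameters
    theta = theta - eta * g  # step against the gradient
print(round(theta, 4))  # 3.0
```

Each step shrinks the distance to the minimum by the factor (1 − 2η), so after 100 steps θ has essentially converged.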
Optimization
Work-Horse: Stochastic Gradient Descent

θ_t = θ_{t-1} − η · g_t^(s), with g_t^(s) = ∇_θ f_loss(Y^(s), X^(s), θ_{t-1})

the gradient with respect to the model parameters θ.

We update the parameters a little bit in the direction where the error gets smaller.

Stochastic gradient descent is gradient descent on samples (mini-batches) of the data:
• It increases the variance of the gradients.
Ø This supposedly helps to jump out of local minima.
• But essentially, it is just super efficient and it works!

In the following we will omit the superscript s, and X will always represent a mini-batch of samples from the data.
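A sketch of mini-batch SGD on a toy linear-regression problem (the data, learning rate, and batch size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=256)  # ground truth [2, -1]

theta, eta, batch = np.zeros(2), 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))                  # reshuffle each epoch
    for s in range(0, len(X), batch):              # one mini-batch per step
        Xb, yb = X[idx[s:s+batch]], y[idx[s:s+batch]]
        g = 2 / len(Xb) * Xb.T @ (Xb @ theta - yb) # gradient of the MSE on the batch
        theta -= eta * g                           # SGD update
print(np.round(theta, 2))  # close to [2, -1]
```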
Optimization
Computing the Gradient

I have to compute the gradient of that??? Sounds complicated!

θ_t = θ_{t-1} − η · g_t, with g_t = ∇_θ f_loss(Y, X, θ_{t-1})

Image: Hachim El Khiyari, Harry Wechsler. Face Recognition across Time Lapse Using Convolutional Neural Networks. Journal of Information Security, 2016.
Optimization
Automatic Differentiation

θ_t = θ_{t-1} − η · g_t, with g_t = ∇_θ f_loss(Y, X, θ_{t-1})
Optimization
Automatic Differentiation

AUTOMATIC DIFFERENTIATION IS AN EXTREMELY POWERFUL FEATURE FOR DEVELOPING MODELS WITH DIFFERENTIABLE OPTIMIZATION OBJECTIVES
Optimization
Wait a Minute, I Thought Neural Networks Were Optimized via Backpropagation

Backpropagation is just a fancy name for applying the chain rule to compute the gradients in neural networks!
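The chain rule applied by hand on a tiny composite function, checked against a numerical derivative; the function f(x) = σ(w·x)² is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, x = 0.5, 2.0
z = w * x       # forward pass: z = w * x
a = sigmoid(z)  #               a = sigma(z)
f = a ** 2      #               f = a^2

df_da = 2 * a                   # backward pass: d f / d a
da_dz = a * (1 - a)             #                d sigma / d z
dz_dw = x                       #                d z / d w
grad_w = df_da * da_dz * dz_dw  # chain rule: multiply local derivatives

# sanity check against a central numerical difference
eps = 1e-6
num = (sigmoid((w + eps) * x) ** 2 - sigmoid((w - eps) * x) ** 2) / (2 * eps)
print(abs(grad_w - num) < 1e-6)  # True
```

Automatic differentiation frameworks do exactly this bookkeeping (store forward intermediates, multiply local derivatives backwards) for arbitrarily large graphs.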
Optimization
Stochastic Gradient Descent – Problems with Constant Learning Rates

θ_t = θ_{t-1} − η · g_t

[Plot of a loss surface illustrating four problem regions:]
1. Low gradient
2. Steep gradient
3. Flat gradient
4. Gradients of different parameters vary

Excellent overview and explanation: http://sebastianruder.com/optimizing-gradient-descent/
Optimization
Stochastic Gradient Descent – Problems with Constant Learning Rates

θ_t = θ_{t-1} − η · g_t

1. Learning rate too small
2. Learning rate too large
3. Getting stuck in zero-gradient regions
4. The learning rate may need to be parameter-specific

Excellent overview and explanation: http://sebastianruder.com/optimizing-gradient-descent/
Optimization
Stochastic Gradient Descent – Adding Momentum

v_t = λ · v_{t-1} − η · g_t
θ_t = θ_{t-1} + v_t
with g_t = ∇_θ f_loss(Y, X, θ_{t-1})

λ is the decay ("friction"), η the constant learning rate.

1. Adding the previous step size can lead to acceleration: the step size accumulates momentum if successive gradients have the same direction.
2. The step size decreases quickly if the direction of the gradients changes.
3. Momentum decays only slowly and does not stop immediately.

Excellent overview and explanation: http://sebastianruder.com/optimizing-gradient-descent/
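The velocity-based update in a sketch on a 1-D quadratic (the learning rate and decay values are illustrative):

```python
# Momentum SGD on f(theta) = theta^2, whose gradient is g = 2 * theta.
theta, v = 5.0, 0.0
eta, lam = 0.05, 0.9  # learning rate and decay ("friction")
for t in range(300):
    g = 2 * theta
    v = lam * v - eta * g  # velocity accumulates past gradients
    theta = theta + v      # the parameter update uses the velocity
print(abs(theta) < 1e-2)  # True: converged near the minimum at 0
```

With λ = 0.9 the trajectory overshoots and oscillates around the minimum before settling, which is the "does not stop immediately" behavior noted above.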
Optimization
Stochastic Gradient Descent – Adaptive Learning Rates (RMSProp)

θ_{t,i} = θ_{t-1,i} − η′_i · g_{t,i}
E[g²]_{t,i} = β · E[g²]_{t-1,i} + (1 − β) · g²_{t,i}
η′_i = η / (√(E[g²]_{t,i}) + ε)

This is the update rule with an individual learning rate for each parameter θ_i. The learning rate is adapted by a decaying mean of past squared gradients; the last line is the correction of the (constant) learning rate for each parameter, where ε is only for numerical stability.

• For each parameter, an individual learning rate is computed.
• Continuously small gradients increase the effective learning rate.
• Continuously large gradients decrease the effective learning rate.

Excellent overview and explanation: http://sebastianruder.com/optimizing-gradient-descent/
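A sketch of the three equations on a quadratic with very different curvatures per parameter, which is exactly the situation an adaptive rate helps with (all constants here are illustrative):

```python
import numpy as np

# RMSProp on f(theta) = theta_1^2 + 100 * theta_2^2.
theta = np.array([1.0, 1.0])
avg_sq = np.zeros(2)  # decaying mean of squared gradients, E[g^2]
eta, beta, eps = 0.01, 0.9, 1e-8
for t in range(500):
    g = np.array([2 * theta[0], 200 * theta[1]])
    avg_sq = beta * avg_sq + (1 - beta) * g ** 2
    theta -= eta / (np.sqrt(avg_sq) + eps) * g  # per-parameter learning rate
print(np.abs(theta).max() < 0.05)  # True: both parameters converged
```

Dividing by √(E[g²]) roughly normalizes the step size, so the steep direction is no longer stepped 100 times harder than the flat one.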
Optimization
Stochastic Gradient Descent – Overview of Common Step Rules

[Table: which of the four problems above (1–4) each step rule addresses — Constant Learning Rate, Constant Learning Rate with Annealing, Momentum, Nesterov, AdaDelta, RMSProp, RMSProp + Momentum, ADAM — with the adaptive-rate methods covering the most cases.]

This does not mean that it cannot make sense to use only a constant learning rate!

Excellent overview and explanation: http://sebastianruder.com/optimizing-gradient-descent/
Optimization
Something Feels Terribly Wrong Here, Can You See It?

θ_t = θ_{t-1} − η · g_t, with g_t = ∇_θ f_loss(Y, X, θ_{t-1})
Deep Learning
Regularization
Regularization
Why Regularization is Important

• The goal of learning is not to find a solution that explains the training data perfectly.
• The goal of learning is to find a solution that generalizes well to unseen data points.
• Regularization tries to prevent the model from fitting the training data in an arbitrary way (overfitting).

[Plot: a model fit that passes exactly through the training data points but misses the test data.]
Regularization
Weight Decay – Constraining Parameter Values

Intuition:
• Discourage the model from choosing undesired values for its parameters during learning.

General Approach:
• Put prior assumptions on the weights; deviations from these assumptions get penalized.

Examples:
L2 regularization (squared L2 norm, or Gaussian prior): ‖θ‖₂² = Σ_{i,j} θ²_{i,j}
L1 regularization: ‖θ‖₁ = Σ_{i,j} |θ_{i,j}|

The regularization term is just added to the cost function for the training:

f_loss^total(Y, X, θ) = f_loss(Y, X, θ) + λ · ‖θ‖₂²

λ is a tuning parameter that determines how strongly the regularization affects learning.
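In code the penalty is one extra term on the cost; the parameter values and data loss below are made up for illustration:

```python
import numpy as np

def l2_penalty(theta, lam):
    """lambda times the squared L2 norm of the parameters."""
    return lam * np.sum(theta ** 2)

theta = np.array([0.5, -2.0, 1.0])
data_loss = 0.42  # hypothetical loss on the training batch
total = data_loss + l2_penalty(theta, lam=0.01)
print(round(total, 4))  # 0.4725
```

Because the penalty's gradient is 2λθ, each update additionally pulls every weight a little toward zero, hence the name "weight decay".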
Regularization
Early Stopping – Stop Training Just in Time

Problem
• There might be a point during training where the model starts to overfit the training data at the cost of generalization.

Approach
• Separate additional data from the training data and consistently monitor the error on this validation dataset.
• Stop the training if the error on this dataset does not improve, or gets worse, over a certain number of training iterations.
• It is assumed that the validation set approximates the model's generalization error (on the test data).

[Plot: the training error keeps decreasing over the training iterations while the validation error and the (unknown) test error start rising again; "Stop!" marks the validation-error minimum. A highly idealistic view of how early stopping (should) work.]
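The stopping criterion is usually a patience counter. This sketch runs over a made-up sequence of validation errors (in practice these come from evaluating the model after each epoch):

```python
# Early stopping with a patience counter.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60]
patience, best, best_epoch, waited = 3, float("inf"), -1, 0

for epoch, err in enumerate(val_errors):
    if err < best:
        best, best_epoch, waited = err, epoch, 0  # improvement: reset the counter
    else:
        waited += 1                               # no improvement this epoch
        if waited >= patience:
            break                                 # stop training
print(best_epoch, best)  # 4 0.48
```

In a real training loop one would also checkpoint the parameters at `best_epoch` and restore them after stopping.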
Regularization
Dropout – Make Nodes Expendable

Problem
• Deep learning models are often heavily over-parameterized, which allows the model to overfit on, or even memorize, the training data.

Approach
• Randomly set output neurons to zero.
Ø This transforms the network into an ensemble of an exponential number of weaker learners whose parameters are shared.

Usage
• Primarily used in fully connected layers because of their large number of parameters.
• Rarely used in convolutional layers.
• Rarely used in recurrent neural networks (if at all, between the hidden state and the output).

[Diagram: a fully connected network in which randomly chosen hidden units are set to 0.0 during training.]
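A sketch of the mechanism in NumPy, using the common "inverted dropout" implementation choice (rescaling at training time so nothing changes at inference):

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: zero units with probability p_drop and rescale the
    survivors, so no change is needed at inference time."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(4)
h = np.ones(10_000)
out = dropout(h, p_drop=0.5, rng=rng)
print((out == 0).mean())  # roughly 0.5 of the units are dropped
print(out.mean())         # roughly 1.0: the expectation is preserved by rescaling
```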
Regularization
Batch Normalization – Avoiding Covariate Shift

Problem
• Deep neural networks suffer from internal covariate shift, which makes training harder.

Approach
• Normalize the inputs of each layer (to zero mean and unit variance).
Ø This regularizes because the network no longer produces deterministic values in each layer for a given training example.

Usage
• Can be used with all layers (FC, RNN, Conv).
• With convolutional layers, the mini-batch statistics are computed from all patches in the mini-batch.

Normalize the input X of layer k by the mini-batch moments:

X̂^(k) = (X^(k) − μ_X^(k)) / σ_X^(k)

The next step gives the model the freedom of learning to undo the normalization if needed:

X̃^(k) = γ^(k) · X̂^(k) + β^(k)

The two steps in one formula:

X̃^(k) = (γ^(k) / σ_X^(k)) · X^(k) + (β^(k) − γ^(k) · μ_X^(k) / σ_X^(k))

Note: At inference time, an unbiased estimate of the mean and standard deviation, computed from all mini-batches seen during training, is used.
Deep Learning
Distributed Training
http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
Deep Learning
Part III: Deep Learning and Artificial (General) Intelligence
Part III – Deep Learning and Artificial (General) Intelligence

Deep Reinforcement Learning
• Brief introduction to the problem setting
• End-to-end models for control
• Resources

Deep Learning as a Building Block for Artificial Intelligence
• Think it over – not all classifications happen in the blink of an eye.
• Store and retrieve important information dynamically – managing explicit memories
• Considering long-term consequences – simulating before acting
• Being a multi-talent – multi-task learning and transfer learning
Deep Learning + Reinforcement Learning = Deep Reinforcement Learning
Deep Reinforcement Learning
The Reinforcement Learning Setting
Deep Reinforcement Learning
The Reinforcement Learning Setting

Carefully, and often manually, designed state representation
Deep Reinforcement Learning
Model-Free Deep Reinforcement Learning

Use deep learning to automatically extract meaningful features from the state representation.
The state representation consists of 'raw' observations from the environment.
Note: 'raw' is not meant literally here; you still have to preprocess the data in a reasonable way.
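The model-free idea above can be sketched in a few lines: the agent learns a value function directly from 'raw' observation vectors instead of hand-designed state features. The toy 5-state chain environment, one-hot observations, linear Q-function (a stand-in for a deep network), and all hyper-parameters below are illustrative assumptions, not content from the slides.

```python
import random

N_STATES, N_ACTIONS = 5, 2  # chain of 5 states; actions: 0 = left, 1 = right

def observe(state):
    """'Raw' observation: a one-hot vector, no hand-crafted features."""
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]

def step(state, action):
    """Deterministic chain; reward 1.0 for reaching the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

# Linear Q-function: one weight vector per action (a stand-in for a deep net).
w = [[0.0] * N_STATES for _ in range(N_ACTIONS)]

def q(obs, action):
    return sum(wi * oi for wi, oi in zip(w[action], obs))

random.seed(0)
alpha, gamma = 0.2, 0.9
for episode in range(500):
    s = random.randrange(N_STATES - 1)   # random start state for coverage
    for _ in range(20):
        a = random.randrange(N_ACTIONS)  # random behaviour policy (off-policy)
        s2, r = step(s, a)
        obs, obs2 = observe(s), observe(s2)
        # Q-learning: TD target with a max over next-state actions.
        target = r + gamma * max(q(obs2, b) for b in range(N_ACTIONS))
        delta = target - q(obs, a)
        for i in range(N_STATES):
            w[a][i] += alpha * delta * obs[i]
        s = s2
        if r == 1.0:
            break

# Greedy policy: should move right (action 1) from every non-terminal state.
policy = [max(range(N_ACTIONS), key=lambda b: q(observe(s), b))
          for s in range(N_STATES - 1)]
print(policy)
```

Replacing the linear Q-function with a deep network over pixel observations gives the DQN-style setup the slides allude to; the learning loop itself stays the same.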
Deep Reinforcement Learning
Model-Based Deep Reinforcement Learning

Use deep learning to automatically extract meaningful features from the state representation.
The state representation consists of 'raw' observations from the environment.
Use deep learning to learn a simulator for the environment.
Note: 'raw' is not meant literally here; you still have to preprocess the data in a reasonable way.
Deep Reinforcement Learning
Model-Based Deep Reinforcement Learning

Use deep learning to perform planning.
The state representation consists of 'raw' observations from the environment.
Use deep learning to learn a simulator for the environment.
Note: 'raw' is not meant literally here; you still have to preprocess the data in a reasonable way.
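The two model-based steps above (learn a simulator, then plan inside it) can be sketched as follows. The toy line-world, the table-lookup "model" standing in for a learned neural simulator, and the random-shooting planner are all illustrative assumptions.

```python
import random

GOAL, LOW, HIGH = 7, 0, 9
ACTIONS = (-1, +1)

def env_step(pos, action):
    """Real environment: move on a bounded line; reward peaks at GOAL."""
    nxt = min(max(pos + action, LOW), HIGH)
    return nxt, -abs(nxt - GOAL)

# 1) Collect experience with a random policy and fit the 'simulator'.
random.seed(0)
model = {}  # (pos, action) -> (next_pos, reward); stand-in for a learned net
pos = LOW
for _ in range(500):
    a = random.choice(ACTIONS)
    nxt, r = env_step(pos, a)
    model[(pos, a)] = (nxt, r)
    pos = nxt

# 2) Plan inside the learned model: random shooting over action sequences.
def plan(pos, horizon=8, n_candidates=200):
    best_return, best_first = float('-inf'), ACTIONS[0]
    for _ in range(n_candidates):
        seq = [random.choice(ACTIONS) for _ in range(horizon)]
        p, total = pos, 0.0
        for a in seq:
            nxt_r = model.get((p, a))
            if nxt_r is None:     # unseen transition: score pessimistically
                total -= 100.0
                break
            p, r = nxt_r
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

# Execute the planned first action, then replan (model-predictive control).
pos = LOW
for _ in range(12):
    pos, _ = env_step(pos, plan(pos))
print(pos)
```

The key point the slides make survives even in this sketch: all planning rollouts happen inside the learned model, so the real environment is only touched once per executed action.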
Deep Reinforcement Learning
End-to-End Models for Control

See also: Facebook and Intel reign supreme in 'Doom' AI deathmatch. https://www.engadget.com/2016/09/22/facebook-and-intel-reign-supreme-in-doom-ai-deathmatch/
Moving away from feeding carefully hand-designed (state) features into the models.
Deep Reinforcement Learning
Resources
https://gym.openai.com/
Deep Learning as a Building Block for Artificial Intelligence
Deep Learning as Building Block for AI
Think It Over: Not All Decisions Happen in the Blink of an Eye
Hierarchical feature extraction in the visual cortex.
“For some types of tasks (e.g. for images presented briefly and out of context), it is thought that visual processing in the brain is hierarchical – one layer feeds into the next, computing progressively more complex features. This is the inspiration for the ‘layered’ design of modern feed-forward neural networks.” Image (c) Jonas Kubilius
Deep Learning as Building Block for AI
Non-Stationary Feed-Forward Passes in Deep Neural Networks

[Figure: an image is passed through the network repeatedly; at step t = 1 the activations a0 (initialized as a 1-vector) produce output o1, and at step t = 2 the activations a1 produce output o2 via a softmax. The attention parameters are sampled from a Gaussian distribution that is learned during training.]

Multiple passes of an image through a network to reevaluate the final decision.

Deep Networks with Internal Selective Attention through Feedback Connections. Marijn Stollenga, Jonathan Masci, Faustino Gomez, Juergen Schmidhuber. arXiv:1407.3068, 2014.
Deep Learning as Building Block for AI
Non-Stationary Feed-Forward Passes in Deep Neural Networks

[Figure: at step t = 2, the activations a1 (computed from a(t-1)) produce output o2 through a softmax, which gives the final output (classification). The attention parameters are sampled from a Gaussian distribution that is learned during training.]

Multiple passes of an image through a network to reevaluate the final decision.
Deep Learning as Building Block for AI
Store and Retrieve Important Information Dynamically – Managing Explicit Memories

Hybrid computing using a neural network with dynamic external memory. Alex Graves et al. (Nature, 2016)
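One ingredient of memory-augmented networks like the DNC cited above is content-based addressing: a controller emits a key vector, and the read vector is a softmax-weighted blend of memory rows, weighted by similarity to the key. The tiny memory matrix, key, and sharpness value below are made-up examples for illustration, not the paper's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def content_read(memory, key, beta=10.0):
    """Differentiable read: sharpness beta scales the similarities."""
    weights = softmax([beta * cosine(row, key) for row in memory])
    read = [sum(w * row[i] for w, row in zip(weights, memory))
            for i in range(len(memory[0]))]
    return read, weights

memory = [
    [1.0, 0.0, 0.0],   # slot 0
    [0.0, 1.0, 0.0],   # slot 1
    [0.0, 0.0, 1.0],   # slot 2
]
key = [0.9, 0.1, 0.0]  # most similar to slot 0
read, weights = content_read(memory, key)
print([round(w, 3) for w in weights])
```

Because the read is a soft, differentiable mixture rather than a hard lookup, gradients flow through the addressing, which is what lets such memories be trained end-to-end.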
Deep Learning as Building Block for AI
Considering Long-Term Consequences – Simulating Before Acting
Deep Learning as Building Block for AI
Planning in Perfect Information Games
Deep Learning as Building Block for AI
Dealing with Intractably Many Game States
https://blogs.loc.gov/maps/category/game-theory/
As in many real-life settings, the whole game tree cannot be explored. For this reason we need automated methods that help to explore the game tree in a reasonable way!
=> Deep Learning
http://paulomenin.github.io/go-presentation/images/goban.png
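The idea on this slide (use a learned evaluation to decide which branches of an intractable tree to expand first) can be sketched with a best-first search. Everything below is a made-up illustration: the 'game' of building a 3-digit number toward a target, and a hand-written heuristic standing in for a learned value network.

```python
import heapq

TARGET = 472
MAX_DEPTH = 3  # leaves are 3-digit numbers, so the full tree has 10**3 leaves

def children(state):
    """Successor states: append one digit 0-9."""
    return [state * 10 + d for d in range(10)]

def value_net(state, depth):
    """Stand-in for a learned value network: scores how promising a
    partial move sequence (digit prefix) looks for reaching TARGET."""
    return -abs(state - TARGET // 10 ** (MAX_DEPTH - depth))

def best_first_search(budget=40):
    """Expand at most `budget` nodes, always the most promising first."""
    counter = 0  # tie-breaker so the heap never compares states directly
    frontier = [(-value_net(0, 0), counter, 0, 0)]  # (-value, tie, state, depth)
    best_value, best_state = float('-inf'), None
    expanded = 0
    while frontier and expanded < budget:
        _, _, state, depth = heapq.heappop(frontier)
        expanded += 1
        if depth == MAX_DEPTH:            # leaf: evaluate, do not expand
            if value_net(state, depth) > best_value:
                best_value, best_state = value_net(state, depth), state
            continue
        for nxt in children(state):
            counter += 1
            heapq.heappush(frontier,
                           (-value_net(nxt, depth + 1), counter, nxt, depth + 1))
    return best_state, expanded

state, expanded = best_first_search()
print(state, expanded)
```

With a good evaluation, the search reaches the target leaf after expanding only a few dozen nodes instead of enumerating all 1000 leaves; replacing the heuristic with a trained network (and best-first search with MCTS) gives the AlphaGo-style recipe the slide points at.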
Deep Learning as Building Block for AI
Being a Multi-Talent – Multi-Task Learning and Transfer Learning
Progressive Neural Networks
Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, Raia Hadsell
arXiv:1606.04671, 2016
Challenge: Today, we are able to train systems that sometimes show superhuman performance in very complex tasks (e.g. AlphaGo).
However, the same systems fail miserably when directly applied to any other (much simpler) task.
Things we did not cover (not complete…)
Encoder-Decoder Networks
Generative adversarial networks
Variational Approaches
Layer Compression (e.g. Tensor-Trains)
Evolutionary Methods for Model Training
Pixel RNN/CNN
Dealing with Variable Length Inputs and Outputs
Maxout Networks
Learning to learn
Transfer Learning
Siamese Networks
Sequence Generation
Mask R-CNN
Multi-Lingual Neural Machine Translation
Recursive Neural Networks
Character Level Neural Machine Translation
Neural Artistic Style Transfer
Neural Question Answering
Weight Normalization
Weight Sharing
(Unsupervised) pre-training
Highway Networks
Fractal Networks
Deep Q-Learning
Speech Modeling
Vanishing/Exploding Gradient
Hessian-free optimization
More loss functions
Hyper-parameter tuning
Benchmark datasets
Mechanism for training ultra deep networks
Initialization
Pre-Training
Distributed Training
Deep Learning
Because it Works
Recommended Material
http://cs231n.stanford.edu/
http://cs231n.github.io
Recommended Material
http://cs224d.stanford.edu/
Recommended Material
INTRODUCTION
• Tutorial on Neural Networks (Deep Learning and Unsupervised Feature Learning): http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
• Deep Learning for Computer Vision lecture: http://cs231n.stanford.edu (http://cs231n.github.io)
• Deep Learning for NLP lecture: http://cs224d.stanford.edu (http://cs224d.stanford.edu/syllabus.html)
• Deep Learning for NLP (without magic) tutorial: http://lxmls.it.pt/2014/socher-lxmls.pdf (Videos from NAACL 2013: http://nlp.stanford.edu/courses/NAACL2013)
• Bengio's Deep Learning book: http://www.deeplearningbook.org
Recommended Material
PARAMETER INITIALIZATION
• Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International Conference on Artificial Intelligence and Statistics. 2010.
• He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

BATCH NORMALIZATION
• Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning (pp. 448-456).
• Cooijmans, T., Ballas, N., Laurent, C., & Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

DROPOUT
• Hinton, Geoffrey E., et al. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580 (2012).
• Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
Recommended Material
OPTIMIZATION & TRAINING
• Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12, 2121-2159.
• Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
• Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2.
• Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML) (pp. 1139-1147).
• Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Martens, J., & Sutskever, I. (2012). Training deep and recurrent networks with Hessian-free optimization. In Neural networks: Tricks of the trade (pp. 479-535). Springer Berlin Heidelberg.
Recommended Material
COMPUTER VISION
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
• Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1701-1708).
• Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
• Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
• Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems (pp. 2008-2016).
• Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).
• Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of The 32nd International Conference on Machine Learning (pp. 2048-2057).
• Johnson, J., Karpathy, A., & Fei-Fei, L. (2015). DenseCap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571.
Recommended Material
NATURAL LANGUAGE PROCESSING
• Bengio, Y., Schwenk, H., Senécal, J. S., Morin, F., & Gauvain, J. L. (2006). Neural probabilistic language models. In Innovations in Machine Learning (pp. 137-186). Springer Berlin Heidelberg.
• Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493-2537.
• Mikolov, T. (2012). Statistical language models based on neural networks (Doctoral dissertation, Brno University of Technology).
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
• Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (pp. 746-751).
• Socher, R. (2014). Recursive Deep Learning for Natural Language Processing and Computer Vision (Doctoral dissertation, Stanford University).
• Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
• Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.