Basic Structure: Fully Connected Feedforward...
Network Structure
Hung-yi Lee (李宏毅)
Three Steps for Deep Learning
Step 1: Neural Network
Step 2: Cost Function
Step 3: Optimization

Step 1. A neural network is a function composed of simple functions (neurons)
➢ Usually we design the network structure, and let the machine find the parameters from data
Step 2. The cost function evaluates how good a set of parameters is
➢ We design the cost function based on the task
Step 3. Find the best function set (e.g. by gradient descent)
Outline
• Basic structure (3/03)
• Fully Connected Layer
• Recurrent Structure
• Convolutional/Pooling Layer
• Special Structure (3/17)
• Spatial Transformation Layer
• Highway Network / Grid LSTM
• Recursive Structure
• Batch Normalization
• Sequence-to-sequence / Attention (3/24)
Prerequisite
• Brief Introduction of Deep Learning
• https://youtu.be/Dr-WRlEFefw?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Convolutional Neural Network
• https://youtu.be/FrKWiRv254g?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part I)
• https://youtu.be/xCGidAeyS4M?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part II)
• https://www.youtube.com/watch?v=rTqmWlnwz_0&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=25
Basic Structure: Fully Connected Layer
Fully Connected Layer

Layer $l-1$ has $N_{l-1}$ nodes with outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$; layer $l$ has $N_l$ nodes with outputs $a_1^l, a_2^l, \dots, a_i^l, \dots$

Output of a neuron: $a_i^l$ is the output of neuron $i$ at layer $l$.
Output of one layer: $\mathbf{a}^l$, the vector collecting all the $a_i^l$.
Fully Connected Layer

$w_{ij}^l$: the weight from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.

Collecting all the weights from layer $l-1$ to layer $l$:
$$W^l = \begin{bmatrix} w_{11}^l & w_{12}^l & \cdots \\ w_{21}^l & w_{22}^l & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \quad \text{an } N_l \times N_{l-1} \text{ matrix}$$
Fully Connected Layer

$b_i^l$: the bias for neuron $i$ at layer $l$.
$\mathbf{b}^l = \begin{bmatrix} b_1^l & b_2^l & \cdots & b_i^l & \cdots \end{bmatrix}^T$: the biases for all the neurons in layer $l$.
Fully Connected Layer

$z_i^l$: the input of the activation function for neuron $i$ at layer $l$:
$$z_i^l = w_{i1}^l a_1^{l-1} + w_{i2}^l a_2^{l-1} + \cdots + b_i^l = \sum_{j=1}^{N_{l-1}} w_{ij}^l a_j^{l-1} + b_i^l$$
$\mathbf{z}^l$: the inputs of the activation function for all the neurons in layer $l$.
Relations between Layer Outputs

The computation flows $\mathbf{a}^{l-1} \rightarrow \mathbf{z}^l \rightarrow \mathbf{a}^l$: the outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$ of layer $l-1$ determine $z_1^l, z_2^l, \dots, z_i^l, \dots$, which in turn determine the outputs $a_1^l, a_2^l, \dots, a_i^l, \dots$ of layer $l$.
Relations between Layer Outputs ($\mathbf{a}^{l-1} \rightarrow \mathbf{z}^l$)

Componentwise:
$$z_1^l = w_{11}^l a_1^{l-1} + w_{12}^l a_2^{l-1} + \cdots + b_1^l$$
$$z_2^l = w_{21}^l a_1^{l-1} + w_{22}^l a_2^{l-1} + \cdots + b_2^l$$
$$\vdots$$
$$z_i^l = w_{i1}^l a_1^{l-1} + w_{i2}^l a_2^{l-1} + \cdots + b_i^l$$

In matrix form:
$$\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix} = \begin{bmatrix} w_{11}^l & w_{12}^l & \cdots \\ w_{21}^l & w_{22}^l & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} a_1^{l-1} \\ a_2^{l-1} \\ \vdots \\ a_j^{l-1} \\ \vdots \end{bmatrix} + \begin{bmatrix} b_1^l \\ b_2^l \\ \vdots \\ b_i^l \\ \vdots \end{bmatrix}$$
that is, $\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l$.
Relations between Layer Outputs ($\mathbf{z}^l \rightarrow \mathbf{a}^l$)

Each component passes through the activation function: $a_i^l = \sigma(z_i^l)$, i.e.
$$\begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \\ a_i^l \\ \vdots \end{bmatrix} = \sigma\left(\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix}\right), \quad \text{i.e. } \mathbf{a}^l = \sigma(\mathbf{z}^l)$$
Relations between Layer Outputs

Putting the two steps together:
$$\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l$$
$$\mathbf{a}^l = \sigma(\mathbf{z}^l)$$
$$\Rightarrow \mathbf{a}^l = \sigma(W^l \mathbf{a}^{l-1} + \mathbf{b}^l)$$
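As a concrete illustration, here is a minimal NumPy sketch of this forward computation; the toy layer sizes and the choice of a logistic sigmoid for $\sigma$ are assumptions for the example, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(a_prev, W, b):
    """a^l = sigma(W^l a^(l-1) + b^l)."""
    z = W @ a_prev + b        # z^l, shape (N_l,)
    return sigmoid(z)         # a^l

# Toy sizes: layer l-1 has 4 nodes, layer l has 3 nodes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # W^l is N_l x N_(l-1)
b = rng.normal(size=3)        # b^l
a_prev = rng.normal(size=4)   # a^(l-1)
print(fc_forward(a_prev, W, b))
```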
Basic Structure: Recurrent Structure
Simplify the network by using the same function again and again
Reference
• https://www.cs.toronto.edu/~graves/preprint.pdf
• K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, "LSTM: A Search Space Odyssey," in IEEE Transactions on Neural Networks and Learning Systems, 2016
• Rafal Józefowicz, Wojciech Zaremba, Ilya Sutskever, "An Empirical Exploration of Recurrent Network Architectures," in ICML, 2015
Recurrent Neural Network
• Given function $f$: $h', y = f(h, x)$

Unrolled over time:
$$h^1, y^1 = f(h^0, x^1) \qquad h^2, y^2 = f(h^1, x^2) \qquad h^3, y^3 = f(h^2, x^3) \qquad \cdots$$

No matter how long the input/output sequence is, we only need one function $f$.
$h$ and $h'$ are vectors with the same dimension.
Deep RNN
Stack two (or more) recurrent functions $f_1$ and $f_2$:
$$h', y = f_1(h, x) \qquad b', c = f_2(b, y) \qquad \cdots$$
At each time step, the output $y$ of the first layer becomes the input of the second layer.
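A sketch of the two-layer case, in the same spirit as the run_rnn loop above; f1 and f2 are any step functions with the signatures from the slide:

```python
def run_deep_rnn(f1, f2, h0, b0, xs):
    """Layer 2 consumes the per-step outputs of layer 1."""
    h, b, cs = h0, b0, []
    for x in xs:
        h, y = f1(h, x)   # h', y = f1(h, x)
        b, c = f2(b, y)   # b', c = f2(b, y)
        cs.append(c)
    return cs
```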
Bidirectional RNN
$f_1$ processes the sequence $x^1, x^2, x^3, \dots$ in the forward direction, $f_2$ processes it in the reverse direction, and $f_3$ combines the two streams at each time step:
$$h', a = f_1(h, x) \qquad b', c = f_2(b, x) \qquad y = f_3(a, c)$$
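A sketch of the bidirectional wiring under the same assumptions (abstract step functions f1 and f2, and a combiner f3):

```python
def run_bidirectional(f1, f2, f3, h0, b0, xs):
    h, a_seq = h0, []
    for x in xs:                   # forward direction
        h, a = f1(h, x)
        a_seq.append(a)
    b, c_seq = b0, []
    for x in reversed(xs):         # reverse direction
        b, c = f2(b, x)
        c_seq.append(c)
    c_seq.reverse()                # align c^t with a^t
    return [f3(a, c) for a, c in zip(a_seq, c_seq)]
```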
Pyramidal RNN
• Reducing the number of time steps
W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition,” ICASSP, 2016
Naïve RNN
• Given function $f$: $h', y = f(h, x)$
Ignoring the bias here:
$$h' = \sigma(W^h h + W^i x)$$
$$y = \mathrm{softmax}(W^o h')$$
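In NumPy, one naïve RNN step might look like this sketch (weight shapes are assumptions; the bias is ignored as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def naive_rnn_step(h, x, Wh, Wi, Wo):
    """h' = sigma(Wh h + Wi x), y = softmax(Wo h')."""
    h_new = sigmoid(Wh @ h + Wi @ x)
    y = softmax(Wo @ h_new)
    return h_new, y
```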
LSTM
A naïve RNN carries one state $h^{t-1} \rightarrow h^t$; an LSTM carries two states, $c$ and $h$:
• $c$ changes slowly: $c^t$ is $c^{t-1}$ with something added
• $h$ changes faster: $h^t$ and $h^{t-1}$ can be very different
LSTM
The input $x^t$ and the previous output $h^{t-1}$ produce the candidate value $z$ and the three gates $z^i$, $z^f$, $z^o$:
$$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^i = \sigma\left(W^i \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^f = \sigma\left(W^f \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^o = \sigma\left(W^o \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$$
"Peephole" variant: $c^{t-1}$ is also fed into the candidate and the gates,
$$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \\ c^{t-1} \end{bmatrix}\right)$$
where the block of $W$ acting on $c^{t-1}$ is diagonal; $z^i$, $z^f$, $z^o$ are obtained in the same way.
The cell state, output, and prediction are then:
$$c^t = z^f \odot c^{t-1} + z^i \odot z$$
$$h^t = z^o \odot \tanh(c^t)$$
$$y^t = \sigma(W' h^t)$$
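A NumPy sketch of one LSTM step following these equations (peephole omitted; the concatenated-input weight layout and the name Wy for $W'$ are assumptions of this sketch):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo, Wy):
    """One step: candidate and gates from [x^t; h^(t-1)], then update c and h."""
    xh = np.concatenate([x, h_prev])   # [x^t; h^(t-1)]
    z  = np.tanh(W  @ xh)              # candidate value
    zi = sigmoid(Wi @ xh)              # input gate
    zf = sigmoid(Wf @ xh)              # forget gate
    zo = sigmoid(Wo @ xh)              # output gate
    c = zf * c_prev + zi * z           # c^t = z^f ⊙ c^(t-1) + z^i ⊙ z
    h = zo * np.tanh(c)                # h^t = z^o ⊙ tanh(c^t)
    y = sigmoid(Wy @ h)                # y^t = sigma(W' h^t)
    return h, c, y
```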
LSTM
Unrolling two consecutive steps: the $h^t$ and $c^t$ computed at step $t$ are fed, together with $x^{t+1}$, into the identical cell at step $t+1$ to produce $y^{t+1}$, $h^{t+1}$, $c^{t+1}$; the same weights are reused at every step.
GRU
From $h^{t-1}$ and $x^t$, compute a reset gate $r$ and an update gate $z$. The reset gate masks $h^{t-1}$ before the candidate state $h'$ is computed, and the update gate interpolates between the old state and the candidate:
$$h^t = z \odot h^{t-1} + (1 - z) \odot h'$$
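A NumPy sketch of one GRU step; the weight names Wr, Wz, W and the exact inputs to each gate follow the standard GRU formulation, which I assume matches the figure:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Wr, Wz, W):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                                   # reset gate
    z = sigmoid(Wz @ hx)                                   # update gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]))  # h'
    return z * h_prev + (1 - z) * h_cand                   # h^t
```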
Example Task
• (Simplified) Speech Recognition: frame classification on TIMIT
Each acoustic frame $x^1, x^2, x^3, x^4, \dots$ of an utterance receives a label $y^1, y^2, y^3, y^4, \dots$:
Utterance 1: TSI TSI TSI I I N N N ……
Utterance 2: S S @ @ @ @ ……
Target Delay
• Only for unidirectional RNN
True labels:   TSI TSI TSI I I N N N
Delay 3 steps: x x x TSI TSI TSI I I N N N
The labels are shifted 3 frames later, so the network has already seen some future frames before it must output each label.
Findings on this task:
• LSTM > RNN > feedforward
• Bidirectional > unidirectional (combining the forward direction and the reverse direction)
• Training LSTM is faster than RNN
LSTM: A Search Space Odyssey
• Standard LSTM works well
• Simplify LSTM: coupling the input and forget gates, removing peephole connections
• The forget gate is critical for performance
• The output gate activation function is critical
An Empirical Exploration of Recurrent Network Architectures
• Importance: forget > input > output
• A large bias for the forget gate is helpful
• LSTM-f/i/o: removing the forget/input/output gate
• LSTM-b: large bias
Stack RNN
Armand Joulin, Tomas Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets," arXiv preprint, 2015
At each step, $f$ reads $x^t$ and the top of an external stack, and outputs $y^t$ together with (i) a distribution over the actions Push, Pop, and Nothing (e.g. 0.7, 0.2, 0.1) and (ii) the information to store when pushing. The new stack is the probability-weighted mixture of the three candidate stacks: (push result) × 0.7 + (pop result) × 0.2 + (do-nothing result) × 0.1.
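A small NumPy sketch of this differentiable stack update; the fixed stack depth (the bottom row falls off on a push) is a simplifying assumption of the sketch, not a detail from the slide:

```python
import numpy as np

def soft_stack_update(stack, probs, v):
    """stack: (depth, dim); probs: [p_push, p_pop, p_nothing]; v: info to store."""
    p_push, p_pop, p_nothing = probs
    zero_row = np.zeros((1, stack.shape[1]))
    pushed  = np.vstack([v[None, :], stack[:-1]])  # v on top, bottom row falls off
    popped  = np.vstack([stack[1:], zero_row])     # top removed, zeros at bottom
    nothing = stack
    return p_push * pushed + p_pop * popped + p_nothing * nothing

stack = np.zeros((5, 3))
stack = soft_stack_update(stack, [0.7, 0.2, 0.1], np.array([1.0, 2.0, 3.0]))
```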
Basic Structure: Convolutional / Pooling Layer
Simplify the neural network (based on prior knowledge of the task)
Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Receptive Field: different neurons have different, but overlapping, receptive fields.
Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: the neurons with different receptive fields can use the same set of parameters. This gives fewer parameters than a fully connected layer.
Convolutional Layer
Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2): each filter is one set of shared parameters applied at different positions.
Filter (kernel) size: the size of the receptive field of a neuron. Here the stride, the distance between neighboring receptive fields, is 2.
Kernel size, number of filters, and stride are all designed by the developers.
Example – 1D Signal + Single Channel
The input is a sequence $x_1, x_2, x_3, x_4, x_5, \dots$ (an audio signal, stock values, ...), and the filter neurons 1, 2, 3, 4 slide over it.
Tasks: classification, predicting the future, ...
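A minimal NumPy sketch of this 1D convolution; the kernel values below are made up purely for illustration:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1D convolution (no padding): one filter, single channel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x1 ... x5
kernel = np.array([1.0, 0.0, -1.0])        # kernel size 3 (weights assumed)
print(conv1d(x, kernel))                   # [-2. -2. -2.]
```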
Example – 1D Signal + Multiple Channels
A document: each word is a vector, so the sequence "I like this movie very much ……" becomes $x_1, x_2, x_3, x_4, x_5, x_6, x_7$ with one channel per vector dimension.
Does this kind of receptive field make sense?
Example – 2D Signal + Single Channel
A 6 x 6 black & white image (positions indexed 1-6, 7-12, 13-18, ... row by row):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Only 1 filter is shown here. The size of the receptive field is 3x3, and the stride is 1.
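The same idea in 2D, using the 6 x 6 image above; the diagonal 3x3 kernel is an arbitrary choice for illustration:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Valid 2D convolution: single channel, one filter."""
    kh, kw = kernel.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * kernel)
                      for j in range(0, W - kw + 1, stride)]
                     for i in range(0, H - kh + 1, stride)])

img = np.array([[1, 0, 0, 0, 0, 1],
                [0, 1, 0, 0, 1, 0],
                [0, 0, 1, 1, 0, 0],
                [1, 0, 0, 0, 1, 0],
                [0, 1, 0, 0, 1, 0],
                [0, 0, 1, 0, 1, 0]], dtype=float)
kernel = np.eye(3)            # a 3x3 diagonal filter (weights assumed)
print(conv2d(img, kernel))    # 4x4 output with stride 1
```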
Example – 2D Signal + Multiple Channels
A 6 x 6 colorful image: each pixel has several values (e.g. R, G, B), so the input is one 6 x 6 grid per channel. Only 1 filter is shown here. The size of the receptive field is 3x3x3, and the stride is 1.
Zero Padding
With zero padding, zeros are appended around the border of the input so that the filter can also be centered on the edge positions; without zero padding, the output is smaller than the input.
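A quick NumPy illustration of the effect, assuming a kernel of size 3 and the conv1d sketch from earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_padded = np.pad(x, 1)    # [0. 1. 2. 3. 4. 5. 0.]
# With kernel size 3: valid convolution on x gives 5 - 3 + 1 = 3 outputs,
# while on x_padded it gives 7 - 3 + 1 = 5 outputs, the original length.
```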
Pooling Layer
Layer $l-1$ has $N$ nodes; layer $l$ has $N/k$ nodes. Every $k$ outputs in layer $l-1$ are grouped together, and each output in layer $l$ "summarizes" $k$ inputs:
Average Pooling: $a^l = \frac{1}{k} \sum_{j=1}^{k} a_j^{l-1}$
Max Pooling: $a^l = \max\left(a_1^{l-1}, a_2^{l-1}, \dots, a_k^{l-1}\right)$
L2 Pooling: $a^l = \sqrt{\frac{1}{k} \sum_{j=1}^{k} \left(a_j^{l-1}\right)^2}$
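A NumPy sketch of all three pooling operations over groups of k consecutive outputs (assuming N is divisible by k, as on the slide):

```python
import numpy as np

def pool1d(a_prev, k, mode="max"):
    """Group every k outputs of layer l-1 into one output of layer l."""
    groups = a_prev.reshape(-1, k)          # N/k groups of k values
    if mode == "average":
        return groups.mean(axis=1)
    if mode == "max":
        return groups.max(axis=1)
    if mode == "l2":
        return np.sqrt((groups ** 2).mean(axis=1))
    raise ValueError(mode)

a = np.array([1.0, -2.0, 3.0, 4.0, 0.0, -1.0])
print(pool1d(a, 2, "max"))       # [1. 4. 0.]
```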
Pooling Layer
Which outputs should be grouped together?
Option 1: group the neurons corresponding to the same filter with nearby receptive fields (subsampling).
Pooling Layer
Which outputs should be grouped together?
Option 2: group the neurons with the same receptive field (this is the Maxout Network). How can you know whether the neurons detect the same pattern?
Combination of Different Basic Layers
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," in INTERSPEECH, 2015
(The figures show the CLDNN architecture, which combines convolutional, recurrent (3 layers), and fully connected layers.)
Next Time
• 3/10: TAs will teach TensorFlow
• TensorFlow for regression
• TensorFlow for word vector
• word vector: https://www.youtube.com/watch?v=X7PH3NuYW0Q
• TensorFlow for CNN
• If you want to learn Theano
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20RNN.ecm.mp4/index.html