Basic Structure: Fully Connected Feedforward...
Network Structure
Hung-yi Lee (李宏毅)
Three Steps for Deep Learning
Step 1: Neural Network
Step 2: Cost Function
Step 3: Optimization

Step 1. A neural network is a function composed of simple functions (neurons)
➢ Usually we design the network structure, and let the machine find the parameters from data
Step 2. The cost function evaluates how good a set of parameters is
➢ We design the cost function based on the task
Step 3. Find the best function set (e.g. by gradient descent)
Outline
• Basic structure (3/03)
• Fully Connected Layer
• Recurrent Structure
• Convolutional/Pooling Layer
• Special Structure (3/17)
• Spatial Transformation Layer
• Highway Network / Grid LSTM
• Recursive Structure
• Batch Normalization
• Sequence-to-sequence / Attention (3/24)
Prerequisite
• Brief Introduction of Deep Learning
• https://youtu.be/Dr-WRlEFefw?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Convolutional Neural Network
• https://youtu.be/FrKWiRv254g?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part I)
• https://youtu.be/xCGidAeyS4M?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
• Recurrent Neural Network (Part II)
• https://www.youtube.com/watch?v=rTqmWlnwz_0&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&index=25
Basic Structure: Fully Connected Layer
Fully Connected Layer

Layer $l-1$ has $N_{l-1}$ nodes with outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$; layer $l$ has $N_l$ nodes with outputs $a_1^l, a_2^l, \dots, a_i^l, \dots$

Output of a neuron: $a_i^l$ is the output of neuron $i$ at layer $l$.
Output of one layer: $\mathbf{a}^l$, the vector collecting all the $a_i^l$.
Fully Connected Layer

$w_{ij}^l$: the weight from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.

Collecting all the weights from layer $l-1$ to layer $l$:
$$W^l = \begin{bmatrix} w_{11}^l & w_{12}^l & \cdots \\ w_{21}^l & w_{22}^l & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \quad \text{an } N_l \times N_{l-1} \text{ matrix}$$
Fully Connected Layer

$b_i^l$: the bias for neuron $i$ at layer $l$.
$\mathbf{b}^l = \begin{bmatrix} b_1^l & b_2^l & \cdots & b_i^l & \cdots \end{bmatrix}^T$: the biases for all the neurons in layer $l$.
Fully Connected Layer

$z_i^l$: the input of the activation function for neuron $i$ at layer $l$:
$$z_i^l = w_{i1}^l a_1^{l-1} + w_{i2}^l a_2^{l-1} + \cdots + b_i^l = \sum_{j=1}^{N_{l-1}} w_{ij}^l a_j^{l-1} + b_i^l$$
$\mathbf{z}^l$: the inputs of the activation function for all the neurons in layer $l$.
Relations between Layer Outputs

The computation flows $\mathbf{a}^{l-1} \rightarrow \mathbf{z}^l \rightarrow \mathbf{a}^l$: the outputs $a_1^{l-1}, a_2^{l-1}, \dots, a_j^{l-1}, \dots$ of layer $l-1$ determine $z_1^l, z_2^l, \dots, z_i^l, \dots$, which in turn determine the outputs $a_1^l, a_2^l, \dots, a_i^l, \dots$ of layer $l$.
Relations between Layer Outputs ($\mathbf{a}^{l-1} \rightarrow \mathbf{z}^l$)

Componentwise:
$$z_1^l = w_{11}^l a_1^{l-1} + w_{12}^l a_2^{l-1} + \cdots + b_1^l$$
$$z_2^l = w_{21}^l a_1^{l-1} + w_{22}^l a_2^{l-1} + \cdots + b_2^l$$
$$\vdots$$
$$z_i^l = w_{i1}^l a_1^{l-1} + w_{i2}^l a_2^{l-1} + \cdots + b_i^l$$

In matrix form:
$$\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix} = \begin{bmatrix} w_{11}^l & w_{12}^l & \cdots \\ w_{21}^l & w_{22}^l & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \begin{bmatrix} a_1^{l-1} \\ a_2^{l-1} \\ \vdots \\ a_j^{l-1} \\ \vdots \end{bmatrix} + \begin{bmatrix} b_1^l \\ b_2^l \\ \vdots \\ b_i^l \\ \vdots \end{bmatrix}$$
that is, $\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l$.
Relations between Layer Outputs ($\mathbf{z}^l \rightarrow \mathbf{a}^l$)

Each component passes through the activation function: $a_i^l = \sigma(z_i^l)$, i.e.
$$\begin{bmatrix} a_1^l \\ a_2^l \\ \vdots \\ a_i^l \\ \vdots \end{bmatrix} = \sigma\left(\begin{bmatrix} z_1^l \\ z_2^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix}\right), \quad \text{i.e. } \mathbf{a}^l = \sigma(\mathbf{z}^l)$$
Relations between Layer Outputs

Putting the two steps together:
$$\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l$$
$$\mathbf{a}^l = \sigma(\mathbf{z}^l)$$
$$\Rightarrow \mathbf{a}^l = \sigma(W^l \mathbf{a}^{l-1} + \mathbf{b}^l)$$
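As a concrete illustration, here is a minimal NumPy sketch of this forward computation; the toy layer sizes and the choice of a logistic sigmoid for $\sigma$ are assumptions for the example, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(a_prev, W, b):
    """a^l = sigma(W^l a^(l-1) + b^l)."""
    z = W @ a_prev + b        # z^l, shape (N_l,)
    return sigmoid(z)         # a^l

# Toy sizes: layer l-1 has 4 nodes, layer l has 3 nodes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # W^l is N_l x N_(l-1)
b = rng.normal(size=3)        # b^l
a_prev = rng.normal(size=4)   # a^(l-1)
print(fc_forward(a_prev, W, b))
```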
Basic Structure: Recurrent Structure
Simplify the network by using the same function again and again
Reference
• https://www.cs.toronto.edu/~graves/preprint.pdf
• K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, "LSTM: A Search Space Odyssey," in IEEE Transactions on Neural Networks and Learning Systems, 2016
• Rafal Józefowicz, Wojciech Zaremba, Ilya Sutskever, "An Empirical Exploration of Recurrent Network Architectures," in ICML, 2015
Recurrent Neural Network
• Given function $f$: $h', y = f(h, x)$

Unrolled over time:
$$h^1, y^1 = f(h^0, x^1) \qquad h^2, y^2 = f(h^1, x^2) \qquad h^3, y^3 = f(h^2, x^3) \qquad \cdots$$

No matter how long the input/output sequence is, we only need one function $f$.
$h$ and $h'$ are vectors with the same dimension.
Deep RNN
Stack two (or more) recurrent functions $f_1$ and $f_2$:
$$h', y = f_1(h, x) \qquad b', c = f_2(b, y) \qquad \cdots$$
At each time step, the output $y$ of the first layer becomes the input of the second layer.
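A sketch of the two-layer case, in the same spirit as the run_rnn loop above; f1 and f2 are any step functions with the signatures from the slide:

```python
def run_deep_rnn(f1, f2, h0, b0, xs):
    """Layer 2 consumes the per-step outputs of layer 1."""
    h, b, cs = h0, b0, []
    for x in xs:
        h, y = f1(h, x)   # h', y = f1(h, x)
        b, c = f2(b, y)   # b', c = f2(b, y)
        cs.append(c)
    return cs
```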
Bidirectional RNN
$f_1$ processes the sequence $x^1, x^2, x^3, \dots$ in the forward direction, $f_2$ processes it in the reverse direction, and $f_3$ combines the two streams at each time step:
$$h', a = f_1(h, x) \qquad b', c = f_2(b, x) \qquad y = f_3(a, c)$$
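A sketch of the bidirectional wiring under the same assumptions (abstract step functions f1 and f2, and a combiner f3):

```python
def run_bidirectional(f1, f2, f3, h0, b0, xs):
    h, a_seq = h0, []
    for x in xs:                   # forward direction
        h, a = f1(h, x)
        a_seq.append(a)
    b, c_seq = b0, []
    for x in reversed(xs):         # reverse direction
        b, c = f2(b, x)
        c_seq.append(c)
    c_seq.reverse()                # align c^t with a^t
    return [f3(a, c) for a, c in zip(a_seq, c_seq)]
```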
Pyramidal RNN
• Reducing the number of time steps
W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition,” ICASSP, 2016
Naïve RNN
• Given function $f$: $h', y = f(h, x)$
Ignoring the bias here:
$$h' = \sigma(W^h h + W^i x)$$
$$y = \mathrm{softmax}(W^o h')$$
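In NumPy, one naïve RNN step might look like this sketch (weight shapes are assumptions; the bias is ignored as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def naive_rnn_step(h, x, Wh, Wi, Wo):
    """h' = sigma(Wh h + Wi x), y = softmax(Wo h')."""
    h_new = sigmoid(Wh @ h + Wi @ x)
    y = softmax(Wo @ h_new)
    return h_new, y
```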
LSTM
A naïve RNN carries one state $h^{t-1} \rightarrow h^t$; an LSTM carries two states, $c$ and $h$:
• $c$ changes slowly: $c^t$ is $c^{t-1}$ with something added
• $h$ changes faster: $h^t$ and $h^{t-1}$ can be very different
LSTM
The input $x^t$ and the previous output $h^{t-1}$ produce the candidate value $z$ and the three gates $z^i$, $z^f$, $z^o$:
$$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^i = \sigma\left(W^i \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^f = \sigma\left(W^f \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right) \qquad z^o = \sigma\left(W^o \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$$
"Peephole" variant: $c^{t-1}$ is also fed into the candidate and the gates,
$$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \\ c^{t-1} \end{bmatrix}\right)$$
where the block of $W$ acting on $c^{t-1}$ is diagonal; $z^i$, $z^f$, $z^o$ are obtained in the same way.
The cell state, output, and prediction are then:
$$c^t = z^f \odot c^{t-1} + z^i \odot z$$
$$h^t = z^o \odot \tanh(c^t)$$
$$y^t = \sigma(W' h^t)$$
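A NumPy sketch of one LSTM step following these equations (peephole omitted; the concatenated-input weight layout and the name Wy for $W'$ are assumptions of this sketch):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, Wi, Wf, Wo, Wy):
    """One step: candidate and gates from [x^t; h^(t-1)], then update c and h."""
    xh = np.concatenate([x, h_prev])   # [x^t; h^(t-1)]
    z  = np.tanh(W  @ xh)              # candidate value
    zi = sigmoid(Wi @ xh)              # input gate
    zf = sigmoid(Wf @ xh)              # forget gate
    zo = sigmoid(Wo @ xh)              # output gate
    c = zf * c_prev + zi * z           # c^t = z^f ⊙ c^(t-1) + z^i ⊙ z
    h = zo * np.tanh(c)                # h^t = z^o ⊙ tanh(c^t)
    y = sigmoid(Wy @ h)                # y^t = sigma(W' h^t)
    return h, c, y
```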
LSTM
Unrolling two consecutive steps: the $h^t$ and $c^t$ computed at step $t$ are fed, together with $x^{t+1}$, into the identical cell at step $t+1$ to produce $y^{t+1}$, $h^{t+1}$, $c^{t+1}$; the same weights are reused at every step.
GRU
From $h^{t-1}$ and $x^t$, compute a reset gate $r$ and an update gate $z$. The reset gate masks $h^{t-1}$ before the candidate state $h'$ is computed, and the update gate interpolates between the old state and the candidate:
$$h^t = z \odot h^{t-1} + (1 - z) \odot h'$$
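A NumPy sketch of one GRU step; the weight names Wr, Wz, W and the exact inputs to each gate follow the standard GRU formulation, which I assume matches the figure:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, Wr, Wz, W):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                                   # reset gate
    z = sigmoid(Wz @ hx)                                   # update gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]))  # h'
    return z * h_prev + (1 - z) * h_cand                   # h^t
```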
Example Task
• (Simplified) Speech Recognition: frame classification on TIMIT
Each acoustic frame $x^1, x^2, x^3, x^4, \dots$ of an utterance receives a label $y^1, y^2, y^3, y^4, \dots$:
Utterance 1: TSI TSI TSI I I N N N ……
Utterance 2: S S @ @ @ @ ……
Target Delay
• Only for unidirectional RNN
True labels:   TSI TSI TSI I I N N N
Delay 3 steps: x x x TSI TSI TSI I I N N N
The labels are shifted 3 frames later, so the network has already seen some future frames before it must output each label.
Findings on this task:
• LSTM > RNN > feedforward
• Bidirectional > unidirectional (combining the forward direction and the reverse direction)
• Training LSTM is faster than RNN
LSTM: A Search Space Odyssey
• Standard LSTM works well
• Simplify LSTM: coupling the input and forget gates, removing peephole connections
• The forget gate is critical for performance
• The output gate activation function is critical
An Empirical Exploration of Recurrent Network Architectures
• Importance: forget > input > output
• A large bias for the forget gate is helpful
• LSTM-f/i/o: removing the forget/input/output gate
• LSTM-b: large bias
Stack RNN
Armand Joulin, Tomas Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets," arXiv preprint, 2015
At each step, $f$ reads $x^t$ and the top of an external stack, and outputs $y^t$ together with (i) a distribution over the actions Push, Pop, and Nothing (e.g. 0.7, 0.2, 0.1) and (ii) the information to store when pushing. The new stack is the probability-weighted mixture of the three candidate stacks: (push result) × 0.7 + (pop result) × 0.2 + (do-nothing result) × 0.1.
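A small NumPy sketch of this differentiable stack update; the fixed stack depth (the bottom row falls off on a push) is a simplifying assumption of the sketch, not a detail from the slide:

```python
import numpy as np

def soft_stack_update(stack, probs, v):
    """stack: (depth, dim); probs: [p_push, p_pop, p_nothing]; v: info to store."""
    p_push, p_pop, p_nothing = probs
    zero_row = np.zeros((1, stack.shape[1]))
    pushed  = np.vstack([v[None, :], stack[:-1]])  # v on top, bottom row falls off
    popped  = np.vstack([stack[1:], zero_row])     # top removed, zeros at bottom
    nothing = stack
    return p_push * pushed + p_pop * popped + p_nothing * nothing

stack = np.zeros((5, 3))
stack = soft_stack_update(stack, [0.7, 0.2, 0.1], np.array([1.0, 2.0, 3.0]))
```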
Basic Structure: Convolutional / Pooling Layer
Simplify the neural network (based on prior knowledge of the task)
Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Receptive Field: different neurons have different, but overlapping, receptive fields.
Convolutional Layer
Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: the neurons with different receptive fields can use the same set of parameters. This gives fewer parameters than a fully connected layer.
Convolutional Layer
Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2): each filter is one set of shared parameters applied at different positions.
Filter (kernel) size: the size of the receptive field of a neuron. Here the stride, the distance between neighboring receptive fields, is 2.
Kernel size, number of filters, and stride are all designed by the developers.
Example – 1D Signal + Single Channel
The input is a sequence $x_1, x_2, x_3, x_4, x_5, \dots$ (an audio signal, stock values, ...), and the filter neurons 1, 2, 3, 4 slide over it.
Tasks: classification, predicting the future, ...
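A minimal NumPy sketch of this 1D convolution; the kernel values below are made up purely for illustration:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1D convolution (no padding): one filter, single channel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x1 ... x5
kernel = np.array([1.0, 0.0, -1.0])        # kernel size 3 (weights assumed)
print(conv1d(x, kernel))                   # [-2. -2. -2.]
```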
Example – 1D Signal + Multiple Channels
A document: each word is a vector, so the sequence "I like this movie very much ……" becomes $x_1, x_2, x_3, x_4, x_5, x_6, x_7$ with one channel per vector dimension.
Does this kind of receptive field make sense?
Example – 2D Signal + Single Channel
A 6 x 6 black & white image (positions indexed 1-6, 7-12, 13-18, ... row by row):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Only 1 filter is shown here. The size of the receptive field is 3x3, and the stride is 1.
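The same idea in 2D, using the 6 x 6 image above; the diagonal 3x3 kernel is an arbitrary choice for illustration:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Valid 2D convolution: single channel, one filter."""
    kh, kw = kernel.shape
    H, W = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * kernel)
                      for j in range(0, W - kw + 1, stride)]
                     for i in range(0, H - kh + 1, stride)])

img = np.array([[1, 0, 0, 0, 0, 1],
                [0, 1, 0, 0, 1, 0],
                [0, 0, 1, 1, 0, 0],
                [1, 0, 0, 0, 1, 0],
                [0, 1, 0, 0, 1, 0],
                [0, 0, 1, 0, 1, 0]], dtype=float)
kernel = np.eye(3)            # a 3x3 diagonal filter (weights assumed)
print(conv2d(img, kernel))    # 4x4 output with stride 1
```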
Example – 2D Signal + Multiple Channels
A 6 x 6 colorful image: each pixel has several values (e.g. R, G, B), so the input is one 6 x 6 grid per channel. Only 1 filter is shown here. The size of the receptive field is 3x3x3, and the stride is 1.
Zero Padding
With zero padding, zeros are appended around the border of the input so that the filter can also be centered on the edge positions; without zero padding, the output is smaller than the input.
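A quick NumPy illustration of the effect, assuming a kernel of size 3 and the conv1d sketch from earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_padded = np.pad(x, 1)    # [0. 1. 2. 3. 4. 5. 0.]
# With kernel size 3: valid convolution on x gives 5 - 3 + 1 = 3 outputs,
# while on x_padded it gives 7 - 3 + 1 = 5 outputs, the original length.
```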
Pooling Layer
Layer $l-1$ has $N$ nodes; layer $l$ has $N/k$ nodes. Every $k$ outputs in layer $l-1$ are grouped together, and each output in layer $l$ "summarizes" $k$ inputs:
Average Pooling: $a^l = \frac{1}{k} \sum_{j=1}^{k} a_j^{l-1}$
Max Pooling: $a^l = \max\left(a_1^{l-1}, a_2^{l-1}, \dots, a_k^{l-1}\right)$
L2 Pooling: $a^l = \sqrt{\frac{1}{k} \sum_{j=1}^{k} \left(a_j^{l-1}\right)^2}$
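A NumPy sketch of all three pooling operations over groups of k consecutive outputs (assuming N is divisible by k, as on the slide):

```python
import numpy as np

def pool1d(a_prev, k, mode="max"):
    """Group every k outputs of layer l-1 into one output of layer l."""
    groups = a_prev.reshape(-1, k)          # N/k groups of k values
    if mode == "average":
        return groups.mean(axis=1)
    if mode == "max":
        return groups.max(axis=1)
    if mode == "l2":
        return np.sqrt((groups ** 2).mean(axis=1))
    raise ValueError(mode)

a = np.array([1.0, -2.0, 3.0, 4.0, 0.0, -1.0])
print(pool1d(a, 2, "max"))       # [1. 4. 0.]
```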
Pooling Layer
Which outputs should be grouped together?
Option 1: group the neurons corresponding to the same filter with nearby receptive fields (subsampling).
Pooling Layer
Which outputs should be grouped together?
Option 2: group the neurons with the same receptive field (this is the Maxout Network). How can you know whether the neurons detect the same pattern?
Combination of Different Basic Layers
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," in INTERSPEECH, 2015
(The figures show the CLDNN architecture, which combines convolutional, recurrent (3 layers), and fully connected layers.)
Next Time
• 3/10: TAs will teach TensorFlow
• TensorFlow for regression
• TensorFlow for word vector
• word vector: https://www.youtube.com/watch?v=X7PH3NuYW0Q
• TensorFlow for CNN
• If you want to learn Theano
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
• http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20RNN.ecm.mp4/index.html