Transcript of: Recurrent Neural Networks – uvadlc.github.io/lectures/nov2017/lecture6.pdf

Page 1

Recurrent Neural Networks
Deep Learning – Lecture 5

Efstratios Gavves

Page 2

Sequential Data

So far, all tasks assumed stationary data

Neither all data nor all tasks are stationary, though

Page 5

Sequential Data: Text

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Page 6

Memory needed

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Pr π‘₯ =ΰ·‘

𝑖

Pr π‘₯𝑖 π‘₯1, … , π‘₯π‘–βˆ’1)

Page 7

Sequential data: Video

Page 8

Quiz: Other sequential data?

Page 9

Quiz: Other sequential data?

Time series data

Stock exchange

Biological measurements

Climate measurements

Market analysis

Speech/Music

User behavior in websites

...

Page 11

Machine Translation

The phrase in the source language is one sequence

– "Today the weather is good"

The phrase in the target language is also a sequence

– "Погода сегодня хорошая"

[Figure: an encoder-decoder of LSTM cells. The encoder reads "Today the weather is good <EOS>"; the decoder then generates "Погода сегодня хорошая <EOS>", with each generated word fed back as the next decoder input.]

Page 12

Image captioning

An image is a thousand words, literally!

Pretty much the same as machine translation

Replace the encoder part with the output of a Convnet

– E.g. use Alexnet or a VGG16 network

Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image in place of the encoder; the LSTM decoder then generates the caption "Today the weather is good <EOS>", feeding each generated word back as the next input.]

Page 13

Demo

Click to go to the video on YouTube

Page 14

Question answering

Bleeding-edge research, no real consensus
– Very interesting open research problems

Again, pretty much like machine translation

Again, the encoder-decoder paradigm
– Insert the question into the encoder part
– Model the answer at the decoder part

Question answering with images also
– Again, bleeding-edge research
– How/where to add the image?
– What has been working so far is to add the image only at the beginning

Q: What are the people playing?
A: They play beach football

Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.

Page 15

Demo

Click to go to the website

Page 16

Handwriting

Click to go to the website

Page 17

Recurrent Networks

[Figure: an NN cell with an input, an output, an internal state, and recurrent connections feeding the state back into the cell.]

Page 18

Sequences

Next data depend on previous data

Roughly equivalent to predicting what comes next

Pr π‘₯ =ΰ·‘

𝑖

Pr π‘₯𝑖 π‘₯1, … , π‘₯π‘–βˆ’1)

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Page 19

Why sequences?

Parameters are reused → fewer parameters, easier modelling
Generalizes well to arbitrary lengths → not constrained to a specific length
However, often we pick a "frame" T instead of an arbitrary length

Pr(x) = ∏_i Pr(x_i | x_{i−T}, …, x_{i−1})

"I think, therefore, I am!"
"Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter."

RecurrentModel(sentence 1) ≡ RecurrentModel(sentence 2): the same model, with the same parameters, processes sequences of different lengths.
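As a toy illustration of the "frame" idea (not from the slides; the corpus and the choice T = 2 are arbitrary), a count-based model estimating Pr(x_i | x_{i−2}, x_{i−1}) in Python:

from collections import Counter, defaultdict

# A tiny frame-based model with T = 2: estimate Pr(x_i | x_{i-2}, x_{i-1}) from counts.
corpus = "everything is repeated in a circle history is a master".split()

counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1          # count word c following the frame (a, b)

def prob(word, context):
    """Pr(word | context), where context is the tuple of the last T = 2 words."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

print(prob("repeated", ("everything", "is")))   # 1.0: always observed after this frame
print(prob("a", ("is", "repeated")))            # 0.0: never observed after this frame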

Page 20

Quiz: What is really a sequence?

Data inside a sequence are … ?

I am Bond, James ___
Candidate next words: Bond, McGuire, tired, am, !

Page 21

Quiz: What is really a sequence?

Data inside a sequence are non i.i.d.

– i.i.d. = independent and identically distributed

The next "word" depends on the previous "words"

– Ideally on all of them

We need context, and we need memory!

How to model context and memory?

I am Bond, James Bond
(the correct continuation is "Bond"; the other candidates were McGuire, tired, am, !)

Page 22

π‘₯𝑖 ≑ One-hot vectors

A vector with all zeros except for the active dimension

12 words in a sequence 12 One-hot vectors

After the one-hot vectors apply an embedding

Word2Vec, GloVE

[Figure: vocabulary {I, am, Bond, ",", James, McGuire, tired, !}; each word of the sequence is mapped to a one-hot vector over this vocabulary, with a single 1 at the word's index and 0 everywhere else.]
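A minimal sketch of building such one-hot vectors with NumPy (the vocabulary order below is an assumption made only for illustration):

import numpy as np

# Hypothetical vocabulary order, for illustration only.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's index (the active dimension)."""
    x = np.zeros(len(vocab))
    x[word_to_idx[word]] = 1.0
    return x

sentence = ["I", "am", "Bond", ",", "James", "Bond", "!"]
X = np.stack([one_hot(w) for w in sentence])   # one row per word: shape (7, 8)
print(X.shape)

As the slide notes, in practice an embedding (e.g. Word2Vec or GloVe vectors) is then applied on top of these one-hot inputs.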

Page 23

Quiz: Why not just indices?

I am James McGuire

One-hot representation OR index representation?

[Figure: the words "I am James McGuire" written as one-hot vectors x_{t=1,…,4} over the vocabulary (one-hot representation), versus as plain indices (index representation):]

x_"I" = 1,  x_"am" = 2,  x_"James" = 4,  x_"McGuire" = 7

Page 24

Quiz: Why not just indices?

No, because then some words would get closer together for no good reason (an artificial bias between words)

[Figure: the words "I am James McGuire" as one-hot vectors (one-hot representation) versus as indices (index representation).]

Index representation: q_"I" = 1, q_"am" = 2, q_"James" = 4, q_"McGuire" = 7
distance(q_"I", q_"am") = 1, but distance(q_"am", q_"McGuire") = 5

One-hot representation: x_"I", x_"am", x_"James", x_"McGuire"
distance(x_"I", x_"am") = 1 and distance(x_"am", x_"McGuire") = 1

Page 25

Memory

A representation of the past

A memory must project the information at timestep t onto a latent space c_t using parameters θ

Then, re-use the projected information from t at t + 1:
c_{t+1} = h(x_{t+1}, c_t; θ)

The memory parameters θ are shared for all timesteps t = 0, …:
c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t−1}, … h(x_1, c_0; θ); θ); θ); θ)
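A toy sketch of this recursion (not from the slides): a scalar memory h with a single shared parameter θ, showing that applying the update iteratively is the same as the nested composition above.

from functools import reduce

def h(c, x, theta=0.5):
    """Toy memory update with a shared parameter theta (an exponential moving average)."""
    return theta * c + (1.0 - theta) * x

xs = [1.0, 4.0, 2.0, 8.0]   # a toy input sequence x_1, ..., x_4
c0 = 0.0                    # empty initial memory

# Iterative form: the same h (same theta) is applied at every timestep.
c = c0
for x in xs:
    c = h(c, x)

# Nested form h(h(h(h(c0, x1), x2), x3), x4) gives exactly the same value.
assert c == reduce(h, xs, c0)
print(c)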

Page 26

Memory as a Graph

Simplest model

– Input with parameters U
– Memory embedding with parameters W
– Output with parameters V

[Figure: the input x_t enters the memory c_t through U, the memory produces the output y_t through V, and W is the recurrent connection on the memory.]

Page 27

Memory as a Graph

Simplest model
– Input with parameters U
– Memory embedding with parameters W
– Output with parameters V

[Figure: the same graph over two timesteps: x_t → c_t → y_t and x_{t+1} → c_{t+1} → y_{t+1}, with W connecting c_t to c_{t+1}; the same U, V, W are used at both timesteps.]

Page 28

Memory as a Graph

Simplest model
– Input with parameters U
– Memory embedding with parameters W
– Output with parameters V

[Figure: the graph unrolled over timesteps t, t+1, t+2, t+3; the same U, V, W are applied at every timestep, with W linking the consecutive memories c_t → c_{t+1} → c_{t+2} → c_{t+3}.]

Page 29

Folding the memory

π‘₯𝑑

𝑦𝑑

π‘₯𝑑+1 π‘₯𝑑+2

𝑦𝑑+1 𝑦𝑑+2

π‘ˆ

𝑉

π‘Š

π‘ˆ

𝑐𝑑 𝑐𝑑+1 𝑐𝑑+2

π‘ˆ

𝑉

π‘Š

π‘₯𝑑

𝑦𝑑

π‘Šπ‘‰

π‘ˆ

𝑐𝑑(π‘π‘‘βˆ’1)

Unrolled/Unfolded Network Folded Network

π‘Š

Page 30

Recurrent Neural Network (RNN)

Only two equations

π‘₯𝑑

𝑦𝑑

π‘Šπ‘‰

π‘ˆ

𝑐𝑑

(π‘π‘‘βˆ’1)

𝑐𝑑 = tanh(π‘ˆ π‘₯𝑑 +π‘Šπ‘π‘‘βˆ’1)

𝑦𝑑 = softmax(𝑉 𝑐𝑑)
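A minimal NumPy sketch of these two equations, forward pass only (the sizes and random initialisation are illustrative assumptions):

import numpy as np

def softmax(z):
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev, U, W, V):
    """One RNN step: new memory c_t and output distribution y_t."""
    c_t = np.tanh(U @ x_t + W @ c_prev)     # c_t = tanh(U x_t + W c_{t-1})
    y_t = softmax(V @ c_t)                  # y_t = softmax(V c_t)
    return c_t, y_t

vocab_size, hidden = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden, vocab_size))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab_size, hidden))

c = np.zeros(hidden)                        # empty initial memory
for idx in [0, 3, 1, 4]:                    # a toy sequence of word indices
    x = np.zeros(vocab_size)
    x[idx] = 1.0                            # one-hot input
    c, y = rnn_step(x, c, U, W, V)
print(y)                                    # distribution over the next word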

Page 31

RNN Example

Vocabulary of 5 words

A memory of 3 units
– A hyperparameter that we choose, like the layer size
– c_t: [3 × 1], W: [3 × 3]

An input projection of 3 dimensions
– U: [3 × 5]

An output projection of 10 dimensions
– V: [10 × 3]

c_t = tanh(U x_t + W c_{t−1})
y_t = softmax(V c_t)

U · x_{t=4} = the 4th column of U (a one-hot input selects a single column of U)

Page 32

Rolled Network vs. MLP?

What is really different?

– Steps instead of layers

– Step parameters shared in Recurrent Network

– In a Multi-Layer Network parameters are different

[Figure: left, a 3-gram unrolled recurrent network ("Layer/Step" 1–3), each step reusing the same U, V, W and producing outputs y_1, y_2, y_3; right, a 3-layer neural network with distinct weights W_1, W_2, W_3 and a single final output y.]

Page 33

Rolled Network vs. MLP? (same content and figure as Page 32)

Page 34

Rolled Network vs. MLP?

What is really different?
– Steps instead of layers
– Step parameters shared in Recurrent Network
– In a Multi-Layer Network parameters are different
• Sometimes the intermediate outputs are not even needed
• Removing them, we almost end up with a standard Neural Network

[Figure: the 3-gram unrolled recurrent network next to the 3-layer neural network, as on Page 32.]

Page 35

Rolled Network vs. MLP?

What is really different?

– Steps instead of layers

– Step parameters shared in Recurrent Network

– In a Multi-Layer Network parameters are different

[Figure: the unrolled recurrent network with only a single final output y (intermediate outputs removed), next to the 3-layer neural network.]

Page 36

Training Recurrent Networks

Cross-entropy loss

P = ∏_{t,k} y_{t,k}^{l_{t,k}}  ⇒  ℒ = −log P = Σ_t ℒ_t = −(1/T) Σ_t l_t · log y_t

Backpropagation Through Time (BPTT)

– Again, chain rule

– Only difference: Gradients survive over time steps
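A small sketch of this summed cross-entropy loss (y_t is the softmax output at step t and the target l_t is given as an index; all names and numbers are illustrative):

import numpy as np

def sequence_cross_entropy(ys, targets):
    """L = -(1/T) * sum_t log y_t[target_t], for one-hot targets given as indices."""
    T = len(ys)
    return -sum(np.log(y[t]) for y, t in zip(ys, targets)) / T

# Toy example: 3 timesteps, vocabulary of 4 words.
ys = [np.array([0.70, 0.10, 0.10, 0.10]),
      np.array([0.20, 0.60, 0.10, 0.10]),
      np.array([0.25, 0.25, 0.25, 0.25])]
targets = [0, 1, 3]
print(sequence_cross_entropy(ys, targets))   # about 0.75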

Page 37

Training RNNs is hard

Vanishing gradients

– After a few time steps the gradients become almost 0

Exploding gradients

– After a few time steps the gradients become huge

Can't capture long-term dependencies

Page 38

Alternative formulation for RNNs

An alternative formulation to derive conclusions and intuitions:

c_t = W · tanh(c_{t−1}) + U · x_t + b
ℒ = Σ_t ℒ_t(c_t)

Page 39

Another look at the gradients

ℒ = ℒ(c_T(c_{T−1}(… c_1(x_1, c_0; W); W …); W); W)

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)

(∂ℒ/∂c_t) · (∂c_t/∂c_τ) = (∂ℒ/∂c_t) · (∂c_t/∂c_{t−1}) · (∂c_{t−1}/∂c_{t−2}) · … · (∂c_{τ+1}/∂c_τ) ≤ η^{t−τ} · (∂ℒ_t/∂c_t)

The RNN gradient is a recursive product of ∂c_t/∂c_{t−1}
The rest → short-term factors;  t ≫ τ → long-term factors

Page 40

Exploding/Vanishing gradients

πœ•β„’

πœ•π‘π‘‘=

πœ•β„’

πœ•cTβ‹…πœ•π‘π‘‡πœ•cTβˆ’1

β‹…πœ•π‘π‘‡βˆ’1πœ•cTβˆ’2

β‹… … β‹…πœ•π‘π‘‘+1πœ•c𝑐𝑑

πœ•β„’

πœ•π‘π‘‘=

πœ•β„’

πœ•cTβ‹…πœ•π‘π‘‡πœ•cTβˆ’1

β‹…πœ•π‘π‘‡βˆ’1πœ•cTβˆ’2

β‹… … β‹…πœ•π‘1πœ•c𝑐𝑑

< 1 < 1 < 1

πœ•β„’

πœ•π‘Šβ‰ͺ 1⟹ Vanishing

gradient

> 1 > 1 > 1

πœ•β„’

πœ•π‘Šβ‰« 1⟹ Exploding

gradient

Page 41

Vanishing gradients

The gradient of the error w.r.t. an intermediate cell state:

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂y_t) · (∂y_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k−1} = ∏_{t ≥ k ≥ τ} W · ∂tanh(c_{k−1})

Page 42

Vanishing gradients

The gradient of the error w.r.t. an intermediate cell state, as on Page 41:

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k−1} = ∏_{t ≥ k ≥ τ} W · ∂tanh(c_{k−1})

Long-term dependencies get exponentially smaller weights
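A tiny numerical illustration of this product (not from the slides; a scalar state is assumed): when |W · tanh'(c)| < 1, the contribution of a timestep many steps in the past decays exponentially.

import numpy as np

W = 0.9                          # a scalar "recurrent weight", for illustration
c = 0.5                          # some state value
dtanh = 1.0 - np.tanh(c) ** 2    # derivative of tanh at c, about 0.79

factor = W * dtanh               # one factor dc_k/dc_{k-1} of the product
for steps in [1, 5, 10, 50]:
    print(steps, factor ** steps)   # the product over `steps` timesteps
# The long-term terms (large `steps`) become vanishingly small.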

Page 43

Gradient clipping for exploding gradients

Scale the gradient when its norm exceeds a threshold θ₀

Pseudocode:
1. g ← ∂ℒ/∂W
2. if ‖g‖ > θ₀:  g ← (θ₀ / ‖g‖) · g
   else: do nothing

Simple, but it works!
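A minimal NumPy version of this pseudocode (a sketch; deep-learning frameworks ship equivalent utilities, e.g. norm-based gradient clipping in PyTorch):

import numpy as np

def clip_gradient(g, threshold):
    """If ||g|| exceeds the threshold, rescale g to have norm `threshold`; otherwise leave it."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])          # ||g|| = 5
print(clip_gradient(g, 1.0))      # rescaled to norm 1 -> [0.6, 0.8]
print(clip_gradient(g, 10.0))     # unchanged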

Page 44

Rescaling vanishing gradients?

Not a good solution

Weights are shared between timesteps, and the loss is summed over timesteps:

ℒ = Σ_t ℒ_t  ⟹  ∂ℒ/∂W = Σ_t ∂ℒ_t/∂W

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂c_τ) · (∂c_τ/∂W) = Σ_{τ=1..t} (∂ℒ_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)

Rescaling ∂ℒ_t/∂W for one timestep affects all timesteps
– The rescaling factor for one timestep does not work for another

Page 45

More intuitively

πœ•β„’1πœ•π‘Š

πœ•β„’2πœ•π‘Š

πœ•β„’3πœ•π‘Š

πœ•β„’4πœ•π‘Š

πœ•β„’5πœ•π‘Š

πœ•β„’

πœ•π‘Š=

πœ•β„’1

πœ•π‘Š+

πœ•β„’2

πœ•π‘Š+πœ•β„’3

πœ•π‘Š+

πœ•β„’4

πœ•π‘Š+

πœ•β„’5

πœ•π‘Š

Page 46

πœ•β„’1πœ•π‘Š

πœ•β„’4πœ•π‘Š

πœ•β„’2πœ•π‘Š πœ•β„’3

πœ•π‘Šπœ•β„’5πœ•π‘Š

More intuitively

Let’s say πœ•β„’1

πœ•π‘Šβˆ 1,

πœ•β„’2

πœ•π‘Šβˆ 1/10,

πœ•β„’3

πœ•π‘Šβˆ 1/100,

πœ•β„’4

πœ•π‘Šβˆ 1/1000,

πœ•β„’5

πœ•π‘Šβˆ 1/10000

πœ•β„’

πœ•π‘Š=

π‘Ÿ

πœ•β„’π‘Ÿπœ•π‘Š

= 1.1111

If πœ•β„’

πœ•π‘Šrescaled to 1

πœ•β„’5

πœ•π‘Šβˆ 10βˆ’5

Longer-term dependencies negligible

– Weak recurrent modelling

– Learning focuses on the short-term only

πœ•β„’

πœ•π‘Š=

πœ•β„’1

πœ•π‘Š+

πœ•β„’2

πœ•π‘Š+πœ•β„’3

πœ•π‘Š+

πœ•β„’4

πœ•π‘Š+

πœ•β„’5

πœ•π‘Š

Page 47

Recurrent networks ∝ Chaotic systems

Page 48

Fixing vanishing gradients

Regularization on the recurrent weights

– Force the error signal not to vanish

Ω = Σ_t Ω_t = Σ_t ( ‖(∂ℒ/∂c_{t+1}) · (∂c_{t+1}/∂c_t)‖ / ‖∂ℒ/∂c_{t+1}‖ − 1 )²

Advanced recurrent modules
– Long Short-Term Memory (LSTM) module
– Gated Recurrent Unit (GRU) module

Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013

Page 49

Advanced Recurrent Nets

[Figure: the LSTM cell. The input x_t and previous memory m_{t−1} feed three sigmoid gates (input i_t, forget f_t, output o_t) and a tanh candidate c̃_t; the cell state runs along the top from c_{t−1} to c_t, and the new memory m_t is the output.]

Page 50

How to fix the vanishing gradients?

The error signal over time must have a norm that is neither too large nor too small

Let's have a look at the loss function:

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂y_t) · (∂y_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k−1}

Page 51

How to fix the vanishing gradients? (same content as Page 50)

Page 52

How to fix the vanishing gradients?

The error signal over time must have a norm that is neither too large nor too small

Solution: have an activation function with gradient equal to 1

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂y_t) · (∂y_t/∂c_t) · (∂c_t/∂c_τ) · (∂c_τ/∂W)

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k−1}

– The identity function has a gradient equal to 1
By doing so, the gradients become neither too small nor too large

Page 53

Long Short-Term Memory (LSTM): a beefed-up RNN

LSTM:
i = σ(x_t U^(i) + m_{t−1} W^(i))
f = σ(x_t U^(f) + m_{t−1} W^(f))
o = σ(x_t U^(o) + m_{t−1} W^(o))
c̃_t = tanh(x_t U^(g) + m_{t−1} W^(g))
c_t = c_{t−1} ⊙ f + c̃_t ⊙ i
m_t = tanh(c_t) ⊙ o

Simple RNN:
c_t = W · tanh(c_{t−1}) + U · x_t + b

[Figure: the LSTM cell as on Page 49.]

– The previous state c_{t−1} is connected to the new c_t with no nonlinearity (identity function).
– The only other factor is the forget gate f, which rescales the previous LSTM state.

More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
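A sketch of one LSTM step following the equations above (NumPy, forward pass only; the gate matrices are collected in dicts, the random initialisation and sizes are illustrative, and biases are omitted as on the slide):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, U, W):
    """One LSTM step; U and W hold the per-gate matrices of the slide's notation."""
    i = sigmoid(x_t @ U["i"] + m_prev @ W["i"])          # input gate
    f = sigmoid(x_t @ U["f"] + m_prev @ W["f"])          # forget gate
    o = sigmoid(x_t @ U["o"] + m_prev @ W["o"])          # output gate
    c_tilde = np.tanh(x_t @ U["g"] + m_prev @ W["g"])    # candidate memory
    c_t = c_prev * f + c_tilde * i                       # new cell state
    m_t = np.tanh(c_t) * o                               # new output memory
    return m_t, c_t

d_in, d_hid = 5, 3
rng = np.random.default_rng(2)
U = {k: rng.normal(scale=0.1, size=(d_in, d_hid)) for k in "ifog"}
W = {k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in "ifog"}

m, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(4):                                       # unroll over a toy sequence
    m, c = lstm_step(rng.normal(size=d_in), m, c, U, W)
print(m, c)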

Page 54

Cell state

The cell state carries the essential information over time

[Figure: the LSTM cell with the cell-state line from c_{t−1} to c_t highlighted; LSTM equations as on Page 53.]

Page 55

LSTM nonlinearities

σ ∈ (0, 1): control gate – something like a switch
tanh ∈ (−1, 1): recurrent nonlinearity

[Figure and LSTM equations as on Page 53.]

Page 56

LSTM Step-by-Step: Step (1)

E.g. LSTM on "Yesterday she slapped me. Today she loves me."

Decide what to forget and what to remember for the new memory
– Sigmoid 1 → remember everything
– Sigmoid 0 → forget everything

[Figure and LSTM equations as on Page 53; this step uses the forget gate f_t.]

Page 57

LSTM Step-by-Step: Step (2)

Decide what new information is relevant from the new input and should be added to the new memory
– Modulate the input i_t
– Generate candidate memories c̃_t

[Figure and LSTM equations as on Page 53; this step uses the input gate i_t and the candidate c̃_t.]

Page 58

LSTM Step-by-Step: Step (3)

Compute and update the current cell state c_t
– Depends on the previous cell state
– What we decide to forget
– What inputs we allow
– The candidate memories

[Figure and LSTM equations as on Page 53; this step computes c_t = c_{t−1} ⊙ f + c̃_t ⊙ i.]

Page 59

LSTM Step-by-Step: Step (4)

Modulate the output

– Is the new cell state relevant? → Sigmoid 1
– If not → Sigmoid 0

Generate the new memory

[Figure and LSTM equations as on Page 53; this step computes m_t = tanh(c_t) ⊙ o.]

Page 60

LSTM Unrolled Network

Macroscopically very similar to standard RNNs

The engine is a bit different (more complicated)

– Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells unrolled over consecutive timesteps, each with the same internal gate structure (σ, σ, σ, tanh), chained through the cell state and memory.]

Page 61

Beyond RNN & LSTM

LSTM with peephole connections
– Gates also have access to the previous cell state c_{t−1} (not only to the memories)
– Coupled forget and input gates: c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ c̃_t
– Bi-directional recurrent networks

Gated Recurrent Units (GRU)

Deep recurrent architectures

Recursive neural networks

– Tree structured

Multiplicative interactions

Generative recurrent architectures

[Figure: a deep recurrent architecture with two stacked LSTM layers, LSTM (1) feeding LSTM (2) at every timestep.]
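Since the slide lists Gated Recurrent Units, here is a hedged sketch of one common GRU formulation (update gate z, reset gate r); this is the standard textbook form, not taken from the slides, and biases are again omitted:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, U, W):
    """One GRU step in a common formulation: interpolate between old and candidate state."""
    z = sigmoid(x_t @ U["z"] + h_prev @ W["z"])              # update gate
    r = sigmoid(x_t @ U["r"] + h_prev @ W["r"])              # reset gate
    h_tilde = np.tanh(x_t @ U["h"] + (r * h_prev) @ W["h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                  # new state h_t

d_in, d_hid = 5, 3
rng = np.random.default_rng(3)
U = {k: rng.normal(scale=0.1, size=(d_in, d_hid)) for k in "zrh"}
W = {k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in "zrh"}

h = np.zeros(d_hid)
for _ in range(4):
    h = gru_step(rng.normal(size=d_in), h, U, W)
print(h)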

Page 62

Take-away message

Recurrent Neural Networks (RNN) for sequences

Backpropagation Through Time

Vanishing and Exploding Gradients and Remedies

RNNs using Long Short-Term Memory (LSTM)

Applications of Recurrent Neural Networks

Page 63

Thank you!