
Recurrent Neural Networks
Deep Learning – Lecture 5

Efstratios Gavves

Sequential Data

So far, all tasks assumed stationary data

Neither all data, nor all tasks are stationary though

Sequential Data: Text

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Memory needed

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i-1})

Sequential Data: Video

Quiz: Other sequential data?

Time series data

Stock exchange

Biological measurements

Climate measurements

Market analysis

Speech/Music

User behavior in websites

...

Machine Translation

The phrase in the source language is one sequence

– β€œToday the weather is good”

The phrase in the target language is also a sequence

– β€œΠŸΠΎΠ³ΠΎΠ΄Π° сСгодня Ρ…ΠΎΡ€ΠΎΡˆΠ°Ρβ€

[Figure: a chain of LSTM cells split into an Encoder and a Decoder; the encoder reads β€œToday the weather is good <EOS>”, the decoder then emits β€œΠŸΠΎΠ³ΠΎΠ΄Π° сСгодня Ρ…ΠΎΡ€ΠΎΡˆΠ°Ρ <EOS>”, feeding each produced word into the next step.]
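As a rough, self-contained sketch of this encoder-decoder pattern (not the lecture's exact model): a recurrent cell reads the source tokens, its final state initialises the decoder, and the decoder emits target tokens step by step. The plain tanh `rnn_step` below stands in for the LSTM cells in the figure; all sizes and the random token vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 8, 16, 8                      # toy sizes (illustrative)
U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
V = rng.normal(scale=0.1, size=(d_out, d_state))

def rnn_step(x, c):
    """One plain tanh recurrence step (a stand-in for the LSTM cells in the figure)."""
    return np.tanh(U @ x + W @ c)

# Encoder: read the source tokens ("Today the weather is good <EOS>") into a final state.
source = [rng.normal(size=d_in) for _ in range(6)]   # toy token vectors
c = np.zeros(d_state)
for x in source:
    c = rnn_step(x, c)

# Decoder: start from the encoder's final state and emit target tokens one by one
# (a real decoder keeps feeding back its predictions until it emits <EOS>).
y_prev = np.zeros(d_in)
for _ in range(4):
    c = rnn_step(y_prev, c)
    logits = V @ c
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over the target vocabulary
    y_prev = np.eye(d_in)[probs.argmax()]            # feed the predicted token back in
```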

Image captioning

An image is a thousand words, literally!

Pretty much the same as machine translation

Replace the encoder part with the output of a Convnet

– E.g. use Alexnet or a VGG16 network

Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image; LSTM cells then decode it into the caption word by word, e.g. β€œToday the weather is good <EOS>”.]

Demo

Click to go to the video on YouTube

Question answering

Bleeding-edge research, no real consensus
– Very interesting open research problems

Again, pretty much like machine translation

Again, the Encoder-Decoder paradigm
– Insert the question into the encoder part
– Model the answer at the decoder part

Question answering with images also
– Again, bleeding-edge research
– How/where to add the image?
– What has been working so far is to add the image only at the beginning

Q: What are the people playing?
A: They play beach football

Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.

Demo

Click to go to the website

Handwriting

Click to go to the website

Recurrent Networks

[Figure: a NN cell with Input, Output and an internal State; recurrent connections feed the state back into the cell.]

Sequences

Next data depend on previous data

Roughly equivalent to predicting what comes next

Pr π‘₯ =ΰ·‘

𝑖

Pr π‘₯𝑖 π‘₯1, … , π‘₯π‘–βˆ’1)

What about inputs that appear in

sequences,suc

h as text?Could neural

network handle such modalities?

a

Why sequences?

Parameters are reused β†’ fewer parameters, easier modelling

Generalizes well to arbitrary lengths β†’ not constrained to a specific length

However, often we pick a β€œframe” T instead of an arbitrary length

Pr π‘₯ =ΰ·‘

𝑖

Pr π‘₯𝑖 π‘₯π‘–βˆ’π‘‡ , … , π‘₯π‘–βˆ’1)

RecurrentModel(β€œI think, therefore, I am!”)
≑
RecurrentModel(β€œEverything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.”)

Quiz: What is really a sequence?

Data inside a sequence are … ?

β€œI am Bond, James ___”
Candidate next words: Bond, McGuire, tired, am, !

Quiz: What is really a sequence?

Data inside a sequence are non-i.i.d.
– i.e. not independently, identically distributed

The next β€œword” depends on the previous β€œwords”
– Ideally on all of them

We need context, and we need memory!

How to model context and memory?

Example: β€œI am Bond, James ___” β†’ Bond (not McGuire, tired, am, or !)

π‘₯𝑖 ≑ One-hot vectors

A vector with all zeros except for the active dimension

12 words in a sequence 12 One-hot vectors

After the one-hot vectors apply an embedding

Word2Vec, GloVE

[Figure: the vocabulary {I, am, Bond, β€œ,”, James, McGuire, tired, !} and the one-hot column vectors for each word of the sequence; every vector is all zeros with a single 1 at the index of its word.]
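A minimal NumPy sketch of this pipeline, using the 8-word vocabulary from the figure; the embedding matrix below is random, standing in for pretrained Word2Vec/GloVe vectors.

```python
import numpy as np

vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, size=len(vocab)):
    """Column of zeros with a single 1 at the word's index."""
    v = np.zeros(size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["I", "am", "Bond", ",", "James", "Bond"]
X = np.stack([one_hot(w) for w in sentence])      # (6, 8): one one-hot vector per word

# An embedding is a linear map applied to the one-hot vector:
# multiplying by E just selects one row of E, i.e. a table lookup.
d_embed = 3
E = np.random.default_rng(0).normal(size=(len(vocab), d_embed))  # stand-in for Word2Vec/GloVe
embedded = X @ E                                  # (6, 3): one dense vector per word
```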

Quiz: Why not just indices?

For the sentence β€œI am James McGuire”:

One-hot representation: x_{t=1,2,3,4} = the four one-hot column vectors for β€œI”, β€œam”, β€œJames”, β€œMcGuire”

OR?

Index representation: x_β€œI” = 1, x_β€œam” = 2, x_β€œJames” = 4, x_β€œMcGuire” = 7

Quiz: Why not just indices?

No, because then some words get closer together for no good reason (an artificial bias between words)

Index representation: q_β€œI” = 1, q_β€œam” = 2, q_β€œJames” = 4, q_β€œMcGuire” = 7
– distance(q_β€œI”, q_β€œam”) = 1
– distance(q_β€œam”, q_β€œMcGuire”) = 5

One-hot representation: all distinct words are equally far apart
– distance(x_β€œI”, x_β€œam”) = 1
– distance(x_β€œam”, x_β€œMcGuire”) = 1

Memory

A representation of the past

A memory must project information at timestep t onto a latent space c_t using parameters ΞΈ

Then, re-use the projected information from t at t+1:
c_{t+1} = h(x_{t+1}, c_t; ΞΈ)

Memory parameters ΞΈ are shared for all timesteps t = 0, …:
c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, … h(x_1, c_0; ΞΈ); ΞΈ); ΞΈ); ΞΈ)
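In code, this shared-parameter recursion is just a loop that folds each new input into the running state. A minimal sketch, with h instantiated as a tanh map (an assumption; the slide leaves h generic):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_U = rng.normal(scale=0.1, size=(4, 3))   # input-to-memory parameters
theta_W = rng.normal(scale=0.1, size=(4, 4))   # memory-to-memory parameters

def h(x, c):
    """One memory update c_{t+1} = h(x_{t+1}, c_t; theta); the same theta at every step."""
    return np.tanh(theta_U @ x + theta_W @ c)

c = np.zeros(4)                                # c_0
xs = [rng.normal(size=3) for _ in range(5)]    # x_1 ... x_5
for x in xs:
    c = h(x, c)                                # c_t = h(x_t, h(x_{t-1}, ... h(x_1, c_0)))
```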

Memory as a Graph

Simplest model
– Input with parameters U
– Memory embedding with parameters W
– Output with parameters V

[Figure: the memory c_t drawn as a graph node; the input x_t enters through U, the output y_t leaves through V, and the memory is carried forward to c_{t+1}, c_{t+2}, c_{t+3} through W, with the same U, V, W at every timestep.]

Folding the memory

π‘₯𝑑

𝑦𝑑

π‘₯𝑑+1 π‘₯𝑑+2

𝑦𝑑+1 𝑦𝑑+2

π‘ˆ

𝑉

π‘Š

π‘ˆ

𝑐𝑑 𝑐𝑑+1 𝑐𝑑+2

π‘ˆ

𝑉

π‘Š

π‘₯𝑑

𝑦𝑑

π‘Šπ‘‰

π‘ˆ

𝑐𝑑(π‘π‘‘βˆ’1)

Unrolled/Unfolded Network Folded Network

π‘Š

Recurrent Neural Network (RNN)

Only two equations

π‘₯𝑑

𝑦𝑑

π‘Šπ‘‰

π‘ˆ

𝑐𝑑

(π‘π‘‘βˆ’1)

𝑐𝑑 = tanh(π‘ˆ π‘₯𝑑 +π‘Šπ‘π‘‘βˆ’1)

𝑦𝑑 = softmax(𝑉 𝑐𝑑)

RNN Example

Vocabulary of 5 words

A memory of 3 units
– A hyperparameter that we choose, like the layer size
– c_t: [3 Γ— 1], W: [3 Γ— 3]

An input projection of 3 dimensions
– U: [3 Γ— 5]

An output projection of 10 dimensions
– V: [10 Γ— 3]

c_t = tanh(U x_t + W c_{t-1})
y_t = softmax(V c_t)

U Β· x_{t=4}: since x_{t=4} is a one-hot vector, this product simply selects one column of U

Rolled Network vs. MLP?

What is really different?
– Steps instead of layers
– Step parameters are shared in a Recurrent Network
– In a Multi-Layer Network the parameters are different per layer
– Sometimes the intermediate outputs are not even needed; removing them, we almost end up with a standard Neural Network

[Figure: a 3-gram unrolled Recurrent Network (the same U, V, W at every β€œlayer/step”) next to a 3-layer Neural Network (separate W1, W2, W3), both producing a final output y.]

Training Recurrent Networks

Cross-entropy loss

P = ∏_{t,k} y_{t,k}^{l_{t,k}}  β‡’  β„’ = βˆ’log P = Ξ£_t β„’_t = βˆ’(1/T) Ξ£_t l_t log y_t

Backpropagation Through Time (BPTT)

– Again, chain rule

– Only difference: Gradients survive over time steps
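As a small sketch, assuming one-hot targets l_t and a predicted distribution y_t per time step, the loss above can be computed as:

```python
import numpy as np

def sequence_cross_entropy(Y, L):
    """Y: (T, K) predicted distributions per step; L: (T, K) one-hot targets.
    Returns -(1/T) * sum_t l_t . log y_t, the per-sequence cross-entropy."""
    T = Y.shape[0]
    return -np.sum(L * np.log(Y + 1e-12)) / T

# toy check: 3 steps, 4 classes
Y = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.80, 0.05, 0.05],
              [0.25, 0.25, 0.25, 0.25]])
L = np.eye(4)[[0, 1, 2]]                  # target class per step
print(sequence_cross_entropy(Y, L))       # = -(log 0.7 + log 0.8 + log 0.25) / 3
```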

Training RNNs is hard

Vanishing gradients

– After a few time steps the gradients become almost 0

Exploding gradients

– After a few time steps the gradients become huge

Can’t capture long-term dependencies

Alternative formulation for RNNs

An alternative formulation to derive conclusions and intuitions:

c_t = W Β· tanh(c_{t-1}) + U Β· x_t + b
β„’ = Ξ£_t β„’_t(c_t)

Another look at the gradients

β„’ = β„’(c_T(c_{T-1}(… c_1(x_1, c_0; W); W); W); W)

βˆ‚β„’_t/βˆ‚W = Ξ£_{Ο„=1}^{t} (βˆ‚β„’_t/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_Ο„) (βˆ‚c_Ο„/βˆ‚W)

(βˆ‚β„’/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_Ο„) = (βˆ‚β„’/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_{t-1}) (βˆ‚c_{t-1}/βˆ‚c_{t-2}) Β· … Β· (βˆ‚c_{Ο„+1}/βˆ‚c_Ο„) ≀ Ξ·^{t-Ο„} (βˆ‚β„’_t/βˆ‚c_t)

The RNN gradient is a recursive product of βˆ‚c_t/βˆ‚c_{t-1}
– t close to Ο„ β†’ short-term factors; t ≫ Ο„ β†’ long-term factors

Exploding/Vanishing gradients

πœ•β„’

πœ•π‘π‘‘=

πœ•β„’

πœ•cTβ‹…πœ•π‘π‘‡πœ•cTβˆ’1

β‹…πœ•π‘π‘‡βˆ’1πœ•cTβˆ’2

β‹… … β‹…πœ•π‘π‘‘+1πœ•c𝑐𝑑

πœ•β„’

πœ•π‘π‘‘=

πœ•β„’

πœ•cTβ‹…πœ•π‘π‘‡πœ•cTβˆ’1

β‹…πœ•π‘π‘‡βˆ’1πœ•cTβˆ’2

β‹… … β‹…πœ•π‘1πœ•c𝑐𝑑

< 1 < 1 < 1

πœ•β„’

πœ•π‘Šβ‰ͺ 1⟹ Vanishing

gradient

> 1 > 1 > 1

πœ•β„’

πœ•π‘Šβ‰« 1⟹ Exploding

gradient

Vanishing gradients

The gradient of the error w.r.t. an intermediate cell:

βˆ‚β„’_t/βˆ‚W = Ξ£_{Ο„=1}^{t} (βˆ‚β„’_t/βˆ‚y_t) (βˆ‚y_t/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_Ο„) (βˆ‚c_Ο„/βˆ‚W)

βˆ‚c_t/βˆ‚c_Ο„ = ∏_{t β‰₯ k β‰₯ Ο„} βˆ‚c_k/βˆ‚c_{k-1} = ∏_{t β‰₯ k β‰₯ Ο„} W Β· βˆ‚tanh(c_{k-1})

Long-term dependencies get exponentially smaller weights
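The exponential decay is easy to see numerically. A tiny sketch with assumed factor norms (0.9 and 1.1 are illustrative values, not from the lecture):

```python
# Assumed norms of a single Jacobian factor ||dc_k/dc_{k-1}|| (illustrative values)
for factor in (0.9, 1.1):
    print(factor, [round(factor ** gap, 4) for gap in (1, 10, 50)])
# 0.9 -> [0.9, 0.3487, 0.0052]   : long-term contributions vanish
# 1.1 -> [1.1, 2.5937, 117.3909] : long-term contributions explode
```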

Gradient clipping for exploding gradients

Scale the gradients to a threshold ΞΈ0

Pseudocode:
1. g ← βˆ‚β„’/βˆ‚W
2. if β€–gβ€– > ΞΈ0:
       g ← (ΞΈ0 / β€–gβ€–) Β· g
   else:
       print('Do nothing')

Simple, but works!
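A runnable NumPy version of this pseudocode (ΞΈ0 is the clipping threshold; in practice g would be the gradient computed by backprop):

```python
import numpy as np

def clip_gradient(g, theta0=5.0):
    """Rescale g so that its norm never exceeds theta0; otherwise leave it unchanged."""
    norm = np.linalg.norm(g)
    if norm > theta0:
        g = (theta0 / norm) * g
    return g

g = np.array([3.0, 4.0])             # ||g|| = 5
print(clip_gradient(g, theta0=1.0))  # [0.6, 0.8], rescaled to norm 1
print(clip_gradient(g, theta0=10.0)) # unchanged
```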

Rescaling vanishing gradients?

Not a good solution

Weights are shared between timesteps; the loss is summed over timesteps:

β„’ = Ξ£_t β„’_t  ⟹  βˆ‚β„’/βˆ‚W = Ξ£_t βˆ‚β„’_t/βˆ‚W

βˆ‚β„’_t/βˆ‚W = Ξ£_{Ο„=1}^{t} (βˆ‚β„’_t/βˆ‚c_Ο„) (βˆ‚c_Ο„/βˆ‚W) = Ξ£_{Ο„=1}^{t} (βˆ‚β„’_t/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_Ο„) (βˆ‚c_Ο„/βˆ‚W)

Rescaling for one timestep (βˆ‚β„’_t/βˆ‚W) affects all timesteps
– The rescaling factor for one timestep does not work for another

More intuitively

πœ•β„’1πœ•π‘Š

πœ•β„’2πœ•π‘Š

πœ•β„’3πœ•π‘Š

πœ•β„’4πœ•π‘Š

πœ•β„’5πœ•π‘Š

πœ•β„’

πœ•π‘Š=

πœ•β„’1

πœ•π‘Š+

πœ•β„’2

πœ•π‘Š+πœ•β„’3

πœ•π‘Š+

πœ•β„’4

πœ•π‘Š+

πœ•β„’5

πœ•π‘Š

πœ•β„’1πœ•π‘Š

πœ•β„’4πœ•π‘Š

πœ•β„’2πœ•π‘Š πœ•β„’3

πœ•π‘Šπœ•β„’5πœ•π‘Š

More intuitively

Let's say βˆ‚β„’_1/βˆ‚W ∝ 1, βˆ‚β„’_2/βˆ‚W ∝ 1/10, βˆ‚β„’_3/βˆ‚W ∝ 1/100, βˆ‚β„’_4/βˆ‚W ∝ 1/1000, βˆ‚β„’_5/βˆ‚W ∝ 1/10000

βˆ‚β„’/βˆ‚W = Ξ£_r βˆ‚β„’_r/βˆ‚W = 1.1111

If βˆ‚β„’/βˆ‚W is rescaled to 1, then βˆ‚β„’_5/βˆ‚W ∝ 10^{-5}

Longer-term dependencies become negligible
– Weak recurrent modelling
– Learning focuses on the short term only

Recurrent networks ∝ chaotic systems

Fixing vanishing gradients

Regularization on the recurrent weights

– Force the error signal not to vanish

Ξ© = Ξ£_t Ξ©_t = Ξ£_t ( β€–(βˆ‚β„’/βˆ‚c_{t+1}) (βˆ‚c_{t+1}/βˆ‚c_t)β€– / β€–βˆ‚β„’/βˆ‚c_{t+1}β€– βˆ’ 1 )Β²

Advanced recurrent modules

Long-Short Term Memory module

Gated Recurrent Unit module

Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013

Advanced Recurrent Nets

[Figure: the LSTM cell – input x_t, previous memory m_{t-1} and previous cell state c_{t-1}; three sigmoid gates i_t, f_t, o_t and two tanh nonlinearities produce the new cell state c_t and the new memory/output m_t.]

How to fix the vanishing gradients?

The error signal over time must have a norm that is neither too large nor too small

Let's have a look at the gradient of the loss:

βˆ‚β„’_t/βˆ‚W = Ξ£_{Ο„=1}^{t} (βˆ‚β„’_t/βˆ‚y_t) (βˆ‚y_t/βˆ‚c_t) (βˆ‚c_t/βˆ‚c_Ο„) (βˆ‚c_Ο„/βˆ‚W)

βˆ‚c_t/βˆ‚c_Ο„ = ∏_{t β‰₯ k β‰₯ Ο„} βˆ‚c_k/βˆ‚c_{k-1}

Solution: have an activation function with gradient equal to 1
– The identity function has a gradient equal to 1

By doing so, the gradients become neither too small nor too large

Long Short-Term Memory (LSTM): a beefed-up RNN

i = Οƒ(x_t U^(i) + m_{t-1} W^(i))
f = Οƒ(x_t U^(f) + m_{t-1} W^(f))
o = Οƒ(x_t U^(o) + m_{t-1} W^(o))
cΜƒ_t = tanh(x_t U^(g) + m_{t-1} W^(g))
c_t = c_{t-1} βŠ™ f + cΜƒ_t βŠ™ i
m_t = tanh(c_t) βŠ™ o

[Figure: the LSTM cell – input x_t and previous memory m_{t-1} feed the three sigmoid gates and the tanh candidate; the cell state line carries c_{t-1} to c_t, and the output is the new memory m_t.]

More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
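A direct NumPy transcription of these six equations, as a sketch: the sizes are arbitrary, x_t and m_{t-1} are treated as row vectors so that xU + mW matches the slide's notation, and cΜƒ_t is the candidate memory.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3                                        # toy sizes (assumed)
U = {k: rng.normal(scale=0.1, size=(d_x, d_h)) for k in "ifog"}
W = {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in "ifog"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev):
    i = sigmoid(x_t @ U["i"] + m_prev @ W["i"])        # input gate
    f = sigmoid(x_t @ U["f"] + m_prev @ W["f"])        # forget gate
    o = sigmoid(x_t @ U["o"] + m_prev @ W["o"])        # output gate
    c_tilde = np.tanh(x_t @ U["g"] + m_prev @ W["g"])  # candidate memory
    c_t = c_prev * f + c_tilde * i                     # new cell state
    m_t = np.tanh(c_t) * o                             # new memory / output
    return m_t, c_t

m, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):                    # unroll over 5 time steps
    m, c = lstm_step(x, m, c)
```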

𝑐𝑑 = π‘Š β‹… tanh(π‘π‘‘βˆ’1) + π‘ˆ β‹… π‘₯𝑑 + 𝑏

LSTM

Simple

RNN

- The previous state π‘π‘‘βˆ’1 is connected to new

𝑐𝑑 with no nonlinearity (identity function).

- The only other factor is the forget gate 𝑓which rescales the previous LSTM state.

Cell state

The cell state carries the essential information over time

(LSTM equations as above.)

[Figure: the LSTM cell with the cell state line highlighted – the horizontal path that carries c_{t-1} across the cell to c_t, modified only by the forget gate and the gated candidate.]

LSTM nonlinearities

Οƒ ∈ (0, 1): control gate – something like a switch

tanh ∈ (βˆ’1, 1): the recurrent nonlinearity

(LSTM equations and cell diagram as above.)

LSTM Step-by-Step: Step (1)

E.g. an LSTM on β€œYesterday she slapped me. Today she loves me.”

Decide what to forget and what to remember for the new memory
– Sigmoid β‰ˆ 1 β†’ remember everything
– Sigmoid β‰ˆ 0 β†’ forget everything

f_t = Οƒ(x_t U^(f) + m_{t-1} W^(f))

LSTM Step-by-Step: Step (2)

Decide what new information is relevant from the new input and should be added to the new memory
– Modulate the input: i_t
– Generate candidate memories: cΜƒ_t

i_t = Οƒ(x_t U^(i) + m_{t-1} W^(i))
cΜƒ_t = tanh(x_t U^(g) + m_{t-1} W^(g))

LSTM Step-by-Step: Step (3)

Compute and update the current cell state c_t
– Depends on the previous cell state
– What we decide to forget
– What inputs we allow
– The candidate memories

c_t = c_{t-1} βŠ™ f_t + cΜƒ_t βŠ™ i_t

LSTM Step-by-Step: Step (4)

Modulate the output
– Is the new cell state relevant? β†’ sigmoid β‰ˆ 1
– If not β†’ sigmoid β‰ˆ 0

Generate the new memory

o_t = Οƒ(x_t U^(o) + m_{t-1} W^(o))
m_t = tanh(c_t) βŠ™ o_t

LSTM Unrolled Network

Macroscopically very similar to standard RNNs

The engine is a bit different (more complicated)

– Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells unrolled over time, each passing its cell state and memory to the next.]

Beyond RNN & LSTM

LSTM with peephole connections

– Gates also have access to the previous cell state c_{t-1} (not only to the memories)
– Coupled forget and input gates: c_t = f_t βŠ™ c_{t-1} + (1 βˆ’ f_t) βŠ™ cΜƒ_t
– Bi-directional recurrent networks

Gated Recurrent Units (GRU)

Deep recurrent architectures

Recursive neural networks

– Tree structured

Multiplicative interactions

Generative recurrent architectures

[Figure: a deep recurrent architecture – a stack of LSTM (1) feeding LSTM (2), unrolled over three time steps.]

Take-away message

Recurrent Neural Networks (RNN) for sequences

Backpropagation Through Time

Vanishing and Exploding Gradients and Remedies

RNNs using Long Short-Term Memory (LSTM)

Applications of Recurrent Neural Networks

Thank you!