Lecture 6: LSTMs


Harvard AC295 / CS287r / CSCI E-115B, Chris Tanner

Contextualized, Token-based Representations


"Stayin' alive was no jive […] but it was just a dream for the teen, who was a fiend. Started [hustlin’ at] 16, and runnin' up in [LSTM gates]."

-- Raekwon of Wu-Tang (1994)

ANNOUNCEMENTS

• HW2 coming very soon. Due in 2 weeks.

• Research Proposals are due in 9 days, Sept 30.

• Office Hours:

  • This week, my OH will be pushed back 30 min: 3:30pm – 5:30pm

  • Please reserve your coding questions for the TFs and/or EdStem, as I hold office hours solo, and debugging code can easily bottleneck the queue.

RECAP: L4

Distributed Representations: dense vectors (aka embeddings) that aim to convey the meaning of tokens:

• “word embeddings” refer to when you have type-based representations

• “contextualized embeddings” refer to when you have token-based representations

RECAP: L4

An auto-regressive LM is one that only has access to the previous tokens (and the outputs become the inputs).

Evaluation: Perplexity

A masked LM can peek ahead, too. It “masks” a word within the context (i.e., the center word).

Evaluation: downstream NLP tasks that use the learned embeddings.

Both of these can produce useful word embeddings.
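As a concrete illustration of the perplexity metric (not from the slides): perplexity is just the exponentiated average per-token cross-entropy. A minimal sketch, with made-up token probabilities:

```python
import math

# Hypothetical probabilities a language model assigned to the correct
# next word at each position of a held-out sentence (illustrative only).
token_probs = [0.20, 0.05, 0.50, 0.10]

# Average negative log-likelihood (cross-entropy, in nats) per token.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity = exp(average cross-entropy); lower is better, and it can be
# read as an effective branching factor over the vocabulary.
perplexity = math.exp(avg_nll)
print(f"avg NLL = {avg_nll:.3f}, perplexity = {perplexity:.2f}")
```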

RECAP: L5

• RNNs help capture more context while avoiding sparsity, storage, and compute issues!

• The hidden layer is what we care about. It represents the word’s “meaning”.

[Figure: a single RNN step, with input layer x_t, hidden layer, and output layer ŷ_t; W maps input to hidden, V is the recurrent hidden-to-hidden weight, and U maps hidden to output]

Outline

• Recurrent Neural Nets (RNNs)

• Long Short-Term Memory (LSTMs)

• Bi-LSTM and ELMo


Training Process

[Figure: an RNN unrolled over the input “She went to class”, with input layer x_1 … x_4, hidden layers connected by the recurrent weights V, and output layers ŷ_1 … ŷ_4. Each prediction (went? over? class? after?) is compared against the true next word to give an error term.]

At each time step, the error is the cross-entropy between the true next word and the predicted distribution:

$CE(y_t, \hat{y}_t) = -\sum_{w \in V} y_t^{(w)} \log \hat{y}_t^{(w)}$

During training, regardless of our output predictions, we feed in the correct inputs.

Our total loss is simply the average loss across all 𝑻 time steps.
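A minimal sketch of this setup in PyTorch, assuming a toy vocabulary and made-up token ids. nn.CrossEntropyLoss averages the per-time-step CE terms, which matches the “average loss across all T time steps” above:

```python
import torch
import torch.nn as nn

# Toy sizes, for illustration only.
vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)  # holds roughly the slide's W (input->hidden) and V (recurrent)
out_proj = nn.Linear(hidden_dim, vocab_size)         # plays the role of U (hidden->output)
loss_fn = nn.CrossEntropyLoss()                      # softmax + CE(y_t, y_hat_t), averaged over time steps

# "She went to class" as made-up token ids; inputs are the correct
# previous words (teacher forcing), targets are the next words.
tokens = torch.tensor([[11, 42, 7, 93, 5]])          # batch of 1
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden_states, _ = rnn(embed(inputs))                # (1, T, hidden_dim)
logits = out_proj(hidden_states)                     # (1, T, vocab_size)

loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                      # backpropagation through time
```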

Training Details

To update our weights (e.g., 𝑽), we calculate the gradient of our loss w.r.t. the repeated weight matrix (e.g., $\frac{\partial L}{\partial V}$).

Using the chain rule, we trace the derivative all the way back to the beginning, while summing the results.

[Figure: the unrolled RNN for “She went to class”, with the gradient ∂L/∂V flowing backwards from the loss at the final time step through each copy of the recurrent weights (V_3, V_2, V_1)]

Training Details

• This backpropagation through time (BPTT) process is expensive

• Instead of updating after every timestep, we tend to do so every 𝑇 steps (e.g., every sentence or paragraph)

• This isn’t equivalent to using only a window of size 𝑇 (a la n-grams) because we still have ‘infinite memory’
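A hedged sketch of that “update every T steps” idea (truncated BPTT). The model signature model(inputs, hidden) -> (logits, hidden) is an assumption for illustration; the key detail is detaching the hidden state so gradients stop at the chunk boundary while the state itself carries forward:

```python
import torch

def train_truncated_bptt(model, optimizer, loss_fn, token_ids, T=32):
    """Hypothetical training loop: backprop every T steps, but carry the
    hidden state forward (detached) so the 'infinite memory' is kept."""
    hidden = None                                    # assumed: model accepts None as the initial state
    for start in range(0, token_ids.size(1) - 1, T):
        targets = token_ids[:, start + 1:start + T + 1]
        inputs = token_ids[:, start:start + targets.size(1)]

        logits, hidden = model(inputs, hidden)       # assumed model signature
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

        optimizer.zero_grad()
        loss.backward()                              # gradients flow only within this chunk
        optimizer.step()

        # Keep the values of the hidden state, but cut the gradient graph.
        hidden = tuple(h.detach() for h in hidden) if isinstance(hidden, tuple) else hidden.detach()
```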

RNN: Generation

We can generate the most likely next event (e.g., word) by sampling from $\hat{y}$.

Continue until we generate the <EOS> symbol.

[Figure: generation unrolled step by step. Starting from <START>, the model outputs “Sorry”; each generated word is fed back in as the next input, producing “Sorry” Harry shouted, panicking …]
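A minimal sampling loop, assuming model, embed, and out_proj are the trained RNN pieces from the training sketch earlier (the names are illustrative, not from the slides):

```python
import torch

@torch.no_grad()
def generate(model, embed, out_proj, start_id, eos_id, max_len=50):
    """Sketch: feed each sampled word back in as the next input."""
    token = torch.tensor([[start_id]])               # <START>
    hidden = None
    generated = []
    for _ in range(max_len):
        h, hidden = model(embed(token), hidden)      # one RNN step
        probs = torch.softmax(out_proj(h[:, -1]), dim=-1)
        token = torch.multinomial(probs, num_samples=1)  # sample the next word from y_hat
        if token.item() == eos_id:                   # stop at <EOS>
            break
        generated.append(token.item())
    return generated
```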

RNN: Generation

NOTE: the same input (e.g., “Harry”) can easily yield different outputs, depending on the context (unlike FFNNs and n-grams).

RNN: Generation

When trained on Harry Potter text, it generates: [Figure: sample of generated Harry Potter text]

Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

RNN: Generation

When trained on recipes: [Figure: sample of a generated recipe]

Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

RNNs: Overview

RNN STRENGTHS?

• Can handle infinite-length sequences (not just a fixed window)

• Has a “memory” of the context (thanks to the hidden layer’s recurrent loop)

• Same weights used for all inputs, so positionality isn’t wonky/overwritten (like FFNN)

RNN ISSUES?

• Slow to train (BPTT)

• Due to the “infinite sequence”, gradients can easily vanish or explode

• Has trouble actually making use of long-range context


RNNs: Vanishing and Exploding Gradients

[Figure: the unrolled RNN for “She went to class”, with the gradient of the final loss L_4 flowing back through every copy of the recurrent weights (V_3, V_2, V_1)]

$\frac{\partial L_4}{\partial V_1} = \frac{\partial L_4}{\partial V_3}\,\frac{\partial V_3}{\partial V_2}\,\frac{\partial V_2}{\partial V_1}$

This long path makes it easy for the gradients to become really small or large.

If small, the far-away context will be “forgotten.”

If large, recency bias and no context.
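A tiny numeric illustration (not from the slides) of why the long path matters: backprop multiplies one Jacobian-like factor per time step, so the product either shrinks toward zero or blows up.

```python
import numpy as np

T = 50  # number of time steps the gradient must travel through
for factor in (0.9, 1.1):
    grad_scale = np.prod(np.full(T, factor))  # product of T identical factors
    print(f"factor {factor}: gradient scale after {T} steps = {grad_scale:.2e}")
# 0.9**50 is about 5e-3 (vanishes); 1.1**50 is about 1.2e+2 (explodes)
```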

Activation Functions

[Figure: plots of the Sigmoid, Tanh, and ReLU activation functions]

Derivatives of Activation Functions

[Figure: plots of the derivatives of Sigmoid, Tanh, and ReLU]

Chain Reactions

[Figure: Tanh activations for all]

Lipschitz

A real-valued function $f: \mathbb{R} \to \mathbb{R}$ is Lipschitz continuous if

$\exists\, K \text{ s.t. } \forall x_1, x_2: \; |f(x_1) - f(x_2)| \le K\,|x_1 - x_2|$

Re-worded, if $x_1 \ne x_2$:

$\frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \le K$

Gradients

We assert our Neural Net’s objective function $f$ is well-behaved and Lipschitz continuous with a constant $L$:

$|f(x) - f(y)| \le L\,\|x - y\|$

We update our parameters by $\eta \mathbf{g}$:

$\Rightarrow \; |f(x) - f(x - \eta \mathbf{g})| \le L\eta\,\|\mathbf{g}\|$

This means we will never observe a change by more than $L\eta\,\|\mathbf{g}\|$.

Gradients

We update our parameters by $\eta \mathbf{g}$ $\Rightarrow |f(x) - f(x - \eta \mathbf{g})| \le L\eta\,\|\mathbf{g}\|$

PRO:

• Limits the extent to which things can go wrong if we move in the wrong direction

CONS:

• Limits the speed of making progress

• Gradients may still become quite large and the optimizer may not converge

How can we limit the gradients from being so large?

Exploding Gradients

[Figure: exploding-gradient illustration from the sources below]

Source: https://www.deeplearningbook.org/contents/rnn.html; Pascanu et al., 2013. http://proceedings.mlr.press/v28/pascanu13.pdf

Gradient Clipping

• Ensures the norm of the gradient never exceeds the threshold 𝜏

• Only adjusts the magnitude of each gradient, not the direction (good).

• Helps w/ numerical stability of training; no general improvement in performance
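A from-scratch sketch of norm clipping, only to show the mechanics; in practice PyTorch's built-in torch.nn.utils.clip_grad_norm_ does the same thing.

```python
import torch

def clip_gradients(parameters, tau):
    """Rescale gradients so their total norm never exceeds the threshold tau."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > tau:
        scale = tau / total_norm      # shrink the magnitude, keep the direction
        for g in grads:
            g.mul_(scale)
    return total_norm
```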

Outline

• Recurrent Neural Nets (RNNs)

• Long Short-Term Memory (LSTMs)

• Bi-LSTM and ELMo


LSTM

• A type of RNN that is designed to better handle long-range dependencies

• In “vanilla” RNNs, the hidden state is perpetually being rewritten

• In addition to a traditional hidden state h, let’s have a dedicated memory cell c for long-term events. More power to relay sequence info.

LSTM

At each time step 𝑡, we have a hidden state ℎ_𝑡 and a cell state 𝑐_𝑡:

• Both are vectors of length n

• The cell state 𝑐_𝑡 stores long-term info

• At each time step 𝑡, the LSTM erases, writes, and reads information from the cell 𝑐_𝑡

• 𝑐_𝑡 never undergoes a nonlinear activation, though; it is only modified by element-wise operations (forgetting via multiplication, and addition)

• Adding two things does not modify the gradient; it simply adds the gradients

LSTM

[Figure: an LSTM step, with input x_t feeding a hidden layer that carries two recurrent signals, C_t and H_t, forward to the next step]

𝐶 and 𝐻 relay long- and short-term memory to the hidden layer, respectively. Inside the hidden layer, there are many weights.

LSTM

[Figure: the standard LSTM cell diagram, with (C_{t-1}, H_{t-1}) entering and (C_t, H_t) leaving the cell]

• some old memories are “forgotten”

• some new memories are made

• a nonlinear weighted version of the long-term memory becomes our short-term memory

• memory is written, erased, and read by three gates, which are influenced by 𝒙 and 𝒉

Diagram: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Forget Gate

Imagine the cell state currently has values related to a previous topic, but the conversation has shifted.

Input Gate

Decides which values to update (by scaling each piece of the new, incoming info).

Cell State

The cell state forgets some info and is simultaneously updated by the new info we add to it.

Hidden State

Decides what our new hidden representation will be. Based on:

• a filtered version of the cell state, and

• a weighted version of the recurrent, hidden layer
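For reference, here is the standard textbook formulation of the gates described above (this parameterization is conventional, not copied from the slides):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state: forget, then add} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state: filtered cell state}
\end{aligned}
```

Note that the cell-state update is exactly the “forget, then add” step: no squashing nonlinearity touches c_t directly.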

LSTM

It’s still possible for LSTMs to suffer from vanishing/exploding gradients, but it’s way less likely than with vanilla RNNs:

• If RNNs wish to preserve info over long contexts, they must delicately find a recurrent weight matrix 𝑊 that isn’t too large or small

• However, LSTMs have 3 separate mechanisms that adjust the flow of information (e.g., the forget gate, if turned off, will preserve all info)

The cell learns an operative time to “turn on”.

LSTM

LSTM STRENGTHS?

• Almost always outperforms vanilla RNNs

• Captures long-range dependencies shockingly well

LSTM ISSUES?

• Has more weights to learn than vanilla RNNs; thus,

• Requires a moderate amount of training data (otherwise, vanilla RNNs are better)

• Can still suffer from vanishing/exploding gradients

Sequential Modelling

IMPORTANT: If your goal isn’t to predict the next item in a sequence, and you’d rather do some other classification or regression task using the sequence, then you can:

• Train an aforementioned model (e.g., LSTM) as a language model

• Use the hidden layers that correspond to each item in your sequence

Sequential Modelling

[Figure: two unrolled RNNs side by side. Left: Language Modelling (auto-regressive), where each output ŷ_t is the next word x_{t+1} and the outputs become the inputs. Right: 1-to-1 tagging/classification (non-auto-regressive), where each input x_t gets its own prediction ŷ_t.]

Sequential Modelling

Many-to-1 classification

[Figure: an unrolled RNN over x_1 … x_4 where only the final hidden state produces an output ŷ_4, e.g., a sentiment score]

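A minimal many-to-1 sketch in PyTorch (all sizes and names are illustrative): only the final hidden state is passed to the classifier.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Many-to-1 sketch: the last hidden state summarizes the whole sequence."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.classifier(h_n[-1])               # logits from the final hidden state

# Toy usage: 4 random "sentences" of length 12.
scores = SentimentLSTM()(torch.randint(0, 1000, (4, 12)))  # (4, num_classes)
```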

This concludes the foundation in sequential representation.

Most state-of-the-art advances are based on those core RNN/LSTM ideas. But, with tens of thousands of researchers and hackers exploring deep learning, there are many tweaks that have proven useful.

(This is where things get crazy.)

Outline

• Recurrent Neural Nets (RNNs)

• Long Short-Term Memory (LSTMs)

• Bi-LSTM and ELMo


RNN Extensions: Bi-directional LSTMs

RNNs/LSTMs use the left-to-right context and sequentially process data.

If you have full access to the data at testing time, why not make use of the flow of information from right-to-left, also?

RNN Extensions: Bi-directional LSTMs

For brevity, let’s use the following schematic to represent an RNN.

[Figure: a left-to-right RNN produces hidden states h_1→ … h_4→ over x_1 … x_4, while a right-to-left RNN produces h_1← … h_4← over the same inputs]

Concatenate the hidden layers: each token’s representation is [h_t→ ; h_t←], which feeds the output layer ŷ_t.

RNN Extensions: Bi-directional LSTMs

BI-LSTM STRENGTHS?

• Usually performs at least as well as uni-directional RNNs/LSTMs

BI-LSTM ISSUES?

• Slower to train

• Only possible if access to full data is allowed
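A quick sketch of the bi-directional idea with PyTorch’s built-in LSTM; bidirectional=True runs the left-to-right and right-to-left passes and concatenates their hidden states per token (sizes are made up):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True,
                 bidirectional=True)
x = torch.randn(8, 20, 64)        # (batch, seq_len, emb_dim), random stand-in input
outputs, _ = bilstm(x)
print(outputs.shape)              # (8, 20, 256): per-token [h_forward ; h_backward]
```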

RNN Extensions: Stacked LSTMs

Hidden layers provide an abstraction (holds “meaning”). Stacking hidden layers provides increased abstractions.

[Figure: hidden layer #1 produces h_1 … h_4 over the inputs x_1 … x_4; hidden layer #2 takes those hidden states as its inputs and produces a second, more abstract set of hidden states, which feed the output layer ŷ_1 … ŷ_4]

ELMo: Stacked Bi-directional LSTMs

General Idea:

• Goal is to get highly rich embeddings for each word (unique type)

• Use both directions of context (bi-directional), with increasing abstractions (stacked)

• Linearly combine all abstract representations (hidden layers) and optimize w.r.t. a particular task (e.g., sentiment classification)

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
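A sketch of that “linearly combine all abstract representations” step, loosely following the scalar-mix formulation in Peters et al. (2018); the layer tensors below are random stand-ins for the stacked bi-LSTM outputs:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific combination: a softmax-weighted sum of the layer
    representations, scaled by a learned gamma."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps):            # list of (batch, seq_len, dim) tensors
        weights = torch.softmax(self.scalars, dim=0)
        return self.gamma * sum(w * h for w, h in zip(weights, layer_reps))

# Toy usage: pretend we have the token layer plus two bi-LSTM layers.
layers = [torch.randn(2, 5, 256) for _ in range(3)]
mixed = ScalarMix(num_layers=3)(layers)       # (2, 5, 256), fed to the downstream task
```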

ELMo: Stacked Bi-directional LSTMs

Illustration: http://jalammar.github.io/illustrated-bert/

Deep contextualized word representations. Peters et al. NAACL 2018.

ELMo: Stacked Bi-directional LSTMs

• ELMo yielded incredibly good contextualized embeddings, which produced SOTA results when applied to many NLP tasks.

• Main ELMo takeaway: given enough training data, having tons of explicit connections between your vectors is useful (the system can determine how to best use context).

ELMo Slides: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018

SUMMARY

• Distributed Representations can be:

  • Type-based (“word embeddings”)

  • Token-based (“contextualized representations/embeddings”)

• Type-based models include Bengio’s 2003 model and word2vec (2013)

• Token-based models include RNNs/LSTMs, which:

  • demonstrated profound results from 2015 onward

  • can be used for essentially any NLP task

BACKUP
