Transcript of CIS 522: Lecture 12 - Penn Engineering

Page 1:

CIS 522: Lecture 12

RNNs and LSTMs
2/25/2020

Page 2:

Feedback

HW2 took a mean of 25 hours. We will somewhat shorten it next year.
Some did it in under 5 hours. How?
HW3 is, as far as I know, the last long homework.

We continue to get “more code”, “more examples”, and “more theory” requests. Next year there will be fewer guest lectures.

Page 3:

Exam + Projects

Page 4:

Today's Agenda
● Some neuroscience background
● Review of NLP and variable-length problems
● RNNs
○ Backpropagation through time (BPTT)
● LSTMs
○ Motivation
○ Forget gates
● BiLSTMs
○ Motivation
○ Architecture

Page 5:

Neuroscience: The ubiquity of memory across timescales

[Figure: memory mechanisms across timescales, including PPF (Zhang et al. 2015), Ca2+ spikes (Larkum), recurrent activity, and plasticity]

Page 6:

Working memory
Pencil, Automobile, Evil

Page 7:

Working memory
Graphic, Element, Mediocre, Steel, Plate, Nociception, Memory

Page 8:

Working memory
Random, Bottle, Phone, Overeager, Card, Plant, Chimney, Tree, House, Roof

Page 9:

A multitude of memory elements
Working memory
Episodic memory
Procedural memory
Declarative memory

Page 10:

Hippocampus

Page 11:

Thoughts have persistence over time. How can we capture this with an architecture?

Page 12:

The API for feedforward nets is too restrictive.

Page 13:

PollEV: How does the number of parameters scale as we increase the relevant time horizon of a fully connected network?
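For intuition, here is a small sketch of my own (sizes are assumptions, not from the slides): with a fully connected first layer over the whole input window of T timesteps, D features per step, and H hidden units, that layer alone needs T*D*H weights, so parameters grow linearly with the horizon and T must be fixed in advance.

import torch.nn as nn

D, H = 50, 128                                     # assumed per-timestep input size and hidden width
for T in [10, 100, 1000]:
    layer = nn.Linear(T * D, H)                    # first fully connected layer over the whole window
    n_params = sum(p.numel() for p in layer.parameters())
    print(T, n_params)                             # 64128, 640128, 6400128: linear growth in T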

Page 14:

What if we use a deep convnet?
Fixed stride
Fixed filter size on each layer

Does it help?
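One way to see the limitation (a sketch of my own, assuming filter size k and stride 1): the parameter count no longer depends on sequence length, but the receptive field grows only linearly with depth, so covering a long horizon needs a very deep stack.

k = 3                                              # assumed filter size on every layer, stride 1
for num_layers in [2, 10, 100]:
    receptive_field = 1 + num_layers * (k - 1)     # how many timesteps the top unit can see
    print(num_layers, receptive_field)             # 5, 21, 201

Larger strides or dilation grow the receptive field faster, but the maximum horizon is still fixed by the architecture.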

Page 15:

RNNs: We'd like to have a notion of time and maintain state.

Page 16:

We can't learn a new function for each timestep! How do we simplify our architecture?

Page 17:

We can't learn a new function for each timestep! How do we simplify our architecture?

The way we think doesn't change from moment to moment.

Page 18:

We share weights across time.

Page 19:

Recurrent neural networks (RNNs)
● Prior: the function modifying the state of a thought is invariant to temporal shifts.

Page 20:

PollEV: how many parameters?

Page 21:

RNNs typically have (at least) 3 weight tensors (U, W, V) and 2 biases.
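As a concrete reading of that count, here is a from-scratch sketch of one vanilla RNN step. Which matrix is called U, W, or V is my assumption, but the shapes show where the parameters live: U (hidden x input), W (hidden x hidden), V (output x hidden), plus the two biases b and c, all shared across timesteps.

import torch

D, H, O = 10, 20, 5                                # assumed input, hidden, and output sizes
U = torch.randn(H, D)                              # input-to-hidden
W = torch.randn(H, H)                              # hidden-to-hidden (the recurrence)
V = torch.randn(O, H)                              # hidden-to-output
b, c = torch.zeros(H), torch.zeros(O)              # the two biases

def rnn_step(x_t, h_prev):
    h_t = torch.tanh(U @ x_t + W @ h_prev + b)     # new hidden state
    y_t = V @ h_t + c                              # output at this timestep
    return h_t, y_t

h = torch.zeros(H)
for x_t in torch.randn(7, D):                      # the same five tensors are reused at every step
    h, y = rnn_step(x_t, h)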

Page 22:

Activation functions
Usually tanh
Want to keep activations from exploding

Page 23:

What does the compute tree look like for an RNN?

Page 24:

What do we want to produce?
A vector (see last lecture)
Text output
Text translations
Video translations
Pose tracking
etc.

Page 25:

Sequence-to-Sequence (Seq2Seq) Learning

Page 26:

Example application

Source: Towards Data Science

Page 27:

The exact unrolled architecture depends on the learning problem.

Variable-length input, single output at the end.

Variable-length input, simultaneous outputs.
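A minimal PyTorch sketch of both unrollings (sizes are assumptions): the same recurrent module can yield one output at the end of the sequence or an output at every timestep.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 5)

x = torch.randn(32, 15, 10)          # 32 sequences, 15 timesteps, 10 features each
out, h_n = rnn(x)                    # out: (32, 15, 20), h_n: (1, 32, 20)

y_single = head(h_n[-1])             # variable-length input, single output at the end: (32, 5)
y_per_step = head(out)               # variable-length input, simultaneous outputs: (32, 15, 5)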

Page 28:

Teacher forcing helps learn complex patterns concurrently.
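A minimal sketch of teacher forcing for a toy decoder (all names and sizes here are illustrative assumptions, not the lecture's code): at each step the ground-truth previous token is fed in, rather than the model's own prediction, so later timesteps can be learned before the earlier ones are solved.

import torch
import torch.nn as nn

vocab, hidden = 30, 64
embed = nn.Embedding(vocab, hidden)
cell = nn.RNNCell(hidden, hidden)
out = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

target = torch.randint(0, vocab, (8, 12))          # a batch of 8 target sequences, 12 tokens each
h = torch.zeros(8, hidden)
loss = 0.0
for t in range(1, target.size(1)):
    inp = embed(target[:, t - 1])                  # feed the ground-truth previous token, not the model's guess
    h = cell(inp, h)
    loss = loss + loss_fn(out(h), target[:, t])    # predict the current token
loss.backward()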

Page 29:

Backpropagation through time (BPTT)

Page 30:

RNNs are awful at learning long-term patterns.

Page 31:

RNNs are awful at learning long-term patterns.

What will the gradients do?
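A small numerical illustration (my own sketch, not from the slides): the gradient of the final state with respect to the first input is a product of roughly T Jacobians, so it tends to shrink toward zero (or blow up) as the gap T grows.

import torch

H = 20
W = torch.randn(H, H) * 0.1          # recurrent weights, scaled so the map is contractive (an assumption)
U = torch.randn(H, H) * 0.1          # input weights

for T in [5, 20, 80]:
    x0 = torch.randn(H, requires_grad=True)
    h = torch.tanh(U @ x0)                          # first step uses the input we care about
    for _ in range(T - 1):
        h = torch.tanh(W @ h + U @ torch.randn(H))  # T - 1 further steps of irrelevant input
    h.sum().backward()
    print(T, x0.grad.norm().item())                 # shrinks rapidly with T; larger weights can explode instead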

Page 32:

Which properties do we want memory in an RNN to have?
● Store information for arbitrary duration
● Be resistant to noise
● Be trainable

Page 33:

A hard task for this
A relevant sequence (length L)
Followed by an irrelevant sequence (length T >> L)
Answer at the end
Intuition: figure out the relevant information, then keep it for T steps.
Trivial solution using only one tanh neuron: latch to +/- 1

Page 34:

Dynamics and failure

Page 35:

Let us think about the gradients.
We want the irrelevant input to have little effect.
So the input has little influence on the stored state.
What does that mean for the gradient?

Page 36:

LSTMs

Leaning on Colah’s blog: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 37:

RNN

Page 38:

LSTM

Page 39:

Stepping through the LSTM: delete if it makes sense
Say the cell should store whether the last image I saw was a puppy.

Step 1: forget if there is a new subject

Page 40:

Write a new value if it makes sense.

The tanh roughly binarizes the to-be-stored values, making them even more resilient.

Step 2: write if there is a new subject

Page 41:

Commit to memory

Step 3: commit to hidden state

Page 42:

Get output hidden state
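Putting the steps together, here is a minimal from-scratch sketch of one LSTM cell update (variable names and sizes are my own; in practice PyTorch's nn.LSTM does all of this for you):

import torch

D, H = 10, 20                                      # assumed input and state sizes
Wf, Wi, Wc, Wo = (torch.randn(H, D + H) for _ in range(4))
bf, bi, bc, bo = (torch.zeros(H) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t])                   # all gates look at [h_{t-1}, x_t]
    f = torch.sigmoid(Wf @ z + bf)                 # step 1: forget gate (delete if it makes sense)
    i = torch.sigmoid(Wi @ z + bi)                 # step 2: write gate
    c_tilde = torch.tanh(Wc @ z + bc)              #         candidate values to store
    c = f * c_prev + i * c_tilde                   # step 3: commit to the cell memory
    o = torch.sigmoid(Wo @ z + bo)                 # output gate
    h = o * torch.tanh(c)                          # exposed hidden state
    return h, c

h, c = torch.zeros(H), torch.zeros(H)
for x_t in torch.randn(6, D):                      # a length-6 toy input sequence
    h, c = lstm_step(x_t, h, c)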

Page 43:

Variant: Look at memory

Why not also directly see the memory?

Gers and Schmidhuber 2000

Page 44:

Variant: Couple read and write

Couple delete and write, but then we cannot do a simple decay anymore!
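In that coupled variant the cell update becomes c_t = f_t * c_{t-1} + (1 - f_t) * c̃_t (a standard formulation, following Colah's blog, not shown on the slide): we only write where we forget, so the memory can no longer simply be decayed without new content being written in its place.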

Page 45:

Gated Recurrent Unit
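A matching from-scratch sketch of one GRU step (standard formulation as in Colah's blog; names and sizes are my assumptions): the GRU merges the cell and hidden state and uses just an update gate z and a reset gate r.

import torch

D, H = 10, 20                                      # assumed sizes
Wz, Wr, Wh = (torch.randn(H, D + H) for _ in range(3))

def gru_step(x_t, h_prev):
    z = torch.sigmoid(Wz @ torch.cat([h_prev, x_t]))            # update gate
    r = torch.sigmoid(Wr @ torch.cat([h_prev, x_t]))            # reset gate
    h_tilde = torch.tanh(Wh @ torch.cat([r * h_prev, x_t]))     # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # single merged state, no separate cell

h = torch.zeros(H)
for x_t in torch.randn(6, D):
    h = gru_step(x_t, h)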

Page 46:

Success stories of LSTMs
● SOTA in precipitation nowcasting (2016)
● SOTA in named entity recognition (2016)
● Neural decoding (spykesML)
● Countless others

Above all: easy to train. Crazy useful for just about anything with time.

Page 47:

BiLSTMs

Page 48:

Architecture of a biLSTM

Page 49:

BiLSTM performance
● In practice, performs better and trains faster than a vanilla LSTM
● Shorter paths to much of the information; a shorter path means better gradients
● There is speculation as to why exactly it is better (e.g., it ensures past and future data are equally weighted), but no one really knows.

Page 50:

Success stories
● SOTA in POS tagging, 2015 (Bidirectional LSTM-CRF Models for Sequence Tagging)
● SOTA in speech recognition, 2013

Above all: it is reasonably fast to train. Workhorse for countless research applications.

Page 51:

InferSent
● Achieved good results in:
○ Semantic Textual Similarity
○ Paraphrase detection
○ Caption-image retrieval
● Outperforms newer, more sophisticated models like BERT in tasks such as Semantic Textual Similarity

Page 52:

InferSent architecture
[Figure: model architecture]

Page 53:

Fine-tuning InferSent for Natural Language Inference

Classes: entailment, contradiction, and neutral

Page 54:

Embeddings from Language Models (ELMo)
• A BiLSTM developed in 2018 that generates word embeddings based on the context a word appears in
• Achieved state of the art in many NLP tasks, including:
– Question Answering
– Sentiment Analysis (e.g., for movie reviews)
• Has since been surpassed by BERT-like architectures, which we will cover soon!

Page 55:

ELMo architecture

Page 56:

What we learned today
● Discussed modern NLP for variable-length problems
● RNNs
○ Architecture
○ Backpropagation through time (BPTT)
● LSTMs
○ Motivation
○ Forget gates
● BiLSTMs
○ Motivation
○ Architecture

Page 57:

PyTorch implementation

Page 58:

LSTM
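The slide's own code is not preserved in this transcript; a minimal usage sketch of PyTorch's built-in LSTM (sizes and the classification head are assumptions) would look like this:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
head = nn.Linear(20, 5)                            # e.g. a 5-way classifier on the final state

x = torch.randn(32, 15, 10)                        # 32 sequences, 15 timesteps, 10 features
out, (h_n, c_n) = lstm(x)                          # out: (32, 15, 20); h_n, c_n: (1, 32, 20)
logits = head(h_n[-1])                             # single output per sequence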

Page 59:

biLSTM
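Again, the original code is not in the transcript; a sketch of the bidirectional case (only the bidirectional=True flag and the doubled feature size change, other details are assumptions):

import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True, bidirectional=True)
head = nn.Linear(2 * 20, 5)                        # forward and backward states are concatenated

x = torch.randn(32, 15, 10)
out, (h_n, c_n) = bilstm(x)                        # out: (32, 15, 40); h_n: (2, 32, 20)
final = torch.cat([h_n[0], h_n[1]], dim=1)         # last forward state + last backward state
logits = head(final)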
