Lecture 9: RNNs (part 2) + Applications
Topics in AI (CPSC 532S): Multimodal Learning with Vision, Language and Sound

Transcript of lsigal/532S_2018W2/Lecture9.pdf (2019-02-11)

Page 1:

Lecture 9: RNNs (part 2) + Applications

Topics in AI (CPSC 532S): Multimodal Learning with Vision, Language and Sound

Page 2:

BackProp Through Time

Loss

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
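For reference, here is a minimal PyTorch-style sketch of full backprop through time: the loss is accumulated over the whole sequence in one forward pass, and a single backward pass then propagates gradients through every time step. The model, data, and hyperparameters are illustrative placeholders, not the course's setup.

```python
# Minimal sketch of full backprop-through-time (PyTorch), assuming a toy
# next-step prediction task; all names (rnn, readout, seq, targets) are illustrative.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 8)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

seq = torch.randn(4, 100, 8)        # (batch, time, features)
targets = torch.randn(4, 100, 8)

# Forward through the ENTIRE sequence to compute the loss...
outputs, h_n = rnn(seq)             # outputs: (batch, time, hidden)
loss = criterion(readout(outputs), targets)

# ...then backward through the entire sequence to compute the gradient.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```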

Page 3:

Truncated BackProp Through Time

Loss

Run forward and backward through (fixed-length) chunks of the sequence, instead of the whole sequence

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 4:

Truncated BackProp Through Time
Run forward and backward through (fixed-length) chunks of the sequence, instead of the whole sequence

Loss

Carry hidden states forward in time, but only backpropagate through some smaller number of steps

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
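A minimal sketch of truncated BPTT under the same illustrative assumptions: the sequence is processed in fixed-length chunks, the hidden state is carried across chunk boundaries, and detaching it stops gradients from flowing further back than one chunk.

```python
# Minimal sketch of truncated BPTT (PyTorch): the hidden state is carried
# forward across chunks, but gradients only flow within each chunk because
# the state is detached at chunk boundaries. Names and sizes are illustrative.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 8)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

seq = torch.randn(4, 1000, 8)
targets = torch.randn(4, 1000, 8)
chunk_len = 50
h = None  # initial hidden state

for t in range(0, seq.size(1), chunk_len):
    x_chunk = seq[:, t:t + chunk_len]
    y_chunk = targets[:, t:t + chunk_len]

    out, h = rnn(x_chunk, h)
    loss = criterion(readout(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()          # backprop only through this chunk
    optimizer.step()

    h = h.detach()           # carry the state forward, but cut the gradient path
```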

Page 5:

Truncated BackProp Through Time
Run forward and backward through (fixed-length) chunks of the sequence, instead of the whole sequence

Loss

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 6:

Implementation: Relatively Easy

… you will have a chance to experience this in Assignment 3

Page 7:

Learning to Write Like Shakespeare

[Figure: x → RNN → y]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 8:

Learning to Write Like Shakespeare … after training a bit

[Figure: generated text samples at successive training stages ("at first", then after training more and more)]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 9:

Learning to Write Like Shakespeare … after training

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 10:

Learning Code
Trained on the entire source code of the Linux kernel

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
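For concreteness, a hypothetical sketch of how such text samples are generated with a character-level recurrent model (in the spirit of char-rnn): feed a prefix, then repeatedly sample the next character from the softmax and feed it back in. The vocabulary, model sizes, and temperature are assumptions, and an untrained model would of course produce the kind of gibberish seen "at first".

```python
# Illustrative character-level sampling loop (PyTorch); not the exact setup from the slides.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = list("abcdefghijklmnopqrstuvwxyz \n")
stoi = {c: i for i, c in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 32)
rnn = nn.LSTM(32, 128, batch_first=True)
head = nn.Linear(128, len(vocab))

def sample(prefix="the ", length=200, temperature=1.0):
    """Run the prefix through the model, then repeatedly sample the next character."""
    idx = torch.tensor([[stoi[c] for c in prefix]])
    out, state = rnn(embed(idx), None)           # encode the prefix
    chars = list(prefix)
    logits = head(out[:, -1])                    # prediction for the next character
    for _ in range(length):
        probs = F.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # sample one character id
        chars.append(vocab[nxt.item()])
        out, state = rnn(embed(nxt), state)      # feed the sampled character back in
        logits = head(out[:, -1])
    return "".join(chars)
```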

Page 11:

DopeLearning: Computational Approach to Rap Lyrics

[ Malmi et al., KDD 2016 ]

Page 12:

Sunspring: First movie generated by AI

Page 13:

Multilayer RNNs

[Figure: RNN cells unrolled over time (horizontal axis) and stacked in depth (vertical axis)]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
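A minimal sketch of a stacked (multilayer) RNN in PyTorch, with illustrative sizes: depth corresponds to the num_layers argument, and the per-time-step output comes from the top layer.

```python
# Minimal multilayer (stacked) LSTM sketch; sizes are illustrative.
import torch
import torch.nn as nn

deep_rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=3, batch_first=True)
x = torch.randn(2, 50, 8)            # (batch, time, features)
out, (h_n, c_n) = deep_rnn(x)
print(out.shape)   # (2, 50, 16): outputs of the top layer at every time step
print(h_n.shape)   # (3, 2, 16):  final hidden state for each of the 3 layers
```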

Page 14:

Vanilla RNN Gradient Flow

[Figure: a single vanilla RNN cell: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to give h_t]

[ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 15:

Vanilla RNN Gradient Flow

[Figure: a single vanilla RNN cell: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to give h_t]

[ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 16:

Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

[Figure: unrolled vanilla RNN: h_0, x_1 → stack → ×W → tanh → h_1, then similarly through x_2, x_3, x_4 to h_2, h_3, h_4]

Computing gradient of h0 involves many factors of W (and repeated tanh)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 17:

Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

[Figure: unrolled vanilla RNN: h_0, x_1 → stack → ×W → tanh → h_1, then similarly through x_2, x_3, x_4 to h_2, h_3, h_4]

Largest singular value > 1: Exploding gradients

Largest singular value < 1: Vanishing gradients

Computing gradient of h0 involves many factors of W (and repeated tanh)

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
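A small numerical illustration (NumPy) of the statement above, under simplifying assumptions: the tanh nonlinearity is ignored and W is taken to be a scaled orthogonal matrix, so that its largest singular value alone controls how fast the backpropagated gradient grows or shrinks.

```python
# Toy demo of exploding / vanishing gradients from repeated multiplication by W_hh^T.
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_after(T, sigma_max):
    # Random orthogonal matrix scaled so its largest singular value is sigma_max
    # (for an orthogonal matrix all singular values are equal, which keeps the demo clean).
    Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
    W = sigma_max * Q
    g = np.ones(16)                     # stand-in for dL/dh_T
    for _ in range(T):
        g = W.T @ g                     # one backprop step h_t -> h_{t-1} (tanh ignored)
    return np.linalg.norm(g)

print(grad_norm_after(T=50, sigma_max=1.1))   # grows like 1.1**50  (exploding)
print(grad_norm_after(T=50, sigma_max=0.9))   # shrinks like 0.9**50 (vanishing)
```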

Page 18:

Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

[Figure: unrolled vanilla RNN: h_0, x_1 → stack → ×W → tanh → h_1, then similarly through x_2, x_3, x_4 to h_2, h_3, h_4]

Largest singular value > 1: Exploding gradients

Largest singular value < 1: Vanishing gradients

Computing gradient of h0 involves many factors of W (and repeated tanh)

Gradient clipping: Scale gradient if its norm is too big

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
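A minimal sketch of gradient clipping in PyTorch: after backward(), the global gradient norm is rescaled if it exceeds a threshold, then the optimizer step is taken. The model, loss, and threshold value are illustrative.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(4, 100, 8)
out, _ = model(x)
loss = out.pow(2).mean()          # toy loss

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # scale if norm > 5
optimizer.step()
```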

Page 19:

Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]

[ Pascanu et al., ICML 2013 ]

[Figure: unrolled vanilla RNN: h_0, x_1 → stack → ×W → tanh → h_1, then similarly through x_2, x_3, x_4 to h_2, h_3, h_4]

Largest singular value > 1: Exploding gradients

Largest singular value < 1: Vanishing gradients

Computing gradient of h0 involves many factors of W (and repeated tanh)

For vanishing gradients: change the RNN architecture

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 20:

Long Short-Term Memory (LSTM)

Vanilla RNN vs. LSTM

[ Hochreiter and Schmidhuber, NC 1997 ]

* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford

Page 21:

Long Short-Term Memory (LSTM)

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 22:

Long Short-Term Memory (LSTM)

Cell state / memory

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 23:

LSTM Intuition: Forget Gate

Should we continue to remember this “bit” of information or not?

Intuition: the memory and the forget-gate output multiply elementwise; the output of the forget gate can be thought of as binary (0 or 1)

anything × 0 = 0 (forget)
anything × 1 = anything (remember)

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 24:

LSTM Intuition: Input Gate

Should we update this “bit” of information or not? If yes, then what should we remember?

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 25:

LSTM Intuition: Memory Update

Forget what needs to be forgotten + memorize what needs to be remembered

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 26:

LSTM Intuition: Output Gate

Should we output this bit of information (e.g., to “deeper” LSTM layers)?

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra
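Putting the four gates together, here is a from-scratch NumPy sketch of a single LSTM step that mirrors the intuitions above. Weight shapes and initialization are illustrative; in practice one would use a learned, library-provided cell (e.g., torch.nn.LSTMCell).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: (4*H, H + D) stacked gate weights, b: (4*H,). Illustrative layout."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*H:1*H])          # forget gate: keep or drop each memory "bit"
    i = sigmoid(z[1*H:2*H])          # input gate: should we write to this bit?
    g = np.tanh(z[2*H:3*H])          # candidate values to write
    o = sigmoid(z[3*H:4*H])          # output gate: expose this bit to the output?
    c_t = f * c_prev + i * g         # memory update: forget + memorize (additive)
    h_t = o * np.tanh(c_t)           # hidden output
    return h_t, c_t

D, H = 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```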

Page 27:

LSTM Intuition: Additive Updates
Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f, with no matrix multiply by W

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra
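Why this makes the updates "additive" for gradients: under the memory update c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, and considering only the direct cell-state path (ignoring the gates' dependence on h_{t-1}),

\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t), \qquad \frac{\partial c_T}{\partial c_0} = \operatorname{diag}\!\Big(\prod_{t=1}^{T} f_t\Big)

i.e., a product of elementwise gate values rather than repeated matrix multiplications by W (and repeated tanh).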

Page 28:

Uninterrupted gradient flow!

LSTM Intuition: Additive Updates

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 29:

Uninterrupted gradient flow!

LSTM Intuition: Additive Updates

Similar to ResNet
[Figure: ResNet-style CNN stack (7x7 conv 64 /2, pool, repeated 3x3 conv 64 and 3x3 conv 128 blocks, pool, FC-1000, softmax), whose additive skip connections give a similarly uninterrupted gradient path]

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 30:

LSTM Variants: with Peephole Connections

Lets gates see the cell state / memory

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 31:

LSTM Variants: with Coupled Gates

Only memorize new information when you’re forgetting old

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra

Page 32:

Gated Recurrent Unit (GRU)

No explicit memory; memory = hidden output

z = memorize new and forget old

Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra
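A minimal NumPy sketch of one GRU step matching the description above (following the convention in Olah's post; some papers swap the roles of z and 1 − z). Weight shapes are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    z = sigmoid(Wz @ np.concatenate([h_prev, x_t]) + bz)             # update gate
    r = sigmoid(Wr @ np.concatenate([h_prev, x_t]) + br)             # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)   # candidate state
    return (1 - z) * h_prev + z * h_tilde    # z trades off keeping old vs. writing new

D, H = 8, 16
rng = np.random.default_rng(0)
p = lambda *s: rng.standard_normal(s) * 0.1
h = gru_step(rng.standard_normal(D), np.zeros(H),
             p(H, H + D), p(H, H + D), p(H, H + D),
             np.zeros(H), np.zeros(H), np.zeros(H))
```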

Page 33:

RNNs: Review

Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!

Page 34:

RNNs: Review

[Figure: x → RNN → y]

Vanilla RNN

Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!

Page 35:

RNNs: Review

[Figure: x → RNN → y]

Vanishing or Exploding Gradients

Vanilla RNN

Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!

Page 36:

RNNs: Review

[Figure: x → RNN → y]

Vanishing or Exploding Gradients

Uninterrupted gradient flow!

Vanilla RNN vs. Long Short-Term Memory (LSTM)

Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!

Page 37:

RNNs: Review

Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!

[Figure: x → RNN → y]

Vanishing or Exploding Gradients

Uninterrupted gradient flow!

Vanilla RNN vs. Long Short-Term Memory (LSTM)

Loss functions: often cross-entropy (for classification); could be max-margin (as in an SVM) or squared loss (for regression)

Page 38:

LSTM/RNN Challenges

— LSTM can remember some history, but not too long
— LSTM assumes data is regularly sampled

Page 39:

Phased LSTM

Gates are controlled by phased (periodic) oscillations

[ Neil et al., 2016 ]

Page 40:

Bi-directional RNNs/LSTMs

(Recall the uni-directional RNN update:)

h_t = f_W(h_{t-1}, x_t)
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
y_t = W_{hy} h_t + b_y

Page 41:

Bi-directional RNNs/LSTMs

\overrightarrow{h}_t = f_{\overrightarrow{W}}(\overrightarrow{h}_{t-1}, x_t)
\overleftarrow{h}_t = f_{\overleftarrow{W}}(\overleftarrow{h}_{t+1}, x_t)

\overrightarrow{h}_t = \tanh(\overrightarrow{W}_{hh}\,\overrightarrow{h}_{t-1} + \overrightarrow{W}_{xh}\, x_t + \overrightarrow{b}_h)
\overleftarrow{h}_t = \tanh(\overleftarrow{W}_{hh}\,\overleftarrow{h}_{t+1} + \overleftarrow{W}_{xh}\, x_t + \overleftarrow{b}_h)

y_t = W_{hy}\,[\overrightarrow{h}_t,\,\overleftarrow{h}_t]^T + b_y
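A minimal PyTorch sketch of a bi-directional recurrent encoder, with illustrative sizes: one direction reads the sequence left-to-right, the other right-to-left, and their per-time-step hidden states are concatenated before the output projection W_hy.

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
readout = nn.Linear(2 * 16, 5)        # W_hy applied to [forward_h_t, backward_h_t]

x = torch.randn(3, 40, 8)             # (batch, time, features)
h_both, _ = birnn(x)                  # (3, 40, 32): forward and backward states concatenated
y = readout(h_both)                   # (3, 40, 5): per-time-step outputs
```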

Page 42:

Attention Mechanisms and RNNs
Consider a translation task: this is one of the first neural translation models

[Figure: English Encoder → Summary Vector → French Decoder]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Page 43:

Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language

[Figure: bi-directional English Encoder]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Page 44:

Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language

[Figure: bi-directional English Encoder → French Decoder]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Page 45:

Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language

[Figure: French Decoder attending over the encoder hidden states]

Build a small neural network (one layer) with a softmax output that takes (1) everything decoded so far (encoded by the previous decoder state z_i) and (2) the encoding of the current source word (the encoder hidden state h_j), and predicts the relevance of every source word to the next translated word.

[ Cho et al., 2015 ]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Page 46:

Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language

[Figure: French Decoder attending over the encoder hidden states]

Build a small neural network (one layer) with a softmax output that takes (1) everything decoded so far (encoded by the previous decoder state z_i) and (2) the encoding of the current source word (the encoder hidden state h_j), and predicts the relevance of every source word to the next translated word.

The resulting weights α_j are used to form the context vector:

c_i = \sum_{j=1}^{T} \alpha_j h_j

[ Cho et al., 2015 ]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/
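A NumPy sketch of the attention step described above, in the spirit of Bahdanau-style additive attention: a one-layer network scores each encoder state h_j against the previous decoder state z_i, a softmax turns the scores into weights α_j, and the context vector is c_i = Σ_j α_j h_j. All weight matrices and sizes here are illustrative stand-ins, not the exact parameterization from the slides.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_context(z_prev, H, Wa, Ua, va):
    """z_prev: (Dz,) previous decoder state; H: (T, Dh) encoder states."""
    scores = np.array([va @ np.tanh(Wa @ z_prev + Ua @ h_j) for h_j in H])  # relevance of each source word
    alpha = softmax(scores)                    # attention weights over source positions
    c_i = alpha @ H                            # c_i = sum_j alpha_j * h_j
    return alpha, c_i

T, Dh, Dz, Da = 6, 16, 16, 32
rng = np.random.default_rng(0)
H = rng.standard_normal((T, Dh))
alpha, c = attention_context(rng.standard_normal(Dz), H,
                             rng.standard_normal((Da, Dz)),
                             rng.standard_normal((Da, Dh)),
                             rng.standard_normal(Da))
```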

Page 47:

Attention Mechanisms and RNNs [ Cho et al., 2015 ]

https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/

Page 48:

Regularization in RNNs

Standard dropout in recurrent layers does not work because it causes loss of long-term memory!

* slide from Marco Pedersoli and Thomas Lucas

Page 49:

Regularization in RNNs

Standard dropout in recurrent layers does not work because it causes loss of long-term memory!

— Dropout in input-to-hidden or hidden-to-output layers
— Apply dropout at the sequence level (the same zeroed units for the entire sequence; a sketch follows below)
— Dropout only at the cell update (for LSTM and GRU units)
— Enforce the norm of the hidden state to be similar over time
— Zoneout: some hidden units copy their state to the next timestep

[ Zaremba et al., 2014 ]

[ Gal, 2016 ]

[ Semeniuta et al., 2016 ]

[ Krueger & Memisevic, 2016 ]

[ Krueger et al., 2016 ]

* slide from Marco Pedersoli and Thomas Lucas
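A hedged sketch of the sequence-level ("variational") dropout idea (cf. Gal, 2016): one dropout mask is sampled per sequence and reused at every time step, so the same units are zeroed for the whole sequence. The module below is an illustrative hand-rolled version, not a specific library API.

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):                       # x: (batch, time, features)
        if not self.training or self.p == 0.0:
            return x
        # Sample ONE mask per (batch, feature) and broadcast it over time.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)

drop = LockedDropout(p=0.3)
rnn = nn.LSTM(8, 16, batch_first=True)
x = torch.randn(4, 50, 8)
out, _ = rnn(drop(x))
```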

Page 50:

Teacher Forcing

[Figure: character-level RNN: at each step a softmax over the vocabulary is computed and a character is sampled, producing “e”, “l”, “l”, “o”]

Training Objective: Predict the next word (cross-entropy loss)

Testing: Sample the full sequence

Page 51:

Teacher Forcing

[Figure: the same character-level RNN sampling example]

Training Objective: Predict the next word (cross-entropy loss)

Testing: Sample the full sequence

Training and testing objectives are not consistent!

Page 52:

Teacher Forcing

Slowly move from Teacher Forcing to Sampling

Probability of sampling from the ground truth

[ Bengio et al., 2015 ]

* slide from Marco Pedersoli and Thomas Lucas
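A sketch of scheduled sampling under illustrative assumptions: at each decoding step during training, the ground-truth previous token is fed with probability eps (teacher forcing) and the model's own sample otherwise, with eps decayed over training (e.g., with the inverse-sigmoid schedule mentioned on the next slide). Model pieces and the decay schedule are stand-ins.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, Hdim = 100, 32, 64
embed, cell, head = nn.Embedding(V, E), nn.LSTMCell(E, Hdim), nn.Linear(Hdim, V)

def train_step(tokens, eps):
    """tokens: (T,) LongTensor, one target sequence; eps: probability of teacher forcing."""
    h = torch.zeros(1, Hdim)
    c = torch.zeros(1, Hdim)
    inp = tokens[0].view(1)
    loss = 0.0
    for t in range(1, tokens.size(0)):
        h, c = cell(embed(inp), (h, c))
        logits = head(h)
        loss = loss + F.cross_entropy(logits, tokens[t].view(1))
        if random.random() < eps:
            inp = tokens[t].view(1)                                        # teacher forcing
        else:
            inp = torch.multinomial(F.softmax(logits, dim=-1), 1).view(1)  # model's own sample
    return loss / (tokens.size(0) - 1)

# eps would start near 1.0 (mostly teacher forcing) and decay toward 0 (mostly sampling).
```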

Page 53:

Baseline: Google NIC captioning model

Baseline with Dropout: regularized RNN version

Always sampling: use sampling from the beginning of training

Scheduled sampling: sampling with inverse sigmoid decay

Uniform scheduled sampling: scheduled sampling, but with a uniform (fixed) sampling probability

Teacher Forcing

* slide from Marco Pedersoli and Thomas Lucas

Page 54:

Sequence Level Training

During training the objective is different than at test time:
— Training: generate the next word given the previous ones
— Test: generate the entire sequence given an initial state

Directly optimize the evaluation metric (e.g., BLEU score for sentence generation)

Cast the problem as Reinforcement Learning (see the sketch below):
— The RNN is an Agent
— The Policy is defined by the learned parameters
— An Action is the selection of the next word based on the policy
— The Reward is the evaluation metric

* slide from Marco Pedersoli and Thomas Lucas

[ Ranzato et al., 2016 ]
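A hedged REINFORCE-style sketch of sequence-level training in the spirit of Ranzato et al., 2016: sample a full sequence from the RNN "policy", score it with the evaluation metric (the reward), and weight the log-probabilities of the sampled actions by (reward − baseline). The metric, model, and baseline here are simplified stand-ins, not the full method from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, Hdim, T = 100, 32, 64, 12
embed, cell, head = nn.Embedding(V, E), nn.LSTMCell(E, Hdim), nn.Linear(Hdim, V)

def sentence_reward(sampled, reference):
    # Stand-in for an evaluation metric such as BLEU: fraction of positions that match.
    return (sampled == reference).float().mean()

def reinforce_loss(reference, baseline=0.5):
    h, c = torch.zeros(1, Hdim), torch.zeros(1, Hdim)
    inp = reference[0].view(1)            # start token (illustrative)
    log_probs, sampled = [], [inp]
    for _ in range(T - 1):
        h, c = cell(embed(inp), (h, c))
        dist = torch.distributions.Categorical(logits=head(h))
        action = dist.sample()            # action = pick the next word from the policy
        log_probs.append(dist.log_prob(action))
        sampled.append(action)
        inp = action
    reward = sentence_reward(torch.cat(sampled), reference)
    return -(reward - baseline) * torch.stack(log_probs).sum()

loss = reinforce_loss(torch.randint(0, V, (T,)))
loss.backward()
```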