Transcript of tutorial.caffe.berkeleyvision.org/caffe-cvpr15-sequences.pdf
Sequences in Caffe
Jeff Donahue
CVPR Caffe Tutorial June 6, 2015
Sequence Learning
• Instances of the form x = <x1, x2, x3, …, xT>
• Variable sequence length T
• Learn a transition function f with parameters W:
• f should update hidden state ht and output yt
h0 := 0
for t = 1, 2, 3, …, T:
<yt, ht> = fW(xt, ht-1)
[Diagram: the transition function f takes the input xt and the previous hidden state ht-1 and produces the output yt and the new hidden state ht]
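As a rough illustration (plain Python, not Caffe code), the recurrence above can be written as a short loop; the transition function fW is left abstract here and is filled in on the next slides:

    import numpy as np

    def run_sequence(xs, f_W, hidden_size):
        """Unroll the recurrence: h0 = 0, then (yt, ht) = fW(xt, ht-1)."""
        h = np.zeros(hidden_size)        # h0 := 0
        ys = []
        for x_t in xs:                   # t = 1, 2, ..., T (T may vary per instance)
            y_t, h = f_W(x_t, h)
            ys.append(y_t)
        return ys, h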
Sequence Learning
[Diagram: the recurrence unrolled in time: starting from h0 = 0, f is applied at every step, mapping (x1, h0) to (z1, h1), (x2, h1) to (z2, h2), …, (xT, hT-1) to (zT, hT)]
Equivalent to a T-layer deep network, unrolled in time
Sequence Learning
• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable
[Diagram: f maps (xt, ht-1) to (zt, ht)]
Sequence Learning
• A “vanilla” RNN:
ht = σ(Whx·xt + Whh·ht-1 + bh)
zt = σ(Whz·ht + bz)
• Problems
• Difficult to train — vanishing/exploding gradients
• Unable to “select” inputs, hidden state, outputs
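A minimal NumPy sketch of the “vanilla” RNN above (the sizes and initialization are illustrative assumptions, not from the slides); the step function can be plugged into the run_sequence loop sketched earlier:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    D, H = 50, 100                               # assumed input / hidden sizes
    rng = np.random.default_rng(0)
    W_hx = rng.standard_normal((H, D)) * 0.01    # input-to-hidden weights
    W_hh = rng.standard_normal((H, H)) * 0.01    # hidden-to-hidden weights
    W_hz = rng.standard_normal((D, H)) * 0.01    # hidden-to-output weights
    b_h, b_z = np.zeros(H), np.zeros(D)

    def vanilla_rnn_step(x_t, h_prev):
        # ht = σ(Whx·xt + Whh·ht-1 + bh);  zt = σ(Whz·ht + bz)
        h_t = sigmoid(W_hx @ x_t + W_hh @ h_prev + b_h)
        z_t = sigmoid(W_hz @ h_t + b_z)
        return z_t, h_t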
Sequence Learning
Long Short-Term Memory (LSTM)
Proposed by Hochreiter and Schmidhuber, 1997
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
• Allows long-term dependencies to be learned
• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
[Diagram: with the forget gate at 1 and the input gate at 0, the cell exactly remembers the previous cell state ct-1 and discards the input]
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
[Diagram: with the forget gate at 0 and the input gate at 1, the cell forgets the previous cell state and only remembers the transformed input Wxt]
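The two diagrams above show the LSTM's gating at its extremes (copy the cell exactly vs. overwrite it with the input). For reference, here is a NumPy sketch of one LSTM step in the common formulation with input, forget, and output gates; the sizes and weights are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    D, H = 50, 100                                # assumed input / hidden sizes
    rng = np.random.default_rng(0)
    # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
    W_i, W_f, W_o, W_g = (rng.standard_normal((H, D + H)) * 0.01 for _ in range(4))
    b_i, b_f, b_o, b_g = (np.zeros(H) for _ in range(4))

    def lstm_step(x_t, h_prev, c_prev):
        # With i near 0 and f near 1 the cell copies c_prev exactly (remember, discard input);
        # with i near 1 and f near 0 it is overwritten by the new input (forget previous state).
        xh = np.concatenate([x_t, h_prev])
        i = sigmoid(W_i @ xh + b_i)   # input gate: how much new input to write
        f = sigmoid(W_f @ xh + b_f)   # forget gate: how much of c_prev to keep
        o = sigmoid(W_o @ xh + b_o)   # output gate: how much of the cell to expose
        g = np.tanh(W_g @ xh + b_g)   # candidate cell contents
        c_t = f * c_prev + i * g
        h_t = o * np.tanh(c_t)
        return h_t, c_t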
Activity Recognition
[Diagram: sequential input; each video frame passes through a CNN into an LSTM, the per-frame predictions (talking, yawning, phoning, phoning) are averaged, and the clip is labeled "phoning"]
Image Description
[Diagram: sequential output; a CNN encodes the image and an LSTM generates the caption word by word: <BOS> a man is talking <EOS>]
Video Description
[Diagram: sequential input and output; a CNN encodes each frame, an LSTM encoder reads the frame features, and an LSTM decoder then emits the caption: <BOS> a man is talking <EOS>]
Sequence learning features are now available in Caffe. Check out PR #2033, “Unrolled recurrent layers (RNN, LSTM)”.
Training Sequence Models
• At training time, we want the model to predict the next time step given all previous time steps: p(wt+1 | w1:t)
• Example: “A bee buzzes.”

t   input    output
0   <BOS>    a
1   a        bee
2   bee      buzzes
3   buzzes   <EOS>
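A short Python sketch of how one sentence becomes next-word training pairs (the tokenization here is illustrative, not Caffe code):

    # "A bee buzzes." from the slide, lowercased and split into tokens.
    sentence = ["a", "bee", "buzzes"]
    tokens = ["<BOS>"] + sentence + ["<EOS>"]

    inputs  = tokens[:-1]   # <BOS> a bee buzzes
    targets = tokens[1:]    # a    bee buzzes <EOS>

    for t, (w_in, w_out) in enumerate(zip(inputs, targets)):
        print(t, w_in, "->", w_out)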
Sequence Input Format
• First input: “cont” (continuation) indicators (T x N)
• Second input: data (T x N x D)

                 batch 1                                 batch 2 …
stream 1 data:   a    dog  fetches  <EOS>  the   bee   | buzzes  <EOS>  a      …
stream 1 cont:   0    1    1        1      0     1     | 1       1      0      …
stream 2 data:   cat  in   a        hat    <EOS> a     | tree    falls  <EOS>  …
stream 2 cont:   0    1    1        1      1     0     | 1       1      1      …

N = 2, T = 6
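A sketch of how the cont indicators for one stream could be built (plain Python, not the actual Caffe data layer; the helper name is illustrative):

    def build_stream(sentences):
        # Concatenate sentences into one stream; cont = 0 at the first token
        # of each sentence, 1 everywhere else.
        words, cont = [], []
        for sent in sentences:
            for i, w in enumerate(sent):
                words.append(w)
                cont.append(0 if i == 0 else 1)
        return words, cont

    stream1 = [["a", "dog", "fetches", "<EOS>"], ["the", "bee", "buzzes", "<EOS>"]]
    words, cont = build_stream(stream1)
    # words: a dog fetches <EOS> the bee buzzes <EOS>
    # cont:  0 1   1       1     0   1   1      1

    # Chop each stream into chunks of T timesteps; stacking N streams gives the
    # T x N cont blob and the T x N x D data blob.
    T = 6
    chunks = [(words[i:i + T], cont[i:i + T]) for i in range(0, len(words), T)]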
Sequence Input Format
• Inference is exact over infinitely many batches: the hidden state carries across batch boundaries
• Backpropagation is approximate: gradients are truncated at batch boundaries
(same two-stream example as on the previous slide: N = 2, T = 6)
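In the same NumPy spirit as the earlier sketches (not Caffe code), the cont indicator decides whether the hidden state is carried forward or reset; at inference the final state of one batch seeds the next batch, while gradients never flow across the boundary:

    import numpy as np

    def run_batch(xs, conts, h, step_fn):
        # cont = 0 resets the state (a new sequence starts at this timestep);
        # cont = 1 carries the previous state forward, including the final
        # state of the previous batch.
        for x_t, cont_t in zip(xs, conts):
            h_prev = h if cont_t == 1 else np.zeros_like(h)
            _, h = step_fn(x_t, h_prev)
        return h   # passed on to the next batch; backprop stops at the boundary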
Sequence Input Format
• Words are usually represented as one-hot vectors
[Diagram: a vocabulary with 8000 words, each assigned an index (a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999); the word “bee” becomes an 8000-D vector that is all 0s except for a 1 at the index of the word]
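A one-hot vector for “bee” in plain NumPy, using the indices shown on the slide:

    import numpy as np

    vocab_size = 8000
    # Indices from the slide's example vocabulary.
    word_to_index = {"a": 0, "the": 1, "am": 2, "cat": 3999,
                     "bee": 4000, "dog": 4001, "run": 7999}

    def one_hot(word):
        v = np.zeros(vocab_size)
        v[word_to_index[word]] = 1.0   # all 0s except at the word's index
        return v

    x_bee = one_hot("bee")             # 8000-D; x_bee[4000] == 1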
Sequence Input Format
• EmbedLayer projects the one-hot vector into a lower-dimensional embedding
[Diagram: the one-hot vector for “bee” (an 8000-D vector, all 0s except a 1 at index 4000, the index of the word “bee” in our vocabulary) is fed into the Embed layer]
layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}
[Output: a 500-D embedding of “bee”, e.g. 2.7e-2 1.5e-1 3.1e+2 …]
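What the embedding computes, sketched in NumPy (the weight matrix is random here; in Caffe it is a learned parameter of the Embed layer):

    import numpy as np

    input_dim, num_output = 8000, 500          # as in the embed_param above
    rng = np.random.default_rng(0)
    W = rng.standard_normal((input_dim, num_output)) * 0.01

    bee_index = 4000
    # Multiplying the one-hot vector by W just selects row 4000 of W, which is
    # why an embedding can take the integer word index instead of the full
    # 8000-D one-hot vector.
    one_hot_bee = np.zeros(input_dim)
    one_hot_bee[bee_index] = 1.0
    via_matmul = one_hot_bee @ W               # shape (500,)
    via_lookup = W[bee_index]                  # identical, and much cheaper
    assert np.allclose(via_matmul, via_lookup)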
Image Description
Video Description
[Diagram: the same sequence-to-sequence architecture as above; CNN frame features feed an LSTM encoder (sequential input), and an LSTM decoder produces the caption <BOS> a man is talking <EOS> (sequential output)]
Venugopalan et al., “Sequence to Sequence -- Video to Text,” 2015. http://arxiv.org/abs/1505.00487
N timesteps: watch the video; M timesteps: produce the caption