Transcript of tutorial.caffe.berkeleyvision.org/caffe-cvpr15-sequences.pdf
Sequences in Caffe
Jeff Donahue
CVPR Caffe Tutorial June 6, 2015
Sequence Learning
• Instances of the form x = <x1, x2, x3, …, xT>
• Variable sequence length T
• Learn a transition function f with parameters W:
• f should update hidden state ht and output yt
h0 := 0
for t = 1, 2, 3, …, T:
<yt, ht> = fW(xt, ht-1)
[Diagram: the transition function f takes the input xt and the previous hidden state ht-1 and produces the output yt and the new hidden state ht]
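As a rough illustration (plain Python, not Caffe code), the recurrence above can be written as a short loop; the transition function fW is left abstract here and is filled in on the next slides:

    import numpy as np

    def run_sequence(xs, f_W, hidden_size):
        """Unroll the recurrence: h0 = 0, then (yt, ht) = fW(xt, ht-1)."""
        h = np.zeros(hidden_size)        # h0 := 0
        ys = []
        for x_t in xs:                   # t = 1, 2, ..., T (T may vary per instance)
            y_t, h = f_W(x_t, h)
            ys.append(y_t)
        return ys, h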
Sequence Learning
[Diagram: the recurrence unrolled in time: starting from h0 = 0, f is applied at every step, mapping (x1, h0) to (z1, h1), (x2, h1) to (z2, h2), …, (xT, hT-1) to (zT, hT)]
Equivalent to a T-layer deep network, unrolled in time
Sequence Learning
• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable
[Diagram: f maps (xt, ht-1) to (zt, ht)]
Sequence Learning
• A “vanilla” RNN:
ht = σ(Whx·xt + Whh·ht-1 + bh)
zt = σ(Whz·ht + bz)
• Problems
• Difficult to train — vanishing/exploding gradients
• Unable to “select” inputs, hidden state, outputs
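A minimal NumPy sketch of the “vanilla” RNN above (the sizes and initialization are illustrative assumptions, not from the slides); the step function can be plugged into the run_sequence loop sketched earlier:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    D, H = 50, 100                               # assumed input / hidden sizes
    rng = np.random.default_rng(0)
    W_hx = rng.standard_normal((H, D)) * 0.01    # input-to-hidden weights
    W_hh = rng.standard_normal((H, H)) * 0.01    # hidden-to-hidden weights
    W_hz = rng.standard_normal((D, H)) * 0.01    # hidden-to-output weights
    b_h, b_z = np.zeros(H), np.zeros(D)

    def vanilla_rnn_step(x_t, h_prev):
        # ht = σ(Whx·xt + Whh·ht-1 + bh);  zt = σ(Whz·ht + bz)
        h_t = sigmoid(W_hx @ x_t + W_hh @ h_prev + b_h)
        z_t = sigmoid(W_hz @ h_t + b_z)
        return z_t, h_t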
Sequence Learning
Long Short-Term Memory (LSTM)
Proposed by Hochreiter and Schmidhuber, 1997
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
• Allows long-term dependencies to be learned
• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
[Diagram: with the forget gate at 1 and the input gate at 0, the cell exactly remembers the previous cell state ct-1 and discards the input]
Sequence Learning
LSTM (Hochreiter & Schmidhuber, 1997)
[Diagram: with the forget gate at 0 and the input gate at 1, the cell forgets the previous cell state and only remembers the transformed input Wxt]
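The two diagrams above show the LSTM's gating at its extremes (copy the cell exactly vs. overwrite it with the input). For reference, here is a NumPy sketch of one LSTM step in the common formulation with input, forget, and output gates; the sizes and weights are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    D, H = 50, 100                                # assumed input / hidden sizes
    rng = np.random.default_rng(0)
    # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
    W_i, W_f, W_o, W_g = (rng.standard_normal((H, D + H)) * 0.01 for _ in range(4))
    b_i, b_f, b_o, b_g = (np.zeros(H) for _ in range(4))

    def lstm_step(x_t, h_prev, c_prev):
        # With i near 0 and f near 1 the cell copies c_prev exactly (remember, discard input);
        # with i near 1 and f near 0 it is overwritten by the new input (forget previous state).
        xh = np.concatenate([x_t, h_prev])
        i = sigmoid(W_i @ xh + b_i)   # input gate: how much new input to write
        f = sigmoid(W_f @ xh + b_f)   # forget gate: how much of c_prev to keep
        o = sigmoid(W_o @ xh + b_o)   # output gate: how much of the cell to expose
        g = np.tanh(W_g @ xh + b_g)   # candidate cell contents
        c_t = f * c_prev + i * g
        h_t = o * np.tanh(c_t)
        return h_t, c_t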
Activity Recognition
[Diagram: sequential input; each video frame passes through a CNN into an LSTM, the per-frame predictions (talking, yawning, phoning, phoning) are averaged, and the clip is labeled "phoning"]
Image Description
[Diagram: sequential output; a CNN encodes the image and an LSTM generates the caption word by word: <BOS> a man is talking <EOS>]
Video Description
[Diagram: sequential input and output; a CNN encodes each frame, an LSTM encoder reads the frame features, and an LSTM decoder then emits the caption: <BOS> a man is talking <EOS>]
Sequence learning features are now available in Caffe. Check out PR #2033, “Unrolled recurrent layers (RNN, LSTM)”.
Training Sequence Models
• At training time, we want the model to predict the next time step given all previous time steps: p(wt+1 | w1:t)
• Example: “A bee buzzes.”

t   input    output
0   <BOS>    a
1   a        bee
2   bee      buzzes
3   buzzes   <EOS>
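A short Python sketch of how one sentence becomes next-word training pairs (the tokenization here is illustrative, not Caffe code):

    # "A bee buzzes." from the slide, lowercased and split into tokens.
    sentence = ["a", "bee", "buzzes"]
    tokens = ["<BOS>"] + sentence + ["<EOS>"]

    inputs  = tokens[:-1]   # <BOS> a bee buzzes
    targets = tokens[1:]    # a    bee buzzes <EOS>

    for t, (w_in, w_out) in enumerate(zip(inputs, targets)):
        print(t, w_in, "->", w_out)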
Sequence Input Format
• First input: “cont” (continuation) indicators (T x N)
• Second input: data (T x N x D)

                 batch 1                                 batch 2 …
stream 1 data:   a    dog  fetches  <EOS>  the   bee   | buzzes  <EOS>  a      …
stream 1 cont:   0    1    1        1      0     1     | 1       1      0      …
stream 2 data:   cat  in   a        hat    <EOS> a     | tree    falls  <EOS>  …
stream 2 cont:   0    1    1        1      1     0     | 1       1      1      …

N = 2, T = 6
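A sketch of how the cont indicators for one stream could be built (plain Python, not the actual Caffe data layer; the helper name is illustrative):

    def build_stream(sentences):
        # Concatenate sentences into one stream; cont = 0 at the first token
        # of each sentence, 1 everywhere else.
        words, cont = [], []
        for sent in sentences:
            for i, w in enumerate(sent):
                words.append(w)
                cont.append(0 if i == 0 else 1)
        return words, cont

    stream1 = [["a", "dog", "fetches", "<EOS>"], ["the", "bee", "buzzes", "<EOS>"]]
    words, cont = build_stream(stream1)
    # words: a dog fetches <EOS> the bee buzzes <EOS>
    # cont:  0 1   1       1     0   1   1      1

    # Chop each stream into chunks of T timesteps; stacking N streams gives the
    # T x N cont blob and the T x N x D data blob.
    T = 6
    chunks = [(words[i:i + T], cont[i:i + T]) for i in range(0, len(words), T)]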
Sequence Input Format
• Inference is exact over infinitely many batches: the hidden state carries across batch boundaries
• Backpropagation is approximate: gradients are truncated at batch boundaries
(same two-stream example as on the previous slide: N = 2, T = 6)
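In the same NumPy spirit as the earlier sketches (not Caffe code), the cont indicator decides whether the hidden state is carried forward or reset; at inference the final state of one batch seeds the next batch, while gradients never flow across the boundary:

    import numpy as np

    def run_batch(xs, conts, h, step_fn):
        # cont = 0 resets the state (a new sequence starts at this timestep);
        # cont = 1 carries the previous state forward, including the final
        # state of the previous batch.
        for x_t, cont_t in zip(xs, conts):
            h_prev = h if cont_t == 1 else np.zeros_like(h)
            _, h = step_fn(x_t, h_prev)
        return h   # passed on to the next batch; backprop stops at the boundary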
Sequence Input Format
• Words are usually represented as one-hot vectors
[Diagram: a vocabulary with 8000 words, each assigned an index (a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999); the word “bee” becomes an 8000-D vector that is all 0s except for a 1 at the index of the word]
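A one-hot vector for “bee” in plain NumPy, using the indices shown on the slide:

    import numpy as np

    vocab_size = 8000
    # Indices from the slide's example vocabulary.
    word_to_index = {"a": 0, "the": 1, "am": 2, "cat": 3999,
                     "bee": 4000, "dog": 4001, "run": 7999}

    def one_hot(word):
        v = np.zeros(vocab_size)
        v[word_to_index[word]] = 1.0   # all 0s except at the word's index
        return v

    x_bee = one_hot("bee")             # 8000-D; x_bee[4000] == 1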
Sequence Input Format
• EmbedLayer projects the one-hot vector into a lower-dimensional embedding
[Diagram: the one-hot vector for “bee” (an 8000-D vector, all 0s except a 1 at index 4000, the index of the word “bee” in our vocabulary) is fed into the Embed layer]
layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}
[Output: a 500-D embedding of “bee”, e.g. 2.7e-2 1.5e-1 3.1e+2 …]
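What the embedding computes, sketched in NumPy (the weight matrix is random here; in Caffe it is a learned parameter of the Embed layer):

    import numpy as np

    input_dim, num_output = 8000, 500          # as in the embed_param above
    rng = np.random.default_rng(0)
    W = rng.standard_normal((input_dim, num_output)) * 0.01

    bee_index = 4000
    # Multiplying the one-hot vector by W just selects row 4000 of W, which is
    # why an embedding can take the integer word index instead of the full
    # 8000-D one-hot vector.
    one_hot_bee = np.zeros(input_dim)
    one_hot_bee[bee_index] = 1.0
    via_matmul = one_hot_bee @ W               # shape (500,)
    via_lookup = W[bee_index]                  # identical, and much cheaper
    assert np.allclose(via_matmul, via_lookup)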
Image Description
Video Description
[Diagram: the same sequence-to-sequence architecture as above; CNN frame features feed an LSTM encoder (sequential input), and an LSTM decoder produces the caption <BOS> a man is talking <EOS> (sequential output)]
Venugopalan et al., “Sequence to Sequence -- Video to Text,” 2015. http://arxiv.org/abs/1505.00487
N timesteps: watch the video; M timesteps: produce the caption