Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 2 (by Kate Saenko)


Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 2 (by Kate Saenko)

Page 1

Modeling Images, Videos and Text

Using the Caffe Deep Learning

Library (Part 2)

Kate Saenko

Microsoft Summer Machine Learning School, St Petersburg 2015

Page 2

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 3

(outline repeated)

Page 4

Intro to Caffe

Caffe Tour: follow along with the tutorial slides!

http://tutorial.caffe.berkeleyvision.org/

Page 5

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 6

Sequence Learning

• Instances of the form x = <x1, x2, x3, …, xT>
• Variable sequence length T
• Learn a transition function f with parameters W
• f should update the hidden state ht and produce the output yt:

h0 := 0
for t = 1, 2, 3, …, T:
    <yt, ht> = fW(xt, ht-1)

[Diagram: f maps (xt, ht-1) to (yt, ht)]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
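The update loop above can be sketched in plain Python, with the transition f left pluggable. The running-sum f here is a toy stand-in to show the state threading, not a real network:

```python
def unroll(f, xs, h0):
    """Apply transition function f over a sequence, threading hidden state.

    f maps (x_t, h_prev) -> (y_t, h_t); xs is the input sequence.
    Returns the list of outputs y_1..y_T.
    """
    h = h0
    ys = []
    for x in xs:
        y, h = f(x, h)
        ys.append(y)
    return ys

# Toy transition: running sum as hidden state, output = current state.
f = lambda x, h: (h + x, h + x)
print(unroll(f, [1, 2, 3], 0))  # [1, 3, 6]
```

Any f with this (x, h) -> (y, h) signature, including the RNN and LSTM cells on the following slides, slots into the same loop.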

Page 7

Sequence Learning

[Diagram: the recurrence unrolled over time: f is applied at t = 1, 2, …, T, threading h0, h1, …, hT-1 forward and producing z1, z2, …, zT]

Equivalent to a T-layer deep network, unrolled in time

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 8

Sequence Learning

• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable

[Diagram: f maps (xt, ht-1) to (zt, ht)]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 9

Sequence Learning

• A "vanilla" RNN:

    ht = σ(Whx xt + Whh ht-1 + bh)
    zt = σ(Whz ht + bz)

• Problems:
  • Difficult to train: vanishing/exploding gradients
  • Unable to "select" inputs, hidden state, outputs

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
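A minimal NumPy sketch of this vanilla RNN step; the weight shapes follow the two equations above, while the toy sizes and random initialization are illustrative assumptions:

```python
import numpy as np

def rnn_step(x, h_prev, Whx, Whh, bh, Whz, bz):
    """One vanilla RNN step: h = sigma(Whx x + Whh h_prev + bh), z = sigma(Whz h + bz)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic nonlinearity
    h = sigma(Whx @ x + Whh @ h_prev + bh)
    z = sigma(Whz @ h + bz)
    return z, h

rng = np.random.default_rng(0)
D, H, Z = 4, 3, 2                                # input, hidden, output sizes
x, h0 = rng.normal(size=D), np.zeros(H)
z, h = rnn_step(x, h0,
                rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H),
                rng.normal(size=(Z, H)), np.zeros(Z))
print(z.shape, h.shape)  # (2,) (3,)
```

Note that the repeated multiplication by Whh across time steps is exactly what makes gradients vanish or explode in long sequences.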

Page 10

Background - LSTM Unit

• RNNs have the vanishing gradient problem
• Solution: the LSTM of Hochreiter & Schmidhuber (1997)
• Keeps information for a long time in a memory cell:
  – keep gate = 1 maintains the state
  – write gate allows input
  – read gate allows output
  – all gates are differentiable

[Diagram: a memory cell (holding 1.73) connected to the rest of the RNN through write, keep, and read gates]

based on slide by Geoff Hinton

Page 11

Background - LSTM Unit

[Diagram: the memory cell across three time steps: the write gate (1) stores 1.7 into the cell, the keep gate (1) holds 1.7 while read and write are closed, and the read gate (1) later retrieves 1.7]

based on slide by Geoff Hinton
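The keep/write/read behaviour in that figure can be simulated with a toy scalar cell. This is a deliberate simplification: real LSTM gates are learned sigmoids, not hard 0/1 switches:

```python
def lstm_cell_toy(events, c=0.0):
    """Toy scalar memory cell: c = keep*c + write*inp; output = read*c.

    Each event is (keep, write, read, inp). Returns (cell, output) per step.
    """
    trace = []
    for keep, write, read, inp in events:
        c = keep * c + write * inp   # keep gate holds state, write gate admits input
        trace.append((c, read * c))  # read gate releases the stored value
    return trace

# Store 1.7, hold it for a step, then read it out, as in the slide's example.
events = [(0, 1, 0, 1.7),   # write gate open: cell <- 1.7
          (1, 0, 0, 0.0),   # keep gate open: cell holds 1.7, no output
          (1, 0, 1, 0.0)]   # read gate open: output 1.7
print(lstm_cell_toy(events))  # [(1.7, 0.0), (1.7, 0.0), (1.7, 1.7)]
```

Because every step is a product and sum, the whole trace is differentiable when the gates take values between 0 and 1.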

Page 12

Sequence Learning

Long Short-Term Memory (LSTM)

Proposed by Hochreiter and Schmidhuber, 1997

Page 13

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

• Allows long-term dependencies to be learned
• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing
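A compact sketch of one full LSTM step in NumPy, using the now-standard input/forget/output-gate formulation; the stacked-weight layout and the toy sizes are assumptions for illustration:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [x, h_prev] to the stacked
    pre-activations of the input (i), forget (f), output (o) gates and
    the candidate update (g)."""
    H = h_prev.size
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    a = W @ np.concatenate([x, h_prev]) + b          # shape (4H,)
    i, f, o = sigma(a[:H]), sigma(a[H:2*H]), sigma(a[2*H:3*H])
    g = np.tanh(a[3*H:])
    c = f * c_prev + i * g                           # forget old, write new
    h = o * np.tanh(c)                               # read the gated cell state
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3
x = rng.normal(size=D)
h, c = lstm_step(x, np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, D + H)), np.zeros(4 * H))
print(h.shape, c.shape)  # (3,) (3,)
```

The additive update c = f*c_prev + i*g is what lets gradients flow over long spans: with f near 1 the cell state passes through steps almost unchanged.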

Page 14

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: forget gate = 1, input gate = 0: exactly remember the previous cell state ct-1 and discard the input]

Jeff Donahue

Page 15

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: forget gate = 0, input gate = 1: forget the previous cell state and only remember the input Wxt]

Jeff Donahue

Page 16

Image Description

[Diagram: a CNN encodes the image; a chain of LSTM units then produces the sequential output "<BOS> a man is talking <EOS>"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 17

Video Description

[Diagram: sequential input & output: CNNs encode each video frame into stacked LSTMs; after <BOS>, a chain of LSTMs decodes the caption "a man is talking <EOS>"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 18

Sequence learning features now available in Caffe.

Check out PR #2033 “Unrolled recurrent layers (RNN, LSTM)”

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 19

Training Sequence Models

• At training time, we want the model to predict the next time step given all previous time steps: p(wt+1 | w1:t)
• Example: "A bee buzzes."

  t | input  | output
  0 | <BOS>  | a
  1 | a      | bee
  2 | bee    | buzzes
  3 | buzzes | <EOS>

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
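The shifted input/output pairing in that table is easy to generate; a minimal Python sketch, with token names following the slide's <BOS>/<EOS> convention:

```python
def training_pairs(caption):
    """Shift a tokenized caption to (input, target) pairs for next-word prediction."""
    tokens = ["<BOS>"] + caption.split() + ["<EOS>"]
    # Input at step t is token t; target is token t+1.
    return list(zip(tokens[:-1], tokens[1:]))

print(training_pairs("a bee buzzes"))
# [('<BOS>', 'a'), ('a', 'bee'), ('bee', 'buzzes'), ('buzzes', '<EOS>')]
```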

Page 20

Sequence Input Format

• First input: "cont" (continuation) indicators (T x N)
• Second input: data (T x N x D)

Example with N = 2 streams and T = 6 timesteps per batch (batch 1, then batch 2 …):

  stream 1: a dog fetches <EOS> the cat | in a hat <EOS> …
  cont:     0 1   1       1     0   1   | 1  1 1   1    …
  stream 2: a bee buzzes <EOS> a tree   | falls <EOS> …
  cont:     0 1   1       1    0 1      | 1     1    …

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
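A sketch of how such cont indicators could be derived from packed token streams; the helper name and packing convention are assumptions, and the example sentences follow the slide:

```python
def cont_indicators(streams):
    """cont = 0 at the first token of each sentence, 1 while it continues.

    streams: list of N token lists, each already packed to length T.
    Returns a T x N list of 0/1 indicators.
    """
    T, N = len(streams[0]), len(streams)
    cont = [[0] * N for _ in range(T)]
    for n, toks in enumerate(streams):
        prev_end = True                      # a stream start behaves like a boundary
        for t, tok in enumerate(toks):
            cont[t][n] = 0 if prev_end else 1
            prev_end = (tok == "<EOS>")      # next token starts a new sentence
    return cont

streams = [["a", "dog", "fetches", "<EOS>", "the", "cat"],
           ["a", "bee", "buzzes", "<EOS>", "a", "tree"]]
print(cont_indicators(streams))
```

Reading down a column reproduces the slide's 0 1 1 1 0 1 pattern: a 0 wherever a new sentence begins mid-stream.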

Page 21

Sequence Input Format

• Inference is exact over infinite batches
• Backpropagation is approximate: truncated at batch boundaries

(same example batches, N = 2, T = 6, as on the previous slide)

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 22

Sequence Input Format

• Words are usually represented as one-hot vectors

[Diagram: the word "bee" as an 8000-D vector, all 0s except a 1 at the index of the word; in a vocabulary of 8000 words (a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999), "bee" sits at index 4000]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
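A one-liner makes the encoding concrete; the vocabulary size and the index of "bee" follow the slide's example:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vocab_size-dim vector that is all zeros except a 1 at `index`."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

bee = one_hot(4000, 8000)    # "bee" at index 4000 in an 8000-word vocabulary
print(bee.sum(), bee[4000])  # 1.0 1.0
```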

Page 23

Sequence Input Format

• EmbedLayer projects the one-hot vector to a dense embedding:

layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}

[Diagram: the 8000-D one-hot vector for "bee" (1 at index 4000, the index of "bee" in our vocabulary) is mapped to a 500-D embedding of "bee"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
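Multiplying a one-hot vector by a learned 8000 x 500 weight matrix just selects one row, which is why an embedding layer can be computed as a table lookup instead of a full matrix product. A NumPy sketch, where the random table stands in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(8000, 500))   # embedding table: vocab size x embedding dims

def embed(index):
    """Equivalent to one_hot(index) @ E: a plain row lookup."""
    return E[index]

onehot = np.zeros(8000)
onehot[4000] = 1.0                            # one-hot vector for "bee"
assert np.allclose(onehot @ E, embed(4000))   # same 500-D vector either way
print(embed(4000).shape)  # (500,)
```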

Page 24

Image Description

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 25

Image Description

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 26

Video Description

[Diagram: S2VT: CNNs encode the video frames into stacked LSTMs over N timesteps (watch video); after <BOS>, the LSTMs decode the caption "a man is talking <EOS>" over M timesteps (produce caption)]

Venugopalan et al., "Sequence to Sequence -- Video to Text," 2015.
http://arxiv.org/abs/1505.00487

Jeff Donahue
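The encode-then-decode timestep schedule can be sketched in Python; the function name and tuple layout here are illustrative assumptions, not the paper's API:

```python
def s2vt_schedule(n_frames, caption):
    """Sketch the S2VT-style schedule: watch N frames, then emit M words.

    During encoding the LSTM reads frame features and emits nothing (None);
    during decoding it reads <BOS>/the previous word and emits the caption.
    Returns a list of (input, target) pairs, one per timestep.
    """
    steps = [(f"frame_{t}", None) for t in range(n_frames)]   # encoding phase
    prev = "<BOS>"
    for word in caption + ["<EOS>"]:                          # decoding phase
        steps.append((prev, word))
        prev = word
    return steps

print(s2vt_schedule(2, ["a", "man", "is", "talking"]))
```

The single list makes the N + M structure explicit: frame inputs with no targets, then the shifted next-word pairs familiar from the training slide.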

Page 27

Language model trained on image captions (notebook)

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015