Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 2 (by Kate Saenko)


Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 2 (by Kate Saenko)

Page 1

Modeling Images, Videos and Text

Using the Caffe Deep Learning

Library (Part 2)

Kate Saenko

Microsoft Summer Machine Learning School, St Petersburg 2015

Page 2

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 3

(outline repeated)

Page 4

Intro to Caffe

Caffe Tour: follow along with the tutorial slides!

http://tutorial.caffe.berkeleyvision.org/

Page 5

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 6

Sequence Learning

• Instances of the form x = <x1, x2, x3, …, xT>
• Variable sequence length T
• Learn a transition function f with parameters W
• f should update the hidden state ht and produce the output yt:

h0 := 0
for t = 1, 2, 3, …, T:
    <yt, ht> = fW(xt, ht-1)

[Diagram: f maps (xt, ht-1) to (yt, ht)]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
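The update loop above can be sketched in plain Python, with the transition f left pluggable. The running-sum f here is a toy stand-in to show the state threading, not a real network:

```python
def unroll(f, xs, h0):
    """Apply transition function f over a sequence, threading hidden state.

    f maps (x_t, h_prev) -> (y_t, h_t); xs is the input sequence.
    Returns the list of outputs y_1..y_T.
    """
    h = h0
    ys = []
    for x in xs:
        y, h = f(x, h)
        ys.append(y)
    return ys

# Toy transition: running sum as hidden state, output = current state.
f = lambda x, h: (h + x, h + x)
print(unroll(f, [1, 2, 3], 0))  # [1, 3, 6]
```

Any f with this (x, h) -> (y, h) signature, including the RNN and LSTM cells on the following slides, slots into the same loop.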

Page 7

Sequence Learning

[Diagram: the recurrence unrolled over time: f is applied at t = 1, 2, …, T, threading h0, h1, …, hT-1 forward and producing z1, z2, …, zT]

Equivalent to a T-layer deep network, unrolled in time

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 8

Sequence Learning

• What should the transition function f be?
• At a minimum, we want something non-linear and differentiable

[Diagram: f maps (xt, ht-1) to (zt, ht)]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 9

Sequence Learning

• A "vanilla" RNN:

    ht = σ(Whx xt + Whh ht-1 + bh)
    zt = σ(Whz ht + bz)

• Problems:
  • Difficult to train: vanishing/exploding gradients
  • Unable to "select" inputs, hidden state, outputs

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
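A minimal NumPy sketch of this vanilla RNN step; the weight shapes follow the two equations above, while the toy sizes and random initialization are illustrative assumptions:

```python
import numpy as np

def rnn_step(x, h_prev, Whx, Whh, bh, Whz, bz):
    """One vanilla RNN step: h = sigma(Whx x + Whh h_prev + bh), z = sigma(Whz h + bz)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic nonlinearity
    h = sigma(Whx @ x + Whh @ h_prev + bh)
    z = sigma(Whz @ h + bz)
    return z, h

rng = np.random.default_rng(0)
D, H, Z = 4, 3, 2                                # input, hidden, output sizes
x, h0 = rng.normal(size=D), np.zeros(H)
z, h = rnn_step(x, h0,
                rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H),
                rng.normal(size=(Z, H)), np.zeros(Z))
print(z.shape, h.shape)  # (2,) (3,)
```

Note that the repeated multiplication by Whh across time steps is exactly what makes gradients vanish or explode in long sequences.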

Page 10

Background - LSTM Unit

• RNNs have the vanishing gradient problem
• Solution: the LSTM of Hochreiter & Schmidhuber (1997)
• Keeps information for a long time in a memory cell:
  – keep gate = 1 maintains the state
  – write gate allows input
  – read gate allows output
  – all gates are differentiable

[Diagram: a memory cell (holding 1.73) connected to the rest of the RNN through write, keep, and read gates]

based on slide by Geoff Hinton

Page 11

Background - LSTM Unit

[Diagram: the memory cell across three time steps: the write gate (1) stores 1.7 into the cell, the keep gate (1) holds 1.7 while read and write are closed, and the read gate (1) later retrieves 1.7]

based on slide by Geoff Hinton
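The keep/write/read behaviour in that figure can be simulated with a toy scalar cell. This is a deliberate simplification: real LSTM gates are learned sigmoids, not hard 0/1 switches:

```python
def lstm_cell_toy(events, c=0.0):
    """Toy scalar memory cell: c = keep*c + write*inp; output = read*c.

    Each event is (keep, write, read, inp). Returns (cell, output) per step.
    """
    trace = []
    for keep, write, read, inp in events:
        c = keep * c + write * inp   # keep gate holds state, write gate admits input
        trace.append((c, read * c))  # read gate releases the stored value
    return trace

# Store 1.7, hold it for a step, then read it out, as in the slide's example.
events = [(0, 1, 0, 1.7),   # write gate open: cell <- 1.7
          (1, 0, 0, 0.0),   # keep gate open: cell holds 1.7, no output
          (1, 0, 1, 0.0)]   # read gate open: output 1.7
print(lstm_cell_toy(events))  # [(1.7, 0.0), (1.7, 0.0), (1.7, 1.7)]
```

Because every step is a product and sum, the whole trace is differentiable when the gates take values between 0 and 1.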

Page 12

Sequence Learning

Long Short-Term Memory (LSTM)

Proposed by Hochreiter and Schmidhuber, 1997

Page 13

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

• Allows long-term dependencies to be learned
• Effective for:
  • speech recognition
  • handwriting recognition
  • translation
  • parsing
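A compact sketch of one full LSTM step in NumPy, using the now-standard input/forget/output-gate formulation; the stacked-weight layout and the toy sizes are assumptions for illustration:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [x, h_prev] to the stacked
    pre-activations of the input (i), forget (f), output (o) gates and
    the candidate update (g)."""
    H = h_prev.size
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    a = W @ np.concatenate([x, h_prev]) + b          # shape (4H,)
    i, f, o = sigma(a[:H]), sigma(a[H:2*H]), sigma(a[2*H:3*H])
    g = np.tanh(a[3*H:])
    c = f * c_prev + i * g                           # forget old, write new
    h = o * np.tanh(c)                               # read the gated cell state
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3
x = rng.normal(size=D)
h, c = lstm_step(x, np.zeros(H), np.zeros(H),
                 rng.normal(size=(4 * H, D + H)), np.zeros(4 * H))
print(h.shape, c.shape)  # (3,) (3,)
```

The additive update c = f*c_prev + i*g is what lets gradients flow over long spans: with f near 1 the cell state passes through steps almost unchanged.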

Page 14

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: forget gate = 1, input gate = 0: exactly remember the previous cell state ct-1 and discard the input]

Jeff Donahue

Page 15

Sequence Learning

LSTM (Hochreiter & Schmidhuber, 1997)

[Diagram: forget gate = 0, input gate = 1: forget the previous cell state and only remember the input Wxt]

Jeff Donahue

Page 16

Image Description

[Diagram: a CNN encodes the image; a chain of LSTM units then produces the sequential output "<BOS> a man is talking <EOS>"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 17

Video Description

[Diagram: sequential input & output: CNNs encode each video frame into stacked LSTMs; after <BOS>, a chain of LSTMs decodes the caption "a man is talking <EOS>"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 18

Sequence learning features now available in Caffe.

Check out PR #2033 “Unrolled recurrent layers (RNN, LSTM)”

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 19

Training Sequence Models

• At training time, we want the model to predict the next time step given all previous time steps: p(wt+1 | w1:t)
• Example: "A bee buzzes."

  t | input  | output
  0 | <BOS>  | a
  1 | a      | bee
  2 | bee    | buzzes
  3 | buzzes | <EOS>

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
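The shifted input/output pairing in that table is easy to generate; a minimal Python sketch, with token names following the slide's <BOS>/<EOS> convention:

```python
def training_pairs(caption):
    """Shift a tokenized caption to (input, target) pairs for next-word prediction."""
    tokens = ["<BOS>"] + caption.split() + ["<EOS>"]
    # Input at step t is token t; target is token t+1.
    return list(zip(tokens[:-1], tokens[1:]))

print(training_pairs("a bee buzzes"))
# [('<BOS>', 'a'), ('a', 'bee'), ('bee', 'buzzes'), ('buzzes', '<EOS>')]
```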

Page 20

Sequence Input Format

• First input: "cont" (continuation) indicators (T x N)
• Second input: data (T x N x D)

Example with N = 2 streams and T = 6 timesteps per batch (batch 1, then batch 2 …):

  stream 1: a dog fetches <EOS> the cat | in a hat <EOS> …
  cont:     0 1   1       1     0   1   | 1  1 1   1    …
  stream 2: a bee buzzes <EOS> a tree   | falls <EOS> …
  cont:     0 1   1       1    0 1      | 1     1    …

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
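A sketch of how such cont indicators could be derived from packed token streams; the helper name and packing convention are assumptions, and the example sentences follow the slide:

```python
def cont_indicators(streams):
    """cont = 0 at the first token of each sentence, 1 while it continues.

    streams: list of N token lists, each already packed to length T.
    Returns a T x N list of 0/1 indicators.
    """
    T, N = len(streams[0]), len(streams)
    cont = [[0] * N for _ in range(T)]
    for n, toks in enumerate(streams):
        prev_end = True                      # a stream start behaves like a boundary
        for t, tok in enumerate(toks):
            cont[t][n] = 0 if prev_end else 1
            prev_end = (tok == "<EOS>")      # next token starts a new sentence
    return cont

streams = [["a", "dog", "fetches", "<EOS>", "the", "cat"],
           ["a", "bee", "buzzes", "<EOS>", "a", "tree"]]
print(cont_indicators(streams))
```

Reading down a column reproduces the slide's 0 1 1 1 0 1 pattern: a 0 wherever a new sentence begins mid-stream.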

Page 21

Sequence Input Format

• Inference is exact over infinite batches
• Backpropagation is approximate: truncated at batch boundaries

(same example batches, N = 2, T = 6, as on the previous slide)

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 22

Sequence Input Format

• Words are usually represented as one-hot vectors

[Diagram: the word "bee" as an 8000-D vector, all 0s except a 1 at the index of the word; in a vocabulary of 8000 words (a = 0, the = 1, am = 2, …, cat = 3999, bee = 4000, dog = 4001, …, run = 7999), "bee" sits at index 4000]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
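A one-liner makes the encoding concrete; the vocabulary size and the index of "bee" follow the slide's example:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vocab_size-dim vector that is all zeros except a 1 at `index`."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

bee = one_hot(4000, 8000)    # "bee" at index 4000 in an 8000-word vocabulary
print(bee.sum(), bee[4000])  # 1.0 1.0
```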

Page 23

Sequence Input Format

• EmbedLayer projects the one-hot vector to a dense embedding:

layer {
  type: "Embed"
  embed_param {
    input_dim: 8000
    num_output: 500
  }
}

[Diagram: the 8000-D one-hot vector for "bee" (1 at index 4000, the index of "bee" in our vocabulary) is mapped to a 500-D embedding of "bee"]

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015
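Multiplying a one-hot vector by a learned 8000 x 500 weight matrix just selects one row, which is why an embedding layer can be computed as a table lookup instead of a full matrix product. A NumPy sketch, where the random table stands in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(8000, 500))   # embedding table: vocab size x embedding dims

def embed(index):
    """Equivalent to one_hot(index) @ E: a plain row lookup."""
    return E[index]

onehot = np.zeros(8000)
onehot[4000] = 1.0                            # one-hot vector for "bee"
assert np.allclose(onehot @ E, embed(4000))   # same 500-D vector either way
print(embed(4000).shape)  # (500,)
```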

Page 24

Image Description

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 25

Image Description

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Page 26

Video Description

[Diagram: S2VT: CNNs encode the video frames into stacked LSTMs over N timesteps (watch video); after <BOS>, the LSTMs decode the caption "a man is talking <EOS>" over M timesteps (produce caption)]

Venugopalan et al., "Sequence to Sequence -- Video to Text," 2015.
http://arxiv.org/abs/1505.00487

Jeff Donahue
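The encode-then-decode timestep schedule can be sketched in Python; the function name and tuple layout here are illustrative assumptions, not the paper's API:

```python
def s2vt_schedule(n_frames, caption):
    """Sketch the S2VT-style schedule: watch N frames, then emit M words.

    During encoding the LSTM reads frame features and emits nothing (None);
    during decoding it reads <BOS>/the previous word and emits the caption.
    Returns a list of (input, target) pairs, one per timestep.
    """
    steps = [(f"frame_{t}", None) for t in range(n_frames)]   # encoding phase
    prev = "<BOS>"
    for word in caption + ["<EOS>"]:                          # decoding phase
        steps.append((prev, word))
        prev = word
    return steps

print(s2vt_schedule(2, ["a", "man", "is", "talking"]))
```

The single list makes the N + M structure explicit: frame inputs with no targets, then the shifted next-word pairs familiar from the training slide.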

Page 27

Language model trained on image captions (notebook)

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015