Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)


Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Page 1: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Modeling Images, Videos and Text

Using the Caffe Deep Learning Library

(Part 1)

Kate Saenko

Microsoft Summer Machine Learning School, St Petersburg 2015

Page 2: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

about me

BOSTON, Massachusetts

Page 3: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 4: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 5: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Machine Learning: What is it?

• Program a computer to learn from experience

• Learn from “big data”

Page 6: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Machine Learning: It is used in more ways than you think!

Page 7: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision: Teach Machine to “See” Like a Human

Page 8: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Terminator 2

Hollywood version…

Terminator 2, Enemy of the State (from UCSD “Fact or Fiction” DVD)

Computer Vision: Teach Machine to “See” Like a Human

Page 9: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision in Real Life:

Face Tagging in Social Media

Page 10: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision in Real Life: Surveillance and Security

Page 11: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

• Stanford/Google were among the first to develop self-driving cars

• Cars “see” using many sensors: radar, laser, cameras

Computer Vision in Real Life:

Smart Cars

Page 12: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision in Real Life:

Scientific Images

Page 13: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Image guided surgery

Grimson et al., MIT 3D imaging

MRI, CT

slide by S. Seitz

Computer Vision in Real Life:

Medical Imaging

Page 14: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

http://www.robocup.org/

NASA’s Mars Spirit Rover

http://en.wikipedia.org/wiki/Spirit_rover

slide by S. Seitz

Computer Vision in Real Life:

Robot Vision

Page 15: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision in Real Life: many other applications!

• 3D Shape Analysis
• 3D Face Reconstruction (http://grail.cs.washington.edu/projects/totalmoving/)
• Handwriting Recognition
• 3D Panoramas

Page 16: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

How Do We Do It?

Page 17: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Computer Vision: Machine Learning from Big Data

Artificial Neural Network

Support Vector Machine

Page 18: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Machine Learning from Big Data: Achievements

Artificial Neural Network

Page 19: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 20: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

http://arxiv.org/abs/1411.4389

Image Description

Page 21: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Video Description

Output: A woman shredding chicken in a kitchen

Input video:

Page 22: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Social media analysis

• A person dancing in a studio
• Machine sharpening a pencil
• Ballerina dancing on stage
• Man playing guitar
• Woman chopping onion
• Train passing by Mt. Fuji

Petabytes of video, very little text

Page 24: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Social media: summarization

• A car is driving down a road
• A man is riding a bike through the woods
• A skateboarder jumps and falls

Page 25: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Surveillance and Security

Smart camera alerts:

• A woman wearing a red coat walked past
• A woman carrying a large bag entered a building

Page 26: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Question answering

How many times did Darth Vader use a light saber?

Page 27: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Assistive technology

Descriptive Video Service (DVS)

Page 28: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Object/action detection not enough

• Does not model interaction between entities and scene
• Does not model what is important to say
• Natural language is much richer

Page 29: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Challenges

Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training

Page 30: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 31: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Dealing with uncertainty in “in-the-wild” YouTube video

Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell, and Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV).

Page 32: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

A template model

Template: A <SUBJECT> is <VERB>-ing a <OBJECT> .
Filled with SUBJECT = person, VERB = ride, OBJECT = motorbike:
“A person is riding a motorbike.”
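To make the template step concrete, here is a minimal sketch (not the original code) that fills the SUBJECT/VERB/OBJECT slots into the sentence template; the crude -ing inflection helper is an assumption for illustration.

```python
# Minimal sketch of SVO template realization (illustrative only, not the original pipeline).

def present_participle(verb: str) -> str:
    """Very rough -ing inflection; a real system would use a morphology tool."""
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"      # ride -> riding
    return verb + "ing"               # play -> playing

def realize(subject: str, verb: str, obj: str) -> str:
    """Fill the fixed template 'A <S> is <V>ing a <O>.'"""
    return f"A {subject} is {present_participle(verb)} a {obj}."

print(realize("person", "ride", "motorbike"))  # A person is riding a motorbike.
```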

Page 33: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

OBJECT DETECTIONS

cow 0.11
person 0.42
table 0.07
aeroplane 0.05
dog 0.15
motorbike 0.51
train 0.17
car 0.29

Page 34: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

SORTED OBJECT DETECTIONS

motorbike 0.51

person 0.42

car 0.29

aeroplane 0.05

… …

Page 35: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

VERB DETECTIONS

hold 0.23
drink 0.11
move 0.34
dance 0.05
slice 0.13
climb 0.17
shoot 0.07
ride 0.19

Page 36: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

SORTED VERB DETECTIONS

move 0.34

hold 0.23

ride 0.19

dance 0.05

… …

Page 37: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

SORTED VERB DETECTIONS

move 0.34

hold 0.23

ride 0.19

dance 0.05

… …

motorbike 0.51

person 0.42

car 0.29

aeroplane 0.05

… …

SORTED OBJECT DETECTIONS
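A minimal sketch (mine, not the original pipeline) of what these slides illustrate: sort the detection confidences and take the top-scoring object and verb for the template; how subject and object roles are then assigned is not shown here. Picking raw argmaxes like this is exactly what produces mistakes such as “Woman sharpens baby” on a later slide.

```python
# Illustrative sketch: pick the most confident object and verb (scores copied from the slides).
object_scores = {"motorbike": 0.51, "person": 0.42, "car": 0.29, "aeroplane": 0.05}
verb_scores   = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}

best_object = max(object_scores, key=object_scores.get)   # 'motorbike'
best_verb   = max(verb_scores, key=verb_scores.get)        # 'move'
print(best_object, best_verb)
```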

Page 38: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Vision pipeline

Page 39: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Problem: detection mistakes

Input video:

SORTED VERB DETECTIONS
sharpen 0.34
cut 0.23
… …

SORTED OBJECT DETECTIONS
woman 0.51
baby 0.42
… …

Output sentence: “Woman sharpens baby”

Page 40: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Idea: trade off accuracy and specificity

Learn hierarchies from S, V, O co-occurrence

Page 41: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

[Diagram: the input video is mapped into learned semantic hierarchies over subjects, verbs and objects, e.g. entity → being → person / animal → woman, man, baby, …; do → work / play → sharpen, clamp, …; entity → person / tool → knife, …]

Most specific prediction: “Woman sharpens baby”

Human descriptions (Subject Verb Object): “Man clamps knife”, “Person clamps knife”, “Man working”

Page 42: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

[Same semantic hierarchies as above]

Human descriptions (Subject Verb Object): “Man clamps knife”, “Person clamps knife”, “Man working”

Our prediction: “Person working with a tool”

Page 43: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)

Collected for paraphrase and machine translation: 2,089 YouTube videos with 122K multi-lingual descriptions.

Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/

Page 44: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Microsoft YouTube Dataset (Chen & Dolan, ACL 2011): (a) Hollywood (8 actions), (b) TRECVID MED (6 actions), (c) YouTube (218 actions)

Example descriptions for individual clips:
• A train is rolling by. A train passes by Mount Fuji. A bullet train zooms through the countryside. A train is coming down the tracks.
• A man is sitting and playing a guitar. A man is playing guitar. Street artists play guitar. A man is playing a guitar. a lady is playing the guitar.
• A woman is cooking onions. Someone is cooking in a pan. someone preparing something. a person coking. racipe for katsu curry
• A girl is ballet dancing. A girl is dancing on a stage. A girl is performing as a ballerina. A woman dances.

We cluster words to obtain about 200 verbs and 300 nouns.

Page 45: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Video Collection Task

• Asked Amazon Mechanical Turk workers to submit video clips from YouTube

• Single, unambiguous action/event

• Short (4-10 seconds)

• Generally accessible

• No dialogue

• No words (subtitles, overlaid text, titles)

Page 46: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Generalization results on MSFT Youtube

Page 47: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Generalization results on MSFT Youtube

Page 48: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Challenges

Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training

Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? Can an onion play guitar?

Page 49: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 50: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Modeling common sense

J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.

Page 51: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Problem: no common sense

Input video:

• Output sentence: “Woman sharpens baby”
• Common sense: babies cannot be sharpened
• Idea: learn common SVO statistics from very large text-only corpora

Page 52: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Adding a linguistic prior using a Factor Graph Model (FGM)

Visual confidence values are observed (gray potentials) and inform sentence components.

Language potentials (dashed) connect latent words between sentence components.
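The FGM itself is a factor graph over the sentence components; as a rough illustration of the underlying idea only (not the actual model), the sketch below re-ranks candidate SVO triples by combining visual confidences with a text-derived prior. All numbers and the simple linear weighting are invented for the example.

```python
import itertools

# Visual detection confidences (loosely based on the earlier example slide).
subj_conf = {"woman": 0.51, "baby": 0.42}
verb_conf = {"sharpen": 0.34, "cut": 0.23}
obj_conf  = {"baby": 0.42, "knife": 0.20, "onion": 0.15}

# Illustrative SVO prior estimated from large text corpora
# (how plausible the triple is according to what people actually write).
svo_prior = {
    ("woman", "cut", "onion"): 0.30,
    ("woman", "sharpen", "knife"): 0.25,
    ("woman", "sharpen", "baby"): 0.0001,   # common sense: essentially never written
}

def score(s, v, o, alpha=0.5):
    vision = subj_conf[s] * verb_conf[v] * obj_conf[o]
    language = svo_prior.get((s, v, o), 1e-6)
    return alpha * vision + (1 - alpha) * language  # simple linear trade-off

best = max(itertools.product(subj_conf, verb_conf, obj_conf), key=lambda t: score(*t))
print(best)  # the language prior pushes the choice away from ('woman', 'sharpen', 'baby')
```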

Page 53: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Mining Text Corpora

Corpus                          Size of text parsed
British National Corpus (BNC)   1.5 GB
GigaWord                        26 GB
ukWaC                           5.5 GB
WaCkypedia_EN                   2.6 GB
GoogleNgrams                    10^12 words

Stanford dependency parses from the first 4 corpora are used to build the SVO language model.

The full language model used for surface realization is trained on GoogleNgrams with BerkeleyLM.

Page 54: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Evaluation

● Subject, Verb, Object accuracy
  o Compare to SVO extracted from ground-truth sentences
  o e.g. “A woman [S] shredding [V] chicken [O] in a kitchen”
  o Most Common SVO, or Any Valid SVO
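A small sketch of the binary SVO accuracy described above (my illustration; the exact matching rules used in the papers may differ): compare each predicted slot with the slot extracted from the reference description.

```python
def svo_accuracy(predictions, references):
    """predictions / references: lists of (subject, verb, object) triples, aligned per video."""
    correct = {"S": 0, "V": 0, "O": 0}
    for pred, ref in zip(predictions, references):
        for slot, p, r in zip("SVO", pred, ref):
            correct[slot] += int(p == r)
    n = len(predictions)
    return {slot: c / n for slot, c in correct.items()}

preds = [("woman", "sharpen", "baby"), ("person", "ride", "motorbike")]
refs  = [("woman", "shred", "chicken"), ("person", "ride", "motorbike")]
print(svo_accuracy(preds, refs))  # {'S': 1.0, 'V': 0.5, 'O': 0.5}
```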

Page 55: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results of SVO prediction

HVC: highest vision confidence

FGM: factor graph with language prior

Binary accuracy of “Most Common” SVO

Page 56: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Challenges

Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training

Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? Can an onion play guitar?

Sequence-to-sequence
• Both input AND output are sequences
• So far we have assumed video and sentence are both fixed length
• What are good features? Can we learn them?

Page 57: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 58: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Neurons in the Brain

Artificial Neural Network

Page 59: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Neuron in the brain

“Input wire”

“Output wire”

Page 60: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron

[Diagram: Input → Multiply by weights → Sum → Threshold → Output, with example weights 0, −2, +4, 0, +2]

Page 61: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Activation

[Example: inputs (+4, +2, 0, +3, −2) multiplied by the weights (0, −2, +4, 0, +2) and summed give −8; the threshold turns this into an output of 0]

Page 62: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Activation

[Example 1: inputs (+4, +2, 0, +3, −2) · weights (0, −2, +4, 0, +2) → sum −8 → output 0]
[Example 2: inputs (0, −2, 0, +2, +2) · weights (0, −2, +4, 0, +2) → sum +8 → output +8]

Neurons learn patterns!
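A minimal sketch of the neuron in these slides, using the numbers shown above; treating the threshold as zero and passing the raw sum through when it exceeds the threshold are assumptions based on the figures.

```python
def neuron(inputs, weights, threshold=0.0):
    """Weighted sum followed by a hard threshold: pass the activation if it exceeds the threshold, else 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return s if s > threshold else 0

weights = [0, -2, +4, 0, +2]
print(neuron([+4, +2, 0, +3, -2], weights))  # sum = -8 -> output 0
print(neuron([0, -2, 0, +2, +2], weights))   # sum = +8 -> output 8 (input matches the weight pattern)
```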

Page 63: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Pattern Classification

Example: input patterns such as (+4, +2, 0, −3, −2), each labeled class 1 or class 0.

• Classify input into class 0 or 1
• Teach neuron to predict correct class label

Page 64: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Learning

[Inputs (+4, +2, 0, −3, −2) are multiplied by the weights, summed and thresholded; the activation is 0 but the desired class is 1, so the weights must be adjusted]

Page 65: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Learning

[The weights are adjusted (+1 on the relevant connections); the sum is now 0, still not above threshold for the desired class 1, so the weights are adjusted again]

Page 66: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neuron: Learning

[After the weight updates the sum is +2, which crosses the threshold: the neuron now predicts class 1, matching the desired class]
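A compact sketch of the learning loop on pages 64-66, written as a perceptron-style update (my reading of the slides rather than their literal numbers): when the thresholded prediction disagrees with the desired class, nudge each weight in the direction of the input.

```python
def predict(inputs, weights, threshold=0.0):
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0

def train(inputs, target, weights, lr=1.0):
    """One perceptron update: move weights toward the inputs when the prediction is too low, away when too high."""
    error = target - predict(inputs, weights)      # -1, 0, or +1
    return [w + lr * error * x for w, x in zip(weights, inputs)]

inputs, target = [+4, +2, 0, -3, -2], 1
weights = [0, -2, +4, 0, +2]
for step in range(5):                              # repeat until the neuron predicts the desired class
    if predict(inputs, weights) == target:
        break
    weights = train(inputs, target, weights)
print(weights, predict(inputs, weights))           # the final prediction matches the target class
```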

Page 67: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neural Network

[Simplify the drawing: inputs, weights, output]

Page 68: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neural Network

[From a single neuron to a neural network: Input Layer → Hidden Layer → Output Layer]

Deep Network: many hidden layers!

Page 69: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neural Network: Input Layer, Hidden Layer, Output Layer

[Input units x1…x5, hidden units a1, a2, a3, output units h1, h2, h3]

input: x = (x1, …, x5)
weights: Θ(1) = [θ11 … θ15; … ; θ31 … θ35], Θ(2) = [θ11 … θ13; … ; θ31 … θ33]
hidden layer activations: a = g(Θ(1) x)
output: hΘ(x) = g(Θ(2) a)
sigmoid: g(z) = 1 / (1 + exp(−z)), which ranges from 0 to 1 and equals 0.5 at z = 0

Page 70: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Artificial Neural Network

input: x = (x1, …, x5), weights Θ(1) and Θ(2) as before
hidden layer activations: a = g(Θ(1) x)
output: hΘ(x) = g(Θ(2) a), with g(z) = 1 / (1 + exp(−z))

hΘ(x) = estimated probability that class = 1 for input x
e.g. hΘ(x) = 0.2 → predict class = 0; hΘ(x) = 0.8 → predict class = 1
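A runnable transcription of the two-layer forward pass above, with random weights standing in for the learned Θ(1) and Θ(2):

```python
import numpy as np

def g(z):
    """Sigmoid non-linearity g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 5))   # hidden layer weights (3 hidden units, 5 inputs)
theta2 = rng.normal(size=(3, 3))   # output layer weights (3 outputs, 3 hidden units)

x = np.array([0.5, -1.0, 2.0, 0.0, 1.5])   # input x1..x5
a = g(theta1 @ x)                           # hidden activations  a = g(Theta1 x)
h = g(theta2 @ a)                           # outputs             h = g(Theta2 a)

print(h)                                    # each entry lies in (0, 1); threshold at 0.5 to predict a class
```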

Page 71: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Network architectures

• Feed-forward: Layer 1 → Layer 2 → Layer 3 → Layer 4
• Convolutional
• Recurrent: input → hidden → output at each time step, with the hidden state carried forward through time

Page 72: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Representing Images

Input Layer: reshape the image pixels into a vector

Page 73: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Convolutional Neural Network

Convolve the Input Layer with a filter, e.g.

1  0  -1
1  0  -1
1  0  -1

then threshold to obtain the Output Layer.

slide by Abi-Roozgard

Page 74: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Convolutional Neural Network

The same filter weights

w13 w12 w11
w23 w22 w21
w33 w32 w31

are applied at every position of the Input Layer (convolve with the weights, then threshold) to produce the Output Layer.

slide by Abi-Roozgard
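To make the operation concrete, here is a small sketch of a 2-D convolution with the 3×3 vertical-edge filter from the slide, followed by a simple threshold; the toy image and the choice of a zero threshold are my assumptions.

```python
import numpy as np

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # responds to vertical edges (bright left / dark right)

image = np.zeros((6, 6))
image[:, :3] = 1.0                        # toy image: bright left half, dark right half

def conv2d(img, k):
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)   # correlate each patch with the filter
    return out

response = np.maximum(conv2d(image, kernel), 0)           # threshold: keep positive responses
print(response)                                           # strongest along the bright-to-dark boundary
```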

Page 75: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Convolutional Neural Network

LeNet

Page 76: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Why Deep Learning? The Unreasonable Effectiveness of Deep Features

Rich visual structure of features deep in hierarchy.

[R-CNN]

[Zeiler-Fergus]

Maximal activations of pool5 units

conv5 DeConv visualization

Page 77: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

Page 78: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Deep Convolutional-Recurrent Network

Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.

Page 79: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Deep Convolutional Neural Networks

• Learn robust high-level image representations

• achieve state-of-the-art results on image tasks

• But, do not handle sequences

• Idea: combine with Recurrent Neural Network

[Diagrams: a convolutional network (input → hidden → output) and a recurrent network unrolled over time (input → hidden → output at each time step)]

Page 80: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Contributions

● End-to-end deep video description
  o deep image and language model
● Leverage still image caption data

Page 81: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Recurrent neural networks

• Distributed hidden state stores past information efficiently
• Non-linear dynamics
• With enough neurons and time, can compute any function

[Diagram: RNN unrolled over time; at each step, input → hidden → output, with the hidden state carried forward]

based on slide by Geoff Hinton

Page 82: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Model

1. Extract deep features from each input frame (CNN)
2. Create a fixed-length vector representation of the video
3. Decode the vector to a sentence (RNN)
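A sketch of steps 1-2 (illustrative only; the original work extracts CNN features with Caffe, and the placeholder feature function below merely stands in for that): run a CNN over every frame and mean-pool the per-frame features into one fixed-length video vector.

```python
import numpy as np

def cnn_features(frame):
    """Placeholder for a CNN forward pass (e.g. an fc7-style activation vector).
    Returns a deterministic random 4096-d vector so the sketch stays self-contained."""
    rng = np.random.default_rng(abs(hash(frame.tobytes())) % (2**32))
    return rng.normal(size=4096)

# A toy "video": 30 frames of 224x224 RGB.
frames = [np.random.rand(224, 224, 3) for _ in range(30)]

per_frame = np.stack([cnn_features(f) for f in frames])   # step 1: one feature vector per frame
video_vector = per_frame.mean(axis=0)                      # step 2: mean pooling -> fixed-length vector

print(video_vector.shape)   # (4096,) regardless of how many frames the video has
```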

Page 83: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Background - LSTM Unit

● Easier to train than a plain RNN
● Impressive results for
  o speech recognition
  o handwriting recognition
  o translation

● Our model
  o 2 layers of LSTM units (hidden state of the first is input to the second)
  o Output: softmax probability distribution over the vocabulary of words
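A minimal sketch of the decoder described here, written in PyTorch purely for illustration (the original models are built in Caffe, and the layer sizes below are assumptions): a 2-layer LSTM that consumes the video vector together with the previous word embedding and produces a softmax over the vocabulary at each step.

```python
import torch
import torch.nn as nn

class VideoCaptionDecoder(nn.Module):
    """Sketch of a 2-layer LSTM decoder over a fixed-length video vector (illustrative, not the original model)."""
    def __init__(self, vocab_size=1000, video_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(video_dim + embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # scores over the word vocabulary

    def forward(self, video_vector, word_ids):
        # Concatenate the (repeated) video vector with the embedding of the previous word at every step.
        words = self.embed(word_ids)                                        # (batch, T, embed_dim)
        video = video_vector.unsqueeze(1).expand(-1, words.size(1), -1)     # (batch, T, video_dim)
        h, _ = self.lstm(torch.cat([video, words], dim=-1))
        return torch.softmax(self.out(h), dim=-1)                           # probability of each next word

decoder = VideoCaptionDecoder()
probs = decoder(torch.randn(2, 4096), torch.randint(0, 1000, (2, 5)))
print(probs.shape)   # (2, 5, 1000): a distribution over words at each of the 5 steps
```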

Page 84: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Input Video → Convolutional Net → Recurrent Net → Output

[Diagram: a CNN is applied to each input frame, the frame features are mean-pooled, and stacked LSTMs decode the pooled vector into the sentence “A boy is playing a ball”, one word per step]

Page 85: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Evaluation

● Subject, Verb, Object accuracy
  o SVO extracted from the generated sentence
  o e.g. “A woman [S] shredding [V] chicken [O] in a kitchen”
  o Most Common or Any Valid
● BLEU
● METEOR
● Human Evaluation
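For the sentence-level metrics, here is a minimal example of scoring a generated caption against reference descriptions with NLTK's BLEU implementation (an assumption about tooling; the papers use the standard BLEU and METEOR evaluation scripts):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a woman is shredding chicken in a kitchen".split(),
    "someone is cooking in a pan".split(),
]
candidate = "a woman is cooking chicken".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu(references, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```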

Page 86: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results - SVO (Binary, Most Common)

Page 87: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results - Generation

Model                 BLEU    METEOR
FGM                   13.68   23.9
LSTMFlickr            10.29   19.52
LSTMCOCO              12.66   20.96
LSTM-YT               31.19   26.87
LSTM-YTFlickr         32.03   27.87
LSTM-YTCOCO           33.29   29.07
LSTM-YTCOCO+Flickr    33.29   28.88

More fluent, but not enough training data in the YouTube dataset to train a good language model.

Page 88: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Idea: pre-learn on still image captions

Dataset Train Validation Test

Flickr30k ~28000 1000 1000

COCO2014 82783 40504 -

YouTube 1200 100 670

Page 89: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Pre-Train on Still Images (Flickr30k and COCO2014)

Input Image → Convolutional Net → Recurrent Net → Output

[Diagram: a CNN encodes the image and stacked LSTMs generate the caption “A man is scaling a cliff” word by word]

Page 90: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

[Diagram: Input Image → Convolutional Net → Recurrent Net → Output]

Page 91: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Fine-Tune on Videos

Input Video → Convolutional Net → Recurrent Net → Output

[Diagram: a CNN is applied to each frame, the frame features are mean-pooled, and the stacked LSTMs decode the pooled vector into “A boy is playing a ball”]

Page 92: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results - SVO (Binary, Most Common)

See also Guadarrama et al. ICCV 2013

Page 93: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results - Generation

Model BLEU METEOR

FGM 13.68 23.9

LSTMFlickr 10.29 19.52

LSTMCOCO 12.66 20.96

LSTM-YT 31.19 26.87

LSTM-YTFlickr 32.03 27.87

LSTM-YTCOCO 33.29 29.07

LSTM-YTCOCO+Flickr 33.29 28.88

Page 94: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results - Human Eval.

Model Relevance Grammar

FGM 3.20 3.99

LSTM-YT 2.88 3.84

LSTM-YTCOCO 2.83 3.46

LSTM-YTCOCO+Flickr - 3.64

GroundTruth 1.10 4.61

Page 95: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Generated image captions

http://arxiv.org/abs/1411.4389

Page 96: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

http://arxiv.org/abs/1411.4389

Generated image captions

Page 97: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Sequence-to-Sequence Video-to-Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko. arXiv 2015.

Page 98: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Sequence to Sequence Video to Text

LSTM


Page 99: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results

MSVD dataset (YouTube videos)

Movie description datasets

Page 100: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Qualitative results


Figure 3. MSVD YouTube video dataset. We present examples where the S2VT model (RGB on VGG net) generates correct descriptions involving different objects and actions for several videos (in column a). The center column (b) shows examples where the model predicts relevant but incorrect descriptions. The last column (c) shows examples where the model generates descriptions that are irrelevant to the event in the video.

Figure 4. M-VAD movie corpus: We show a representative frame from 6 contiguous clips from the movie “Big Mommas: Like Father, Like Son”. Soft Attention (GNet + 3D-Conv) are sentences from the model in [40]. S2VT (MPII + M-VAD) represents the sentences generated by our model trained on both the MPII and M-VAD datasets. DVS represents the original ground-truth sentences in the dataset for each clip.


Page 101: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results

Page 102: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results

Page 103: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Results

Page 104: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Qualitative results on Hollywood Movies

[Figure: clips (1)-(6b) with generated descriptions]

Page 105: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Page 106: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Page 107: Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Thanks

Subhashini Venugopalan (UT Austin), Huijuan Xu (UMass Lowell), Jeff Donahue (UC Berkeley), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Trevor Darrell (UC Berkeley), Sergio Guadarrama (Google)

References

[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.

[2] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV), 2013.

[3] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. NAACL 2015.

[4] Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. CVPR 2015.

[5] Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence-to-Sequence Video-to-Text. arXiv, 2015.