Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Modeling Images, Videos and Text
Using the Caffe Deep Learning Library
(Part 1)
Kate Saenko
Microsoft Summer Machine Learning School, St Petersburg 2015
about me
BOSTON, Massachusetts
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
Machine Learning: What is it?
• Program a computer to learn from experience
• Learn from “big data”
Machine Learning: It is used in more ways than you think!
Computer Vision: Teach Machine to “See” Like a Human
Terminator 2
Hollywood version…
Terminator 2, Enemy of the State (from UCSD “Fact or Fiction” DVD)
Computer Vision: Teach Machine to “See” Like a Human
Computer Vision in Real Life:
Face Tagging in Social Media
Computer Vision in Real Life: Surveillance and Security
Computer Vision in Real Life:
Smart Cars
• Stanford/Google among the first to develop self-driving cars
• Cars “see” using many sensors: radar, laser, cameras
Computer Vision in Real Life:
Scientific Images
Computer Vision in Real Life:
Medical Imaging
Image-guided surgery (Grimson et al., MIT); 3D imaging: MRI, CT
slide by S. Seitz
Computer Vision in Real Life:
Robot Vision
http://www.robocup.org/
NASA’s Mars Spirit Rover (http://en.wikipedia.org/wiki/Spirit_rover)
slide by S. Seitz
Computer Vision in Real Life:
many other applications!
3D Shape Analysis
3D Face Reconstruction (http://grail.cs.washington.edu/projects/totalmoving/)
Handwriting Recognition
3D Panoramas
How Do We Do It?
Computer Vision: Machine Learning from Big Data
Artificial Neural Network
Support Vector Machine
Machine Learning from Big Data: Achievements
Artificial Neural Network
THE VISUAL DESCRIPTION PROBLEM
http://arxiv.org/abs/1411.4389
Image Description
Video Description
Output: A woman shredding chicken in a kitchen
Input video:
Social media analysis
A person dancing in a studio
Machine sharpening a pencil
Ballerina dancing on stage
Man playing guitar
Woman chopping onion
Train passing by Mt. Fuji
Petabytes of video, very little text
Social media: retrieval
Show me all video clips of a person playing guitar and singing
Social media: summarization
A car is driving down a road
A man is riding a bike through the woods
A skateboarder jumps and falls
Surveillance and Security
Smart camera alerts:
A woman wearing a red coat walked past
A woman carrying a large bag entered a building
Question answering
How many times did Darth Vader use a light saber?
Assistive technology
Descriptive Video Service (DVS)
Object/action detection not enough
• Does not model interaction
between entities and scene
• Does not model what is
important to say
• Natural language is much
richer
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
MODELING IMAGES
Dealing with uncertainty in “in-the-wild” YouTube video
Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell,
and Saenko. 2013. Youtube2text: Recognizing and describing arbitrary
activities using semantic hierarchies and zero-shot recognition. In IEEE
International Conference on Computer Vision (ICCV).
Template: A <SUBJECT> is <VERB>-ing a <OBJECT> .
Example: SUBJECT = person, VERB = ride, OBJECT = motorbike
→ “A person is riding a motorbike.”
A template model
OBJECT DETECTIONS
cow 0.11, person 0.42, table 0.07, aeroplane 0.05, dog 0.15, motorbike 0.51, train 0.17, car 0.29

SORTED OBJECT DETECTIONS
motorbike 0.51
person 0.42
car 0.29
aeroplane 0.05
…

VERB DETECTIONS
hold 0.23, drink 0.11, move 0.34, dance 0.05, slice 0.13, climb 0.17, shoot 0.07, ride 0.19

SORTED VERB DETECTIONS
move 0.34
hold 0.23
ride 0.19
dance 0.05
…
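The template step can be sketched in a few lines: pick the highest-scoring verb and object from the sorted detections and slot them into the sentence frame. This is only an illustrative sketch, not the paper's code; it greedily takes the single top verb (here "move"), whereas the full model searches over subject/verb/object combinations (e.g. preferring "ride" with "motorbike"), and the crude "-ing" inflection stands in for a real morphology step.

```python
def best(detections):
    # pick the label with the highest detector confidence
    return max(detections, key=detections.get)

# scores from the slide's sorted detection lists
objects = {"motorbike": 0.51, "person": 0.42, "car": 0.29, "aeroplane": 0.05}
verbs = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}

subject = "person"            # the subject slot is filled by a person here
verb, obj = best(verbs), best(objects)
# naive present-participle inflection (real systems use a morphology tool)
gerund = verb[:-1] + "ing" if verb.endswith("e") else verb + "ing"
sentence = f"A {subject} is {gerund} a {obj}."
print(sentence)   # -> "A person is moving a motorbike."
```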
Vision pipeline
Input video:
Output sentence: “Woman sharpens baby”
Problem: detection mistakes
SORTED VERB DETECTIONS
sharpen 0.34
cut 0.23
…

SORTED OBJECT DETECTIONS
woman 0.51
baby 0.42
…
Idea: trade off accuracy and specificity
Learn hierarchies from S, V, O co-occurrence: subjects and objects live under entity (person: woman, man, baby, …; tool: knife, …; animal, …), verbs under do (work: sharpen, clamp, …; play, …).
Most specific prediction: “Woman sharpens baby”
Our prediction: “Person working with a tool”
Humans: “Man clamps knife”, “Person clamps knife”, “Man working”
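The accuracy/specificity trade-off amounts to backing off up the hierarchy when the detector's confidence in the specific word is low. A toy sketch, where the hierarchy fragment, confidence values, threshold, and the way ancestors aggregate mass are all invented for illustration (the paper derives these from S, V, O co-occurrence statistics):

```python
# each word maps to its parent in a toy semantic hierarchy fragment
parent = {"baby": "person", "woman": "person", "man": "person",
          "person": "entity", "knife": "tool", "tool": "entity",
          "sharpen": "work", "clamp": "work", "work": "do"}

def back_off(word, confidence, threshold=0.5):
    """Return the most specific word we can trust: climb toward the
    root while the confidence stays below the threshold."""
    while confidence < threshold and word in parent:
        word = parent[word]
        confidence += 0.2   # toy rule: ancestors aggregate children's mass
    return word

print(back_off("baby", 0.15))   # -> "entity" (climbs past "person")
print(back_off("woman", 0.9))   # -> "woman" (confident, stays specific)
```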
Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)
Collected for paraphrase and machine translation: 2,089 YouTube videos with 122K multi-lingual descriptions.
Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
(a) Hollywood (8 actions), (b) TRECVID MED (6 actions), (c) YouTube (218 actions)
A train is rolling by. / A train passes by Mount Fuji. / A bullet train zooms through the countryside. / A train is coming down the tracks.
A man is sitting and playing a guitar / A man is playing guitar / Street artists play guitar. / A man is playing a guitar. / a lady is playing the guitar.
A woman is cooking onions. / Someone is cooking in a pan. / someone preparing something / a person coking. / racipe for katsu curry
A girl is ballet dancing. / A girl is dancing on a stage. / A girl is performing as a ballerina. / A woman dances.
We cluster words to obtain about 200 verbs and 300 nouns.
Video Collection Task
• Asked Amazon Mechanical Turk workers to submit video clips from YouTube
• Single, unambiguous action/event
• Short (4-10 seconds)
• Generally accessible
• No dialogue
• No words (subtitles, overlaid text, titles)
Generalization results on MSFT Youtube
Generalization results on MSFT Youtube
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
MODELING LANGUAGE
Modeling common sense
J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
Input video:
• Output sentence: “Woman sharpens baby”
• Common sense: babies cannot be sharpened
• Idea: learn common SVO statistics from very large text-only corpora
Problem: no common sense
Adding a linguistic prior using a Factor Graph Model (FGM)
Visual confidence values are observed (gray potentials) and inform sentence components.
Language potentials (dashed) connect latent words between sentence components.
Mining Text Corpora
Corpus | Size of text parsed
British National Corpus (BNC) | 1.5 GB
GigaWord | 26 GB
ukWaC | 5.5 GB
WaCkypedia_EN | 2.6 GB
GoogleNgrams | 10^12 words

Stanford dependency parses from the first 4 corpora are used to build the SVO language model.
The full language model used for surface realization is trained on GoogleNgrams using BerkeleyLM.
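Combining vision confidences with an SVO prior mined from text can be sketched as a weighted product of the two scores. The counts and weights below are toy values standing in for the parsed-corpus statistics, and the real system couples the S, V, and O choices through a factor graph rather than scoring whole triples independently:

```python
# toy SVO counts, as would be mined from dependency-parsed corpora
svo_counts = {("woman", "sharpen", "knife"): 120,
              ("woman", "sharpen", "baby"): 0,
              ("man", "cut", "onion"): 85}
total = sum(svo_counts.values())

vision = {("woman", "sharpen", "baby"): 0.40,   # vision prefers this triple
          ("woman", "sharpen", "knife"): 0.25}  # ...but language does not

def score(triple, alpha=0.5, eps=1e-6):
    """Geometric mixture of vision confidence and language prior;
    eps smooths unseen triples."""
    prior = (svo_counts.get(triple, 0) + eps) / total
    return (vision.get(triple, eps) ** alpha) * (prior ** (1 - alpha))

winner = max(vision, key=score)
print(winner)   # the language prior overrules vision
```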
Evaluation
● Subject, Verb, Object accuracy
o Compare to SVO extracted from ground truth sentences
o “A woman shredding chicken in a kitchen”
o Most Common SVO, or Any Valid SVO
Results of SVO prediction
HVC: highest vision confidence
FGM: factor graph model with language prior
Binary accuracy of “Most Common” SVO
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
Sequence-to-sequence
• Both input AND output are sequences
• So far we have assumed video and sentence are both fixed length
• What are good features? Can we learn them?
INTRO TO NEURAL NETWORKS
Neurons in the Brain
A neuron in the brain receives signals on its “input wires” (dendrites) and sends its response down an “output wire” (axon).

Artificial Neuron
Input → Multiply by weights → Sum → Threshold → Output
Example: weights (+4, +2, 0, +3, −2), input (0, −2, +4, 0, +2): weighted sum = −8, thresholded output = 0.

Artificial Neuron: Activation
An input pattern that matches the weights instead gives weighted sum = +8, and the neuron activates (output +8).
Neurons learn patterns!
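The multiply-sum-threshold computation above can be written in a few lines of Python. A minimal sketch (not from the slides): the weights and the first input are the example values shown, and the second, activating input is an invented pattern that happens to align with the positive weights.

```python
def neuron(inputs, weights, threshold=0.0):
    """Weighted sum followed by a hard threshold: pass the sum
    through if it exceeds the threshold, otherwise output 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return s if s > threshold else 0

weights = [4, 2, 0, 3, -2]
print(neuron([0, -2, 4, 0, 2], weights))   # weighted sum -8 -> prints 0
print(neuron([1, 1, 0, 1, 0], weights))    # sum 4+2+0+3 = 9 -> prints 9
```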
Artificial Neuron: Pattern Classification
• Classify input into class 0 or 1
• Teach neuron to predict correct class label
Example inputs such as (+4, +2, 0, −3, −2), whose values decrease, are shown with their class labels.
Artificial Neuron: Learning
Input → Multiply by weights → Sum → Threshold
Example: for the input (+4, +2, 0, −3, −2) with target class 1, the current weights give weighted sum −4 and thresholded activation 0, so activation ≠ class and the weights are adjusted (by +1). After one adjustment the sum is 0 and the activation is still 0; after another, the sum is +2, the activation is 1, and activation = class.
Artificial Neural Network
Input → Weights → Output
Simplify: from a single neuron to a network with an Input Layer, a Hidden Layer, and an Output Layer.
Deep Network: many hidden layers!

Units: inputs x1 … x5, hidden activations a1, a2, a3, outputs h1, h2, h3.
input: x = (x1, …, x5)
weights: Θ(1) (3×5 matrix, input → hidden), Θ(2) (3×3 matrix, hidden → output)
hidden layer activations: a = g(Θ(1) x)
output: hΘ(x) = g(Θ(2) a)
sigmoid: g(z) = 1 / (1 + exp(−z)), rising from 0 through 0.5 (at z = 0) toward 1

hΘ(x) = estimated probability that class = 1 for input x
hΘ(x) = 0.2 → predict class = 0
hΘ(x) = 0.8 → predict class = 1
(Layers 1 through 4: input layer, two hidden layers, output layer)
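The forward pass a = g(Θ(1) x), hΘ(x) = g(Θ(2) a) can be coded directly. A sketch with made-up weights; only the shapes match the slide (Θ(1) is 3×5, Θ(2) is 3×3):

```python
import math

def g(z):
    # sigmoid: 1 / (1 + exp(-z)), in (0, 1), with g(0) = 0.5
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, vec):
    # multiply by one weight matrix, then apply the sigmoid elementwise
    return [g(sum(t * v for t, v in zip(row, vec))) for row in theta]

x = [1.0, 0.5, -0.5, 0.0, 2.0]                                   # 5 inputs
theta1 = [[0.1 * (i - j) for j in range(5)] for i in range(3)]   # 3x5
theta2 = [[0.2] * 3 for _ in range(3)]                           # 3x3
a = layer(theta1, x)   # hidden activations: 3 values in (0, 1)
h = layer(theta2, a)   # outputs: estimated probabilities of class = 1
```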
Network architectures
Convolutional: feedforward connectivity from input through hidden layers to output, with local (convolutional) connections.
Recurrent: the hidden state carries information forward in time (input → hidden → output at each time step).
Representing Images
Input Layer: reshape the image pixels into a vector.

Convolutional Neural Network
Instead of full connections, convolve the input with a small weight filter (w11 … w33) and threshold, e.g. the 3×3 edge filter
1 0 −1
1 0 −1
1 0 −1
Input Layer → Output Layer
slide by Abi-Roozgard
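Convolving an image with the 3×3 vertical-edge filter above can be sketched in plain Python, with no framework. The tiny "image" is made up; as in most deep-learning libraries, the sliding-window operation is really cross-correlation:

```python
def conv2d(img, kernel):
    """Valid-mode 2D convolution (cross-correlation): slide the
    kernel over the image and take weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[di][dj] * img[i + di][j + dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

edge_filter = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]
# a dark-to-bright vertical edge: the filter responds strongly everywhere
img = [[0, 0, 9, 9]] * 4
print(conv2d(img, edge_filter))   # -> [[-27, -27], [-27, -27]]
```

A flat region would instead give zero response, which is exactly why this filter acts as a vertical-edge detector.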
Convolutional Neural Network
LeNet
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Rich visual structure of features deep in hierarchy.
[R-CNN]
[Zeiler-Fergus]
Maximal activations of pool5 units
conv5 DeConv visualization
VIDEO-TO-TEXT NEURAL NETWORK
Deep Convolutional-Recurrent Network
Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
Deep Convolutional Neural Networks
• Learn robust high-level image representations
• achieve state-of-the-art results on image tasks
• But, do not handle sequences
• Idea: combine with Recurrent Neural Network
Recurrent and Convolutional architectures: a convolutional network processes each input, while a recurrent network carries a hidden state forward in time (input → hidden → output at each time step).
Contributions
● End-to-end deep video description
o deep image and language model
● Leverage still image caption data
Recurrent neural networks
• Distributed hidden state stores past information efficiently
• Non-linear dynamics
• With enough neurons and time, can compute any function
(input → hidden → output, unrolled over time; based on slide by Geoff Hinton)
Model
1. Extract deep features from each input frame (CNN)
2. Create a fixed-length vector representation of the video
3. Decode the vector to a sentence (RNN)

Background - LSTM Unit
● Easier to train than a plain RNN
● Impressive results for
○ speech recognition
○ handwriting recognition
○ translation
● Our model
o 2 layers of LSTM units (hidden state of first is input to second)
o Output: softmax probability distribution over vocabulary of words
Input Video → Convolutional Net → Recurrent Net → Output
A CNN is applied to each sampled frame; the frame features are mean-pooled into one vector, which the two-layer LSTM stack decodes word by word: “A boy is playing a ball”.
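The mean-pooling step, collapsing per-frame CNN features into one fixed-length video vector, is just an elementwise average; that is what makes the representation independent of the number of frames. A sketch with made-up feature values:

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one fixed-length
    video representation, whatever the number of frames."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# e.g. three frames, each described by a 4-dim CNN feature
frames = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 0.0, 2.0, 0.0],
          [2.0, 3.0, 2.0, 2.0]]
video_vec = mean_pool(frames)   # -> [2.0, 1.0, 2.0, 2.0]
```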
Evaluation
● Subject, Verb, Object accuracy
o SVO extracted from the generated sentence
o “A woman shredding chicken in a kitchen”
o Most Common or Any Valid
● BLEU
● METEOR
● Human Evaluation
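At its core, BLEU compares n-grams of the generated sentence against reference sentences, with a penalty for overly short outputs. A stripped-down unigram version (BLEU-1 with brevity penalty) as a sketch; the full metric combines several n-gram orders and multiple references, and the example sentences are illustrative:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram precision, clipped by reference counts,
    times a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

hyp = "a woman is shredding chicken"
ref = "a woman shredding chicken in a kitchen"
score = bleu1(hyp, ref)   # 4 of 5 unigrams match, scaled by the penalty
```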
Results - SVO (Binary, Most Common)
Results - Generation
Model | BLEU | METEOR
FGM | 13.68 | 23.9
LSTM-Flickr | 10.29 | 19.52
LSTM-COCO | 12.66 | 20.96
LSTM-YT | 31.19 | 26.87
LSTM-YT-Flickr | 32.03 | 27.87
LSTM-YT-COCO | 33.29 | 29.07
LSTM-YT-COCO+Flickr | 33.29 | 28.88
More fluent, but not enough training data in the YouTube dataset to train a good language model.

Idea: pre-learn on still image captions
Dataset | Train | Validation | Test
Flickr30k | ~28000 | 1000 | 1000
COCO2014 | 82783 | 40504 | -
YouTube | 1200 | 100 | 670
Pre-Train on Still Images (Flickr30k and COCO2014)
Input Image → Convolutional Net → Recurrent Net (stacked LSTMs) → Output: “A man is scaling a cliff”
Fine-Tune on Videos
Input Video → Convolutional Net (per-frame CNN features, mean-pooled) → Recurrent Net (stacked LSTMs) → Output: “A boy is playing a ball”
Results - SVO (Binary, Most Common)
See also Guadarrama et al. ICCV 2013
Results - Generation
Model | BLEU | METEOR
FGM | 13.68 | 23.9
LSTM-Flickr | 10.29 | 19.52
LSTM-COCO | 12.66 | 20.96
LSTM-YT | 31.19 | 26.87
LSTM-YT-Flickr | 32.03 | 27.87
LSTM-YT-COCO | 33.29 | 29.07
LSTM-YT-COCO+Flickr | 33.29 | 28.88
Results - Human Eval.
Model | Relevance | Grammar
FGM | 3.20 | 3.99
LSTM-YT | 2.88 | 3.84
LSTM-YT-COCO | 2.83 | 3.46
LSTM-YT-COCO+Flickr | - | 3.64
GroundTruth | 1.10 | 4.61
Generated image captions
http://arxiv.org/abs/1411.4389
Sequence-to-Sequence Video-to-Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko; arXiv 2015
Sequence to Sequence Video to Text
MSVD dataset (YouTube videos)
Results
Movie description datasets
Qualitative results
Figure 3. MSVD YouTube video dataset. We present examples where the S2VT model (RGB on VGG net) generates correct descriptions involving different objects and actions for several videos (in column a). The center column (b) shows examples where the model predicts relevant but incorrect descriptions. The last column (c) shows examples where the model generates descriptions that are irrelevant to the event in the video.
Figure 4. M-VAD Movie corpus: We show a representative frame from 6 contiguous clips from the movie “Big Mommas: Like Father, Like Son”. Soft Attention (GNet + 3D-Conv) are sentences from the model in [40]. S2VT (MPII + M-VAD) represents the sentences generated by our model trained on both the MPII and M-VAD datasets. DVS represents the original ground truth sentences in the dataset for each clip.
Results
Qualitative results on Hollywood Movies
Thanks
Subhashini Venugopalan (UT Austin), Huijuan Xu (UMass Lowell), Jeff Donahue (UC Berkeley), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Trevor Darrell (UC Berkeley)
References
[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[2] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV), 2013.
[3] Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL 2015.
[4] Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
[5] Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence-to-Sequence Video-to-Text. arXiv, 2015.
Sergio Guadarrama