Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)
Modeling Images, Videos and Text
Using the Caffe Deep Learning Library
(Part 1)
Kate Saenko
Microsoft Summer Machine Learning School, St Petersburg 2015
about me
BOSTON, Massachusetts
PART I
INTRODUCTION
THE VISUAL DESCRIPTION PROBLEM
MODELING IMAGES
MODELING LANGUAGE
INTRO TO NEURAL NETWORKS
VIDEO-TO-TEXT NEURAL NETWORK
PART II
INTRO TO CAFFE
CAFFE IMAGE AND LANGUAGE MODELS
Machine Learning: What is it?
• Program a computer to learn from experience
• Learn from “big data”
Machine Learning: It is used in more ways than you think!
Computer Vision: Teach Machine to “See” Like a Human
Terminator 2
Hollywood version…
Terminator 2, Enemy of the State (from UCSD “Fact or Fiction” DVD)
Computer Vision: Teach Machine to “See” Like a Human
Computer Vision in Real Life:
Face Tagging in Social Media
Computer Vision in Real Life: Surveillance and Security
Computer Vision in Real Life:
Smart Cars
• Stanford/Google among the first to develop self-driving cars
• Cars “see” using many sensors: radar, laser, cameras
Computer Vision in Real Life:
Scientific Images
Computer Vision in Real Life:
Medical Imaging
Image-guided surgery (Grimson et al., MIT); 3D imaging: MRI, CT
slide by S. Seitz
Computer Vision in Real Life:
Robot Vision
http://www.robocup.org/
NASA’s Mars Spirit Rover (http://en.wikipedia.org/wiki/Spirit_rover)
slide by S. Seitz
Computer Vision in Real Life:
many other applications!
3D Shape Analysis
3D Face Reconstruction (http://grail.cs.washington.edu/projects/totalmoving/)
Handwriting Recognition
3D Panoramas
How Do We Do It?
Computer Vision: Machine Learning from Big Data
Artificial Neural Network
Support Vector Machine
Machine Learning from Big Data: Achievements
Artificial Neural Network
THE VISUAL DESCRIPTION PROBLEM
http://arxiv.org/abs/1411.4389
Image Description
Video Description
Output: A woman shredding chicken in a kitchen
Input video:
Social media analysis
A person dancing in a studio
Machine sharpening a pencil
Ballerina dancing on stage
Man playing guitar
Woman chopping onion
Train passing by Mt. Fuji
Petabytes of video, very little text
Social media: retrieval
Show me all video clips of a person playing guitar and singing
Social media: summarization
A car is driving down a road
A man is riding a bike through the woods
A skateboarder jumps and falls
Surveillance and Security
Smart camera alerts:
A woman wearing a red coat walked past
A woman carrying a large bag entered a building
Question answering
How many times did Darth Vader use a light saber?
Assistive technology
Descriptive Video Service (DVS)
Object/action detection not enough
• Does not model interaction
between entities and scene
• Does not model what is
important to say
• Natural language is much
richer
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
MODELING IMAGES
Dealing with uncertainty in “in-the-wild” YouTube video
Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell,
and Saenko. 2013. Youtube2text: Recognizing and describing arbitrary
activities using semantic hierarchies and zero-shot recognition. In IEEE
International Conference on Computer Vision (ICCV).
Template: A <SUBJECT> is <VERB>-ing a <OBJECT> .
Example: SUBJECT = person, VERB = ride, OBJECT = motorbike
→ “A person is riding a motorbike.”
A template model
OBJECT DETECTIONS
cow 0.11, person 0.42, table 0.07, aeroplane 0.05, dog 0.15, motorbike 0.51, train 0.17, car 0.29

SORTED OBJECT DETECTIONS
motorbike 0.51
person 0.42
car 0.29
aeroplane 0.05
…

VERB DETECTIONS
hold 0.23, drink 0.11, move 0.34, dance 0.05, slice 0.13, climb 0.17, shoot 0.07, ride 0.19

SORTED VERB DETECTIONS
move 0.34
hold 0.23
ride 0.19
dance 0.05
…
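The template step can be sketched in a few lines: pick the highest-scoring verb and object from the sorted detections and slot them into the sentence frame. This is only an illustrative sketch, not the paper's code; it greedily takes the single top verb (here "move"), whereas the full model searches over subject/verb/object combinations (e.g. preferring "ride" with "motorbike"), and the crude "-ing" inflection stands in for a real morphology step.

```python
def best(detections):
    # pick the label with the highest detector confidence
    return max(detections, key=detections.get)

# scores from the slide's sorted detection lists
objects = {"motorbike": 0.51, "person": 0.42, "car": 0.29, "aeroplane": 0.05}
verbs = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}

subject = "person"            # the subject slot is filled by a person here
verb, obj = best(verbs), best(objects)
# naive present-participle inflection (real systems use a morphology tool)
gerund = verb[:-1] + "ing" if verb.endswith("e") else verb + "ing"
sentence = f"A {subject} is {gerund} a {obj}."
print(sentence)   # -> "A person is moving a motorbike."
```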
Vision pipeline
Input video:
Output sentence: “Woman sharpens baby”
Problem: detection mistakes
SORTED VERB DETECTIONS
sharpen 0.34
cut 0.23
…

SORTED OBJECT DETECTIONS
woman 0.51
baby 0.42
…
Idea: trade off accuracy and specificity
Learn hierarchies from S, V, O co-occurrence: subjects and objects live under entity (person: woman, man, baby, …; tool: knife, …; animal, …), verbs under do (work: sharpen, clamp, …; play, …).
Most specific prediction: “Woman sharpens baby”
Our prediction: “Person working with a tool”
Humans: “Man clamps knife”, “Person clamps knife”, “Man working”
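The accuracy/specificity trade-off amounts to backing off up the hierarchy when the detector's confidence in the specific word is low. A toy sketch, where the hierarchy fragment, confidence values, threshold, and the way ancestors aggregate mass are all invented for illustration (the paper derives these from S, V, O co-occurrence statistics):

```python
# each word maps to its parent in a toy semantic hierarchy fragment
parent = {"baby": "person", "woman": "person", "man": "person",
          "person": "entity", "knife": "tool", "tool": "entity",
          "sharpen": "work", "clamp": "work", "work": "do"}

def back_off(word, confidence, threshold=0.5):
    """Return the most specific word we can trust: climb toward the
    root while the confidence stays below the threshold."""
    while confidence < threshold and word in parent:
        word = parent[word]
        confidence += 0.2   # toy rule: ancestors aggregate children's mass
    return word

print(back_off("baby", 0.15))   # -> "entity" (climbs past "person")
print(back_off("woman", 0.9))   # -> "woman" (confident, stays specific)
```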
Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)
Collected for paraphrase and machine translation: 2,089 YouTube videos with 122K multi-lingual descriptions.
Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
(a) Hollywood (8 actions), (b) TRECVID MED (6 actions), (c) YouTube (218 actions)
A train is rolling by. / A train passes by Mount Fuji. / A bullet train zooms through the countryside. / A train is coming down the tracks.
A man is sitting and playing a guitar / A man is playing guitar / Street artists play guitar. / A man is playing a guitar. / a lady is playing the guitar.
A woman is cooking onions. / Someone is cooking in a pan. / someone preparing something / a person coking. / racipe for katsu curry
A girl is ballet dancing. / A girl is dancing on a stage. / A girl is performing as a ballerina. / A woman dances.
We cluster words to obtain about 200 verbs and 300 nouns.
Video Collection Task
• Asked Amazon Mechanical Turk workers to submit video clips from YouTube
• Single, unambiguous action/event
• Short (4-10 seconds)
• Generally accessible
• No dialogue
• No words (subtitles, overlaid text, titles)
Generalization results on MSFT Youtube
Generalization results on MSFT Youtube
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
MODELING LANGUAGE
Modeling common sense
J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
Input video:
• Output sentence: “Woman sharpens baby”
• Common sense: babies cannot be sharpened
• Idea: learn common SVO statistics from very large text-only corpora
Problem: no common sense
Adding a linguistic prior using a Factor Graph Model (FGM)
Visual confidence values are observed (gray potentials) and inform sentence components.
Language potentials (dashed) connect latent words between sentence components.
Mining Text Corpora
Corpus | Size of text parsed
British National Corpus (BNC) | 1.5 GB
GigaWord | 26 GB
ukWaC | 5.5 GB
WaCkypedia_EN | 2.6 GB
GoogleNgrams | 10^12 words

Stanford dependency parses from the first 4 corpora are used to build the SVO language model.
The full language model used for surface realization is trained on GoogleNgrams using BerkeleyLM.
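Combining vision confidences with an SVO prior mined from text can be sketched as a weighted product of the two scores. The counts and weights below are toy values standing in for the parsed-corpus statistics, and the real system couples the S, V, and O choices through a factor graph rather than scoring whole triples independently:

```python
# toy SVO counts, as would be mined from dependency-parsed corpora
svo_counts = {("woman", "sharpen", "knife"): 120,
              ("woman", "sharpen", "baby"): 0,
              ("man", "cut", "onion"): 85}
total = sum(svo_counts.values())

vision = {("woman", "sharpen", "baby"): 0.40,   # vision prefers this triple
          ("woman", "sharpen", "knife"): 0.25}  # ...but language does not

def score(triple, alpha=0.5, eps=1e-6):
    """Geometric mixture of vision confidence and language prior;
    eps smooths unseen triples."""
    prior = (svo_counts.get(triple, 0) + eps) / total
    return (vision.get(triple, eps) ** alpha) * (prior ** (1 - alpha))

winner = max(vision, key=score)
print(winner)   # the language prior overrules vision
```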
Evaluation
● Subject, Verb, Object accuracy
o Compare to SVO extracted from ground truth sentences
o “A woman shredding chicken in a kitchen”
o Most Common SVO, or Any Valid SVO
Results of SVO prediction
HVC: highest vision confidence
FGM: factor graph model with language prior
Binary accuracy of “Most Common” SVO
Challenges
Object detection
• YouTube dataset has 900+ objects
• Most test objects do NOT appear in training
Need to model language
• Syntax, semantics, common sense
• Can a squirrel drive a car? An onion play guitar?
Sequence-to-sequence
• Both input AND output are sequences
• So far we have assumed video and sentence are both fixed length
• What are good features? Can we learn them?
INTRO TO NEURAL NETWORKS
Neurons in the Brain
A neuron in the brain receives signals on its “input wires” (dendrites) and sends its response down an “output wire” (axon).

Artificial Neuron
Input → Multiply by weights → Sum → Threshold → Output
Example: weights (+4, +2, 0, +3, −2), input (0, −2, +4, 0, +2): weighted sum = −8, thresholded output = 0.

Artificial Neuron: Activation
An input pattern that matches the weights instead gives weighted sum = +8, and the neuron activates (output +8).
Neurons learn patterns!
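The multiply-sum-threshold computation above can be written in a few lines of Python. A minimal sketch (not from the slides): the weights and the first input are the example values shown, and the second, activating input is an invented pattern that happens to align with the positive weights.

```python
def neuron(inputs, weights, threshold=0.0):
    """Weighted sum followed by a hard threshold: pass the sum
    through if it exceeds the threshold, otherwise output 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return s if s > threshold else 0

weights = [4, 2, 0, 3, -2]
print(neuron([0, -2, 4, 0, 2], weights))   # weighted sum -8 -> prints 0
print(neuron([1, 1, 0, 1, 0], weights))    # sum 4+2+0+3 = 9 -> prints 9
```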
Artificial Neuron: Pattern Classification
• Classify input into class 0 or 1
• Teach neuron to predict correct class label
Example inputs such as (+4, +2, 0, −3, −2), whose values decrease, are shown with their class labels.
Artificial Neuron: Learning
Input → Multiply by weights → Sum → Threshold
Example: for the input (+4, +2, 0, −3, −2) with target class 1, the current weights give weighted sum −4 and thresholded activation 0, so activation ≠ class and the weights are adjusted (by +1). After one adjustment the sum is 0 and the activation is still 0; after another, the sum is +2, the activation is 1, and activation = class.
Artificial Neural Network
Input → Weights → Output
Simplify: from a single neuron to a network with an Input Layer, a Hidden Layer, and an Output Layer.
Deep Network: many hidden layers!

Units: inputs x1 … x5, hidden activations a1, a2, a3, outputs h1, h2, h3.
input: x = (x1, …, x5)
weights: Θ(1) (3×5 matrix, input → hidden), Θ(2) (3×3 matrix, hidden → output)
hidden layer activations: a = g(Θ(1) x)
output: hΘ(x) = g(Θ(2) a)
sigmoid: g(z) = 1 / (1 + exp(−z)), rising from 0 through 0.5 (at z = 0) toward 1

hΘ(x) = estimated probability that class = 1 for input x
hΘ(x) = 0.2 → predict class = 0
hΘ(x) = 0.8 → predict class = 1
(Layers 1 through 4: input layer, two hidden layers, output layer)
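The forward pass a = g(Θ(1) x), hΘ(x) = g(Θ(2) a) can be coded directly. A sketch with made-up weights; only the shapes match the slide (Θ(1) is 3×5, Θ(2) is 3×3):

```python
import math

def g(z):
    # sigmoid: 1 / (1 + exp(-z)), in (0, 1), with g(0) = 0.5
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, vec):
    # multiply by one weight matrix, then apply the sigmoid elementwise
    return [g(sum(t * v for t, v in zip(row, vec))) for row in theta]

x = [1.0, 0.5, -0.5, 0.0, 2.0]                                   # 5 inputs
theta1 = [[0.1 * (i - j) for j in range(5)] for i in range(3)]   # 3x5
theta2 = [[0.2] * 3 for _ in range(3)]                           # 3x3
a = layer(theta1, x)   # hidden activations: 3 values in (0, 1)
h = layer(theta2, a)   # outputs: estimated probabilities of class = 1
```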
Network architectures
Convolutional: feedforward connectivity from input through hidden layers to output, with local (convolutional) connections.
Recurrent: the hidden state carries information forward in time (input → hidden → output at each time step).
Representing Images
Input Layer: reshape the image pixels into a vector.

Convolutional Neural Network
Instead of full connections, convolve the input with a small weight filter (w11 … w33) and threshold, e.g. the 3×3 edge filter
1 0 −1
1 0 −1
1 0 −1
Input Layer → Output Layer
slide by Abi-Roozgard
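Convolving an image with the 3×3 vertical-edge filter above can be sketched in plain Python, with no framework. The tiny "image" is made up; as in most deep-learning libraries, the sliding-window operation is really cross-correlation:

```python
def conv2d(img, kernel):
    """Valid-mode 2D convolution (cross-correlation): slide the
    kernel over the image and take weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[di][dj] * img[i + di][j + dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

edge_filter = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]
# a dark-to-bright vertical edge: the filter responds strongly everywhere
img = [[0, 0, 9, 9]] * 4
print(conv2d(img, edge_filter))   # -> [[-27, -27], [-27, -27]]
```

A flat region would instead give zero response, which is exactly why this filter acts as a vertical-edge detector.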
Convolutional Neural Network
LeNet
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Rich visual structure of features deep in hierarchy.
[R-CNN]
[Zeiler-Fergus]
Maximal activations of pool5 units
conv5 DeConv visualization
VIDEO-TO-TEXT NEURAL NETWORK
Deep Convolutional-Recurrent Network
Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015.
Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.
Deep Convolutional Neural Networks
• Learn robust high-level image representations
• achieve state-of-the-art results on image tasks
• But, do not handle sequences
• Idea: combine with Recurrent Neural Network
Recurrent and Convolutional architectures: a convolutional network processes each input, while a recurrent network carries a hidden state forward in time (input → hidden → output at each time step).
Contributions
● End-to-end deep video description
o deep image and language model
● Leverage still image caption data
Recurrent neural networks
• Distributed hidden state stores past information efficiently
• Non-linear dynamics
• With enough neurons and time, can compute any function
(input → hidden → output, unrolled over time; based on slide by Geoff Hinton)
Model
1. Extract deep features from each input frame (CNN)
2. Create a fixed-length vector representation of the video
3. Decode the vector to a sentence (RNN)

Background - LSTM Unit
● Easier to train than a plain RNN
● Impressive results for
○ speech recognition
○ handwriting recognition
○ translation
● Our model
o 2 layers of LSTM units (hidden state of first is input to second)
o Output: softmax probability distribution over vocabulary of words
Input Video → Convolutional Net → Recurrent Net → Output
A CNN is applied to each sampled frame; the frame features are mean-pooled into one vector, which the two-layer LSTM stack decodes word by word: “A boy is playing a ball”.
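The mean-pooling step, collapsing per-frame CNN features into one fixed-length video vector, is just an elementwise average; that is what makes the representation independent of the number of frames. A sketch with made-up feature values:

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one fixed-length
    video representation, whatever the number of frames."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# e.g. three frames, each described by a 4-dim CNN feature
frames = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 0.0, 2.0, 0.0],
          [2.0, 3.0, 2.0, 2.0]]
video_vec = mean_pool(frames)   # -> [2.0, 1.0, 2.0, 2.0]
```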
Evaluation
● Subject, Verb, Object accuracy
o SVO extracted from the generated sentence
o “A woman shredding chicken in a kitchen”
o Most Common or Any Valid
● BLEU
● METEOR
● Human Evaluation
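At its core, BLEU compares n-grams of the generated sentence against reference sentences, with a penalty for overly short outputs. A stripped-down unigram version (BLEU-1 with brevity penalty) as a sketch; the full metric combines several n-gram orders and multiple references, and the example sentences are illustrative:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram precision, clipped by reference counts,
    times a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

hyp = "a woman is shredding chicken"
ref = "a woman shredding chicken in a kitchen"
score = bleu1(hyp, ref)   # 4 of 5 unigrams match, scaled by the penalty
```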
Results - SVO (Binary, Most Common)
Results - Generation
Model | BLEU | METEOR
FGM | 13.68 | 23.9
LSTM-Flickr | 10.29 | 19.52
LSTM-COCO | 12.66 | 20.96
LSTM-YT | 31.19 | 26.87
LSTM-YT-Flickr | 32.03 | 27.87
LSTM-YT-COCO | 33.29 | 29.07
LSTM-YT-COCO+Flickr | 33.29 | 28.88
More fluent, but not enough training data in the YouTube dataset to train a good language model.

Idea: pre-learn on still image captions
Dataset | Train | Validation | Test
Flickr30k | ~28000 | 1000 | 1000
COCO2014 | 82783 | 40504 | -
YouTube | 1200 | 100 | 670
Pre-Train on Still Images (Flickr30k and COCO2014)
Input Image → Convolutional Net → Recurrent Net (stacked LSTMs) → Output: “A man is scaling a cliff”
Fine-Tune on Videos
Input Video → Convolutional Net (per-frame CNN features, mean-pooled) → Recurrent Net (stacked LSTMs) → Output: “A boy is playing a ball”
Results - SVO (Binary, Most Common)
See also Guadarrama et al. ICCV 2013
Results - Generation
Model | BLEU | METEOR
FGM | 13.68 | 23.9
LSTM-Flickr | 10.29 | 19.52
LSTM-COCO | 12.66 | 20.96
LSTM-YT | 31.19 | 26.87
LSTM-YT-Flickr | 32.03 | 27.87
LSTM-YT-COCO | 33.29 | 29.07
LSTM-YT-COCO+Flickr | 33.29 | 28.88
Results - Human Eval.
Model | Relevance | Grammar
FGM | 3.20 | 3.99
LSTM-YT | 2.88 | 3.84
LSTM-YT-COCO | 2.83 | 3.46
LSTM-YT-COCO+Flickr | - | 3.64
GroundTruth | 1.10 | 4.61
Generated image captions
http://arxiv.org/abs/1411.4389
Sequence-to-Sequence Video-to-Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko; arXiv 2015
Sequence to Sequence Video to Text
MSVD dataset (YouTube videos)
Results
Movie description datasets
Qualitative results
Figure 3. MSVD YouTube video dataset. We present examples where the S2VT model (RGB on VGG net) generates correct descriptions involving different objects and actions for several videos (in column a). The center column (b) shows examples where the model predicts relevant but incorrect descriptions. The last column (c) shows examples where the model generates descriptions that are irrelevant to the event in the video.
Figure 4. M-VAD Movie corpus: We show a representative frame from 6 contiguous clips from the movie “Big Mommas: Like Father, Like Son”. Soft Attention (GNet + 3D-Conv) are sentences from the model in [40]. S2VT (MPII + M-VAD) represents the sentences generated by our model trained on both the MPII and M-VAD datasets. DVS represents the original ground truth sentences in the dataset for each clip.
Results
Qualitative results on Hollywood Movies
Thanks
Subhashini Venugopalan (UT Austin), Huijuan Xu (UMass Lowell), Jeff Donahue (UC Berkeley), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Trevor Darrell (UC Berkeley)
References
[1] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[2] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV), 2013.
[3] Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL 2015.
[4] Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
[5] Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence-to-Sequence Video-to-Text. arXiv, 2015.
Sergio Guadarrama