Learning Visual Representations using Images with Captions
Ariadna Quattoni, Michael Collins, Trevor Darrell
Training visual classifiers when a few examples are available
Problem: Image classification from a few examples can be hard. A good representation of images is crucial.
Solution: Use available resources to learn a good image representation.
Semi-supervised learning

Available resource: a large dataset of unlabeled data.

Pipeline:
• Unsupervised learning on the unlabeled data yields a visual representation $F: I \rightarrow \mathbb{R}^h$.
• $F$ maps the small training set of labeled images to an h-dimensional training set.
• Train a classifier on the h-dimensional training set.
Learning visual representations: Using unlabeled data

Unsupervised learning in data space.

Good thing:
• A lower-dimensional representation preserves relevant statistics of the data sample.

Bad things:
• The representation might not capture the information relevant to an image classification problem.
• The appearance of a visual component may be heterogeneous and hard to discover.
Learning visual components
Want a representation based on visual components.
Visual features that are useful to group together for classification, e.g.:
• teams have: people, medals, …
• crowds have: people, pavement, …
Appearance of a visual component may be heterogeneous
Learning visual representations
Unlabeled data + metadata:
• Images with associated natural language captions.
• Video sequences with associated speech.

How could the metadata help? It provides a hint for discovering important features.

Feature selection paradigm:
• Use the metadata to define "auxiliary tasks".
• Discover feature groupings that are useful for these tasks.
Overview

Available resource: a large dataset of images and captions.

Pipeline:
• Create auxiliary problems from the captions.
• Perform structural learning to obtain a visual representation $F: I \rightarrow \mathbb{R}^h$.
• $F$ maps the small training set of labeled images to an h-dimensional training set.
• Train a classifier on the h-dimensional training set.
Learning visual representations: Learning from auxiliary tasks
[Ando & Zhang, JMLR 2005]

Classification problem, core task: learn $F: X \rightarrow Y$, where $X \subseteq \mathbb{R}^d$ and $Y = \{-1, +1\}$.

• Core training set: $D_{core} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
• Unlabeled set: $U = \{x_1, x_2, \ldots, x_u\}$
• A set of auxiliary problems related to the core task, and a method for generating a collection of auxiliary training sets $\{D_1, D_2, \ldots, D_m\}$ for the auxiliary problems.
Learning visual representations: Using images with captions

Core task: news topic prediction. Learn a function $F_j: X \rightarrow \{-1, +1\}$, where $x \in X$ is an image and $y = 1$ if $x \in topic_j$.
News domain
[Example images: figure skating, ice hockey, Golden Globes, Grammys.]

Dataset: news images from the Reuters web site. Problem: predicting news topics from images.
Learning visual representations: Using images with captions
The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.
Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.
Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine.
Senior Hamas leader Khaled Meshaal (2nd-R) is surrounded by his bodyguards after a news conference in Cairo February 8, 2006.
Jim Scherr, the US Olympic Committee's chief executive officer, seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet…
U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.
Auxiliary task: predict "team" from image content.
Learning visual representations: Using images with captions

Image-caption pairs: $U = \{(x_1, c_1), \ldots, (x_u, c_u)\}$

m content words: $(w_1, w_2, \ldots, w_m)$

m auxiliary problems: $p_j(x, c) = 1$ if word $w_j$ appears in caption $c$, and $-1$ otherwise.

Auxiliary training sets: $D_j = \{(x_1, p_j(x_1, c_1)), \ldots, (x_u, p_j(x_u, c_u))\}$
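This construction is easy to state in code. A minimal sketch, assuming `captions` holds the u caption strings and `content_words` the m content words; the helper name and whitespace tokenization are our assumptions, not the paper's:

```python
import numpy as np

def auxiliary_labels(captions, content_words):
    """labels[j, i] is p_j(x_i, c_i): +1 if word w_j appears in
    caption c_i, and -1 otherwise."""
    labels = -np.ones((len(content_words), len(captions)), dtype=int)
    for j, word in enumerate(content_words):
        for i, caption in enumerate(captions):
            if word.lower() in caption.lower().split():
                labels[j, i] = 1
    return labels  # labels[j] is the target vector for auxiliary task j
```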
Structural Learning: Train classifiers for auxiliary tasks

Step 1: Train m linear classifiers, setting the optimal parameters for the j-th auxiliary training set $D_j$ to be

$w_j^* = \arg\min_w \sum_i l(f(w, x_i), y_i) + \frac{C}{2}\|w\|^2$, with $(x_i, y_i) \in D_j$.

We use logistic loss in our experiments.
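Step 1 amounts to fitting one L2-regularized logistic regression per auxiliary problem. A sketch with scikit-learn (note that sklearn's `C` is the inverse of the regularization constant in the objective above, and that each task needs both labels present):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_auxiliary_classifiers(X, labels, C=1.0):
    """X: (u, d) image feature matrix; labels: (m, u) from the previous
    sketch. Returns W = [w_1*, ..., w_m*] as a (d, m) matrix."""
    weights = []
    for y_j in labels:                          # auxiliary problem D_j
        # Skip degenerate tasks: words in all or no captions.
        if len(np.unique(y_j)) < 2:
            weights.append(np.zeros(X.shape[1]))
            continue
        clf = LogisticRegression(C=C, fit_intercept=False)
        clf.fit(X, y_j)                         # logistic loss + L2
        weights.append(clf.coef_.ravel())       # w_j* in R^d
    return np.column_stack(weights)
```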
Structural Learning: Manifold learning in classifier space

Step 2: Perform SVD on the parameter-vector matrix $W = [w_1^*, w_2^*, \ldots, w_m^*]$.

Compute $\theta \in \mathbb{R}^{h \times d}$ by taking the first h eigenvectors of $W W^t$.

$\theta$ defines a linear subspace of dimension h: a good approximation to the space of weights.

Structural learning is manifold learning in classifier weight space.
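Step 2 in numpy: the first h eigenvectors of $W W^t$ are exactly the top h left singular vectors of $W$, so a single SVD suffices (a sketch under the same assumed shapes as above):

```python
import numpy as np

def structural_parameters(W, h):
    """W: (d, m) matrix of auxiliary weight vectors. Returns theta in
    R^{h x d}; its rows are the top-h left singular vectors of W,
    i.e. the first h eigenvectors of W W^t."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T
```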
Structural Learning: Training on the core task

Step 3: Train the core classifier.

Project the data: $q(x) = \theta x$ for $(x_i, y_i) \in D_{core}$, then solve

$v^* = \arg\min_v \sum_i l(f(v, q(x_i)), y_i) + \frac{C}{2}\|v\|^2$.

This is equivalent to training the core task in the original d-dimensional space with the parameter constraint $w_{core} = \theta^t v^*$.
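Step 3, sketched with the same tools: project the small core training set through $\theta$ and fit a linear classifier in the h-dimensional space; the learned $v^*$ can be pulled back to d dimensions via $w_{core} = \theta^t v^*$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_core_task(theta, X_core, y_core, C=1.0):
    Q = X_core @ theta.T                 # q(x) = theta x for each row
    clf = LogisticRegression(C=C, fit_intercept=False)
    clf.fit(Q, y_core)                   # logistic loss + L2, as in Step 1
    v_star = clf.coef_.ravel()           # v* in R^h
    w_core = theta.T @ v_star            # equivalent d-dimensional weights
    return clf, w_core
```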
Structural Learning: Minimizing a joint loss

Class of linear predictors: $f_\theta(v, x) = v^t \theta x$, where $\theta$ is an h-by-d matrix of structural parameters.

Goal: Find the problem-specific parameters $v_j$ and the shared parameters $\theta$ that minimize the joint loss

$L(\{v_j\}, \theta) = \sum_{j=1}^{m} \left[ \text{loss on } D_j + reg(v_j) + reg(\theta) \right]$.
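For concreteness, the joint objective can be written out directly. A numpy sketch with logistic loss (which the slides use) and squared-norm regularizers; the regularization weights and pulling $reg(\theta)$ out of the sum are our choices:

```python
import numpy as np

def joint_loss(theta, vs, datasets, lam_v=1.0, lam_theta=1.0):
    """vs: list of m vectors v_j in R^h; datasets: list of (X_j, y_j)
    pairs with y_j in {-1, +1}. Predictor: f(theta, v, x) = v' theta x."""
    total = lam_theta * np.sum(theta ** 2)            # shared-parameter reg
    for v_j, (X_j, y_j) in zip(vs, datasets):
        margins = y_j * (X_j @ (theta.T @ v_j))       # y * v' theta x
        total += np.sum(np.logaddexp(0.0, -margins))  # logistic loss on D_j
        total += lam_v * np.sum(v_j ** 2)             # problem-specific reg
    return total
```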
Toy Example

Object = { letter, letter, letter }

An object: AbC
The same object: Abc
The same object: ABc
The same object: abC
Toy example: Discovering visual components

[Illustration: the object "acJ" mapped to a binary feature vector indicating which letter appearances are present.]

10 visual components, 5 appearances each.

Task: recognize objects, e.g. the "ABC" object, the "ADE" object, the "BCF" object, and the "ABD" object.
Toy example: Discovering components

[Plots show 50 points corresponding to the appearances in the model: PCA in data space vs. structural learning (PCA on the weights of object classifiers).]
Experiments: Baseline image representation

Bag-of-words SIFT representation, the 'raw' representation: $g: I \rightarrow X$, $X = \mathbb{R}^d$, $g(I) = [\#c_1, \#c_2, \ldots, \#c_d]$, where $I$ is an image.

Preprocessing:
• Find interest points (SIFT detector).
• Map each point to a feature vector (SIFT descriptor).
• Do vector quantization, giving d clusters.

Compute $g(I)$:
• Find interest points in $I$.
• Map each point to its closest cluster.
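A sketch of this preprocessing using OpenCV's SIFT and scikit-learn's k-means; the specific libraries and the codebook size are our assumptions, since the slides only specify SIFT interest points, SIFT descriptors, and vector quantization into d clusters:

```python
import cv2                                # opencv-python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_descriptors, d=1000):
    """Vector-quantize the corpus's stacked SIFT descriptors into d clusters."""
    return KMeans(n_clusters=d, n_init=10).fit(all_descriptors)

def g(image_gray, codebook):
    """g(I) = [#c_1, ..., #c_d]: count of interest points per cluster."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image_gray, None)   # interest points
    counts = np.zeros(codebook.n_clusters)
    if desc is not None:
        for c in codebook.predict(desc.astype(np.float64)):
            counts[c] += 1
    return counts
```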
Experiments: Dataset

Reuters dataset: 10,576 news images, 130 topics. Task: predict the 15 most frequent topics.

Data partitions:
• 8,000 unlabeled images with captions but no topic labels.
• Labeled training sets $L_n$ of sizes n = 5, 10, 20, …, 320.
• 1,000 images as development data, labeled for a single topic.
• 1,756 images as testing data.
Experiments: Models

Three models, all logistic regression classifiers:
• Baseline model: uses the raw representation; regularization constant C optimized on the development set.
• PCA model: uses the PCA-on-unlabeled representation; C and h optimized on the development set of a single selected topic.
• Structural learning model: uses the structural learning representation; C and h optimized as for the PCA model.
Experiments: Results

[Plot: average equal error rate vs. number of training examples (5 to 320) on the Reuters dataset, 14 topics, comparing the Baseline, PCA, and Structural Learning models.]
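The reported metric, average equal error rate, is the point on each topic's ROC curve where the false positive and false negative rates coincide, averaged over topics. A sketch with scikit-learn (the function name is ours):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: operating point where the false positive rate equals
    the false negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2.0

# Average over topics:
# avg_eer = np.mean([equal_error_rate(y_t, s_t) for y_t, s_t in per_topic])
```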
Conclusion
Summary:
• We described a method for learning visual representations from images with captions.
• Captions define auxiliary tasks that induce a representation capturing visual components.
• The induced representation enables learning from fewer examples.

Future work:
• Exploring ways of determining the relatedness between core and auxiliary tasks.
• Performing structural learning from more weakly related auxiliary tasks.
Thanks!